Prompt Optimization and DSPy

The Prompt That Broke the Team

For six months, three engineers spent approximately 40% of their time on prompt engineering. Every time the product changed, the prompts had to change. Every time a new model was released, the prompts had to be re-tuned. When the team moved from GPT-4 to Claude, they had to rewrite everything from scratch - a two-week project.

The prompts had grown organically, layer by layer. Each new edge case added a new rule. Each complaint from users added a new instruction. The system prompt was now 3,000 tokens long, contradicted itself in two places, and nobody was entirely sure why some specific phrasing was there. Removing any piece might break something.

And for all that effort, the system reached only 72% accuracy on the test set. That was the number they had been optimizing toward. Hand-crafted, agonized over, 40% of three engineers' time - for 72%.

A researcher on the team read the DSPy paper. Over a weekend, they built a DSPy program equivalent to the hand-crafted pipeline. After compiling with 50 labeled examples, accuracy was 79%. With 200 examples: 84%.

The prompts DSPy generated were different from the hand-crafted ones in ways nobody expected. Some rules the team thought were critical weren't there at all. Other things - phrasing they'd never have thought of - made a significant difference. Manual prompt engineering was optimizing in a small neighborhood of the solution space. DSPy searched more broadly.

This is what automated prompt optimization enables: systematic search over the prompt space, guided by a metric, at a scale that humans can't match manually.

The Problem with Manual Prompt Engineering

Manual prompt engineering has fundamental limitations:

It's brittle: Prompts optimized for GPT-4 may not transfer to Claude. Prompts that work in March may not work in June when the model is updated.

It doesn't generalize: You're optimizing for the examples you've tested. Edge cases break.

It's non-compositional: When you have a 5-step pipeline, optimizing each prompt independently doesn't optimize the pipeline as a whole. Upstream changes affect downstream behavior.

It's human-bottlenecked: Your intuitions about what makes a good prompt may be wrong. What a model actually responds to is an empirical question, and without measurement you can only guess.

It's not measurable: "This prompt feels better" is not an engineering metric.
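To put a rough number on the non-compositional point, here is a small sketch of how per-step accuracy compounds across a pipeline. It assumes each step fails independently and that any failed step fails the whole run, which real pipelines only approximate; `pipeline_accuracy` is an illustrative helper, not part of any library.

```python
import math


def pipeline_accuracy(step_accuracies: list[float]) -> float:
    """End-to-end success rate if every step must succeed (independence assumed)."""
    return math.prod(step_accuracies)


# Five steps that each look fine in isolation
five_step = pipeline_accuracy([0.90] * 5)
print(f"Five steps at 90% each: {five_step:.1%} end to end")
```

Under these assumptions, five steps at 90% each succeed together only about 59% of the time, which is why the metric that matters is the one measured at the end of the pipeline.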

The Landscape of Prompt Optimization

APE: Automatic Prompt Engineer (Zhou et al., 2022)

Automatic Prompt Engineer (Zhou et al., "Large Language Models Are Human-Level Prompt Engineers," 2022) was one of the first systematic approaches to automated prompt optimization.

The algorithm:

1. Given a set of (input, output) demonstrations, use an LLM to generate N candidate instruction prompts.
2. Evaluate each candidate instruction on a held-out test set.
3. Select the best-performing instruction.
4. Optionally, iteratively refine by generating variations of the best instruction.

import anthropic

client = anthropic.Anthropic()

class APEOptimizer:
    """
    Simplified APE (Automatic Prompt Engineer) implementation.
    Generates candidate prompts and selects the best-performing one.
    """

    def __init__(self, demonstrations: list[dict], test_set: list[dict]):
        """
        Args:
            demonstrations: list of {"input": str, "output": str} pairs for generation
            test_set: list of {"input": str, "expected_output": str} for evaluation
        """
        self.demonstrations = demonstrations
        self.test_set = test_set

    def generate_candidate_prompts(self, n_candidates: int = 5) -> list[str]:
        """Use an LLM to generate candidate instruction prompts."""
        demo_text = "\n\n".join([
            f"Input: {d['input']}\nOutput: {d['output']}"
            for d in self.demonstrations[:10]  # Use up to 10 demonstrations
        ])

        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""I have a task where given an input, I need to produce an output.
Here are some examples:

{demo_text}

Generate {n_candidates} different instruction prompts that would cause a language model
to correctly transform inputs to outputs like in the examples above.

Each prompt should be on its own line, numbered 1-{n_candidates}.
Focus on the underlying pattern - what is this task really doing?
The prompt should be placed BEFORE the input when calling the model."""
            }]
        )

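        # Parse candidate prompts from the response: one candidate per line
        import re  # imported once here rather than on every loop iteration

        lines = message.content[0].text.strip().split('\n')
        prompts = []
        for line in lines:
            # Strip leading numbering such as "1." or "2)" and surrounding whitespace
            cleaned = re.sub(r'^\d+[.)]\s*', '', line).strip()
            # Skip blank lines and short fragments that are unlikely to be prompts
            if len(cleaned) > 20:
                prompts.append(cleaned)

        return prompts[:n_candidates]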

    def evaluate_prompt(self, prompt: str) -> float:
        """Evaluate a prompt on the test set. Returns accuracy."""
        correct = 0
        for item in self.test_set:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=100,
                temperature=0,
                messages=[{
                    "role": "user",
                    "content": f"{prompt}\n\nInput: {item['input']}"
                }]
            )
            output = message.content[0].text.strip().lower()
            expected = item['expected_output'].strip().lower()
            if output == expected or expected in output:
                correct += 1

        return correct / len(self.test_set) if self.test_set else 0.0

    def optimize(self, n_candidates: int = 5) -> tuple[str, float]:
        """Run APE optimization. Returns (best_prompt, best_score)."""
        print(f"Generating {n_candidates} candidate prompts...")
        candidates = self.generate_candidate_prompts(n_candidates)

        print(f"Evaluating {len(candidates)} candidates on test set...")
        scores = []
        for i, prompt in enumerate(candidates):
            score = self.evaluate_prompt(prompt)
            scores.append((score, prompt))
            print(f"  Prompt {i+1}: {score:.2%} - '{prompt[:60]}...'")

        best_score, best_prompt = max(scores, key=lambda x: x[0])
        return best_prompt, best_score


# Example usage: sentiment classification
demonstrations = [
    {"input": "This product is absolutely amazing!", "output": "positive"},
    {"input": "Terrible quality, broke in one day.", "output": "negative"},
    {"input": "It's okay, does what it says.", "output": "neutral"},
    {"input": "Best purchase I've made this year!", "output": "positive"},
    {"input": "Disappointed with the build quality.", "output": "negative"},
]

test_set = [
    {"input": "Love it!", "expected_output": "positive"},
    {"input": "Waste of money.", "expected_output": "negative"},
    {"input": "Nothing special.", "expected_output": "neutral"},
]

optimizer = APEOptimizer(demonstrations, test_set)
best_prompt, best_score = optimizer.optimize(n_candidates=5)
print(f"\nBest prompt: '{best_prompt}'")
print(f"Best accuracy: {best_score:.2%}")
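The class above stops after step 3 of the algorithm. The optional refinement step can be sketched as a loop that asks for variations of the current best instruction and keeps a variation only if it scores higher. The `propose_variations` and `score` callables are hypothetical stand-ins for an LLM paraphrasing call and for `evaluate_prompt`; they are injected so the loop can be exercised without network access.

```python
from typing import Callable


def refine_prompt(
    best_prompt: str,
    best_score: float,
    propose_variations: Callable[[str], list[str]],
    score: Callable[[str], float],
    n_rounds: int = 3,
) -> tuple[str, float]:
    """Greedy refinement: keep a variation only when it beats the incumbent."""
    for _ in range(n_rounds):
        improved = False
        for candidate in propose_variations(best_prompt):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
                improved = True
        if not improved:
            break  # no variation helped this round, so stop early
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only): variations append a hint,
# and the score rewards prompts that ask for a one-word answer.
def toy_variations(prompt: str) -> list[str]:
    return [prompt + " Answer in one word.", prompt + " Be thorough."]


def toy_score(prompt: str) -> float:
    return 0.9 if "one word" in prompt.lower() else 0.6


refined, refined_score = refine_prompt(
    "Classify the sentiment.", 0.6, toy_variations, toy_score
)
print(refined, refined_score)
```

In real use, `score` should be computed on examples that are kept separate from the final test set, since every refinement round is another chance to overfit to whatever data the score is computed on.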

OPRO: Optimization by Prompting (Yang et al., 2023)

OPRO (Yang et al., "Large Language Models as Optimizers," 2023, Google DeepMind) treats the LLM itself as an optimizer. Instead of using gradient descent, OPRO uses a "meta-prompt" that shows the LLM its previous prompt attempts along with their scores, and asks it to generate a better one.

class OPROOptimizer:
    """
    OPRO: Optimization by PROmpting.
    Uses an LLM to iteratively improve prompts by showing it previous attempts.
    """

    def __init__(self, task_description: str, test_set: list[dict]):
        self.task_description = task_description
        self.test_set = test_set
        self.history: list[tuple[str, float]] = []  # (prompt, score) pairs

    def generate_next_prompt(self) -> str:
        """Generate a new prompt based on optimization history."""
        if not self.history:
            # Start with a basic prompt
            return f"Complete the following task: {self.task_description}"

        # Build the meta-prompt showing history
        history_text = "\n".join([
            f"Prompt: {p}\nScore: {s:.2%}\n"
            for p, s in sorted(self.history, key=lambda x: x[1])[-10:]  # Top 10 by score, worst to best
        ])

        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""I'm trying to find the best instruction prompt for this task:
Task: {self.task_description}

Previous attempts (sorted by score, worst to best):
{history_text}

Based on what worked and what didn't, generate a new instruction prompt that should
score higher than all previous attempts. The prompt should:
- Be clear and specific
- Build on insights from the highest-scoring attempts
- Avoid patterns from the lowest-scoring attempts

Output only the new prompt, nothing else."""
            }]
        )
        return message.content[0].text.strip()

    def evaluate(self, prompt: str) -> float:
        """Evaluate a prompt on the test set."""
        correct = 0
        for item in self.test_set:
            message = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=50,
                temperature=0,
                messages=[{
                    "role": "user",
                    "content": f"{prompt}\n\nInput: {item['input']}"
                }]
            )
            output = message.content[0].text.strip().lower()
            if item['expected_output'].lower() in output:
                correct += 1
        # Guard against an empty test set, as APEOptimizer.evaluate_prompt does
        return correct / len(self.test_set) if self.test_set else 0.0

    def optimize(self, n_iterations: int = 10) -> tuple[str, float]:
        """Run OPRO for n_iterations. Returns best (prompt, score)."""
        best_prompt, best_score = "", 0.0

        for i in range(n_iterations):
            prompt = self.generate_next_prompt()
            score = self.evaluate(prompt)
            self.history.append((prompt, score))

            print(f"Iteration {i+1}: score={score:.2%}, prompt='{prompt[:60]}...'")

            if score > best_score:
                best_score = score
                best_prompt = prompt

        return best_prompt, best_score

DSPy: Programs Not Prompts

APE and OPRO optimize individual prompts. DSPy (Khattab et al., "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines," 2023, Stanford) takes a fundamentally different approach: it treats the entire LLM pipeline as a program with learnable parameters.

The key insight: in a multi-step LLM pipeline, manually optimizing each prompt in isolation is suboptimal. The best prompt for step 2 depends on what step 1 produces. DSPy optimizes the whole pipeline end-to-end.

Core DSPy Concepts

Signature: Declares the input/output interface of an LLM call.

import dspy

class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a product review."""
    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")

Module: A composable unit that wraps one or more LLM calls.

class ClassifyReview(dspy.Module):
    def __init__(self):
        super().__init__()  # let dspy.Module initialize its own state first
        self.classify = dspy.Predict(SentimentClassifier)

    def forward(self, review: str) -> str:
        return self.classify(review=review).sentiment

Optimizer (formerly called a teleprompter): Automatically searches for the instructions or few-shot examples for each module that score best on a metric you supply.

Compiling: Running the optimizer to produce an optimized version of your program.
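To make "compiling" concrete, here is a toy sketch in plain Python, deliberately independent of DSPy. It treats the few-shot demonstrations as the learnable parameter and searches for the subset that maximizes a metric on training data. The names (`compile_demos`, `toy_program`) are hypothetical, and exhaustive search over subsets is a simplification for illustration, not how DSPy's optimizers work internally.

```python
from itertools import combinations
from typing import Callable

Example = tuple[str, str]  # (input, expected_output)


def compile_demos(
    program: Callable[[list[Example], str], str],
    candidate_demos: list[Example],
    trainset: list[Example],
    n_demos: int = 2,
) -> tuple[list[Example], float]:
    """Return the demo subset with the highest exact-match accuracy on trainset."""
    best_demos: list[Example] = []
    best_score = -1.0
    for demos in combinations(candidate_demos, n_demos):
        hits = sum(program(list(demos), x) == y for x, y in trainset)
        score = hits / len(trainset)
        if score > best_score:
            best_demos, best_score = list(demos), score
    return best_demos, best_score


def toy_program(demos: list[Example], text: str) -> str:
    """Stand-in for an LLM call: copies the label of the demo sharing the most words."""
    words = set(text.lower().split())
    closest = max(demos, key=lambda d: len(words & set(d[0].lower().split())))
    return closest[1]


candidates = [
    ("great product", "positive"),
    ("awful product", "negative"),
    ("great value", "positive"),
]
train = [("great phone", "positive"), ("awful phone", "negative")]
demos, score = compile_demos(toy_program, candidates, train)
print(demos, score)
```

The point of the sketch is the shape of the loop: the program's code never changes, only the parameters attached to it, and the choice between parameter settings is made by the metric rather than by intuition.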

A Complete DSPy Example

import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate

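# Configure the LM backend
# DSPy supports Claude, OpenAI, and other providers via a "provider/model" string.
# Older DSPy releases used provider-specific clients such as dspy.Claude together
# with dspy.settings.configure; check the API of the version you have installed.
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key="your-api-key")
dspy.configure(lm=lm)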


# Step 1: Define Signatures
class ExtractClaims(dspy.Signature):
    """Extract factual claims from a news article."""
    article: str = dspy.InputField()
    claims: list[str] = dspy.OutputField(desc="List of distinct factual claims made in the article")


class VerifyClaim(dspy.Signature):
    """Verify whether a factual claim is supported by evidence."""
    claim: str = dspy.InputField()
    context: str = dspy.InputField(desc="Evidence or context to check against")
    is_supported: bool = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="Confidence score from 0 to 1")
    reasoning: str = dspy.OutputField(desc="Brief explanation of verdict")


class GenerateSummary(dspy.Signature):
    """Summarize fact-checking results for an article."""
    article: str = dspy.InputField()
    verified_claims: list[dict] = dspy.InputField()
    summary: str = dspy.OutputField(desc="Executive summary of fact-check results")
    overall_accuracy: str = dspy.OutputField(desc="mostly accurate / mixed / mostly inaccurate")


# Step 2: Build the Pipeline as a DSPy Module
class FactCheckPipeline(dspy.Module):
    """Multi-step fact-checking pipeline using DSPy."""

    def __init__(self):
        super().__init__()
        # Each of these can be individually optimized by DSPy
        self.extract_claims = dspy.Predict(ExtractClaims)
        self.verify_claim = dspy.ChainOfThought(VerifyClaim)
        self.summarize = dspy.Predict(GenerateSummary)

    def forward(self, article: str, evidence_docs: list[str]) -> dict:
        # Step 1: Extract claims from the article
        extraction = self.extract_claims(article=article)
        claims = extraction.claims

        # Step 2: Verify each claim against evidence
        context = "\n\n".join(evidence_docs)
        verified = []
        for claim in claims[:5]:  # Limit for cost
            verification = self.verify_claim(claim=claim, context=context)
            verified.append({
                "claim": claim,
                "is_supported": verification.is_supported,
                "confidence": verification.confidence,
                "reasoning": verification.reasoning,
            })

        # Step 3: Generate overall summary
        summary_result = self.summarize(
            article=article,
            verified_claims=verified
        )

        return {
            "claims": verified,
            "summary": summary_result.summary,
            "overall_accuracy": summary_result.overall_accuracy,
        }


# Step 3: Define Training Data
trainset = [
    dspy.Example(
        article="The COVID vaccine is 95% effective according to Pfizer's trial data.",
        evidence_docs=["Pfizer's Phase 3 trial showed 95% efficacy against symptomatic COVID-19."],
        expected_accuracy="mostly accurate"
    ).with_inputs("article", "evidence_docs"),
    # ... more examples
]

# Step 4: Define Evaluation Metric
def accuracy_metric(example, prediction, trace=None) -> bool:
    """Check if the overall accuracy verdict matches expected."""
    return example.expected_accuracy in prediction.get("overall_accuracy", "").lower()


# Step 5: Compile (Optimize)
optimizer = BootstrapFewShot(
    metric=accuracy_metric,
    max_bootstrapped_demos=4,    # Max few-shot examples to add
    max_labeled_demos=8,          # Max labeled examples to use
    max_rounds=2,                 # Optimization rounds
)

# The key step: compile optimizes prompts and few-shot examples
# for the entire pipeline end-to-end
pipeline = FactCheckPipeline()
optimized_pipeline = optimizer.compile(
    student=pipeline,
    trainset=trainset,
)

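# Step 6: Evaluate
# trainset is reused here only for brevity; in practice pass a held-out devset,
# otherwise the score overstates real-world quality.
evaluator = Evaluate(devset=trainset, metric=accuracy_metric, num_threads=4)
score = evaluator(optimized_pipeline)
# What Evaluate returns depends on the DSPy version (a percentage or a result
# object), so print it as-is instead of formatting it as a 0-1 fraction.
print(f"Optimized pipeline score: {score}")

# Step 7: Save and load
optimized_pipeline.save("optimized_fact_checker.json")
# load() restores state in place, so create the instance first:
# loaded = FactCheckPipeline()
# loaded.load("optimized_fact_checker.json")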
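The metric in Step 4 ignores its `trace` argument. A common DSPy convention is to return a graded score during evaluation and a strict pass/fail while demonstrations are being bootstrapped, which is when `trace` is not `None`. The sketch below follows that convention; the 0.5 partial-credit value and the name `graded_accuracy_metric` are assumptions chosen for illustration, and `SimpleNamespace` stands in for a `dspy.Example` so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def graded_accuracy_metric(example, prediction, trace=None):
    """Full credit for an exact verdict, partial credit if it is merely mentioned."""
    predicted = str(prediction.get("overall_accuracy", "")).strip().lower()
    expected = example.expected_accuracy.strip().lower()

    if predicted == expected:
        score = 1.0
    elif expected in predicted:
        score = 0.5  # right verdict buried in extra text (assumed weighting)
    else:
        score = 0.0

    if trace is not None:
        # Bootstrapping: accept only clean outputs as demonstrations
        return score == 1.0
    return score


example = SimpleNamespace(expected_accuracy="mostly accurate")
clean = {"overall_accuracy": "Mostly accurate"}
noisy = {"overall_accuracy": "Verdict: mostly accurate."}

print(graded_accuracy_metric(example, clean))            # graded score
print(graded_accuracy_metric(example, noisy))            # partial credit
print(graded_accuracy_metric(example, noisy, trace=[]))  # strict pass/fail
```

The design choice is that demonstrations copied into prompts should be held to a higher standard than outputs that are merely being scored, because a sloppy demonstration teaches the model to be sloppy.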

DSPy Optimizers Compared

Evaluation-Driven Prompt Engineering

DSPy formalizes a practice that every serious prompt engineer should follow regardless of whether they use DSPy: evaluation-driven development.

The workflow:

import anthropic
from dataclasses import dataclass
from typing import Callable

client = anthropic.Anthropic()

@dataclass
class EvalResult:
    prompt: str
    score: float
    n_correct: int
    n_total: int
    failures: list[dict]


def evaluate_prompt(
    prompt_template: str,
    eval_set: list[dict],
    metric: Callable[[str, str], bool],
    model: str = "claude-sonnet-4-6",
    temperature: float = 0,
) -> EvalResult:
    """
    Evaluate a prompt template against a labeled eval set.

    Args:
        prompt_template: Prompt with {input} placeholder
        eval_set: list of {"input": str, "expected": str}
        metric: Function(predicted, expected) -> bool
        model: Claude model to use
        temperature: Sampling temperature

    Returns:
        EvalResult with score and failure analysis
    """
    correct = 0
    failures = []

    for item in eval_set:
        prompt = prompt_template.format(input=item["input"])

        message = client.messages.create(
            model=model,
            max_tokens=200,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        )

        predicted = message.content[0].text.strip()
        expected = item["expected"]

        if metric(predicted, expected):
            correct += 1
        else:
            failures.append({
                "input": item["input"],
                "expected": expected,
                "predicted": predicted,
            })

    # Avoid a ZeroDivisionError when the eval set is empty
    score = correct / len(eval_set) if eval_set else 0.0
    return EvalResult(
        prompt=prompt_template,
        score=score,
        n_correct=correct,
        n_total=len(eval_set),
        failures=failures
    )


def analyze_failures(result: EvalResult) -> dict:
    """Analyze failure patterns to guide prompt improvement."""
    if not result.failures:
        return {"status": "no failures"}

    # Ask the LLM to analyze the failure patterns
    failures_text = "\n".join([
        f"Input: {f['input']}\nExpected: {f['expected']}\nGot: {f['predicted']}"
        for f in result.failures[:20]  # Analyze up to 20 failures
    ])

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Analyze these prediction failures and identify patterns:

{failures_text}

Identify:
1. The main error categories (2-4 categories max)
2. What's causing each category of error
3. Specific prompt changes that would fix each error type

Be concise and actionable."""
        }]
    )

    return {
        "failure_count": len(result.failures),
        "failure_rate": 1 - result.score,
        "analysis": message.content[0].text.strip(),
        "sample_failures": result.failures[:5],
    }


# Example: iterative prompt improvement
# Labels use the three classes the prompt should produce: positive, negative, mixed
eval_set = [
    {"input": "Revenue declined 12% YoY but margins improved", "expected": "mixed"},
    {"input": "All KPIs exceeded targets this quarter", "expected": "positive"},
    {"input": "Flat growth in core business, strong performance in new segments", "expected": "mixed"},
    # ... more examples
]

def exact_match(predicted: str, expected: str) -> bool:
    # Lenient despite the name: passes when the expected label appears
    # anywhere in the model output, ignoring case.
    return expected.lower() in predicted.lower()

# Iteration 1
prompt_v1 = "Describe the financial performance in this statement: {input}"
result_v1 = evaluate_prompt(prompt_v1, eval_set, exact_match)
print(f"v1 score: {result_v1.score:.2%}")

# Analyze failures and improve
analysis = analyze_failures(result_v1)
print(f"Failure analysis:\n{analysis['analysis']}")

# Iteration 2: improved based on analysis
prompt_v2 = """Analyze this financial statement and classify performance as:
- positive: overall good results
- negative: overall poor results
- mixed: some good, some bad aspects

Statement: {input}

Respond with ONLY the classification: positive, negative, or mixed."""

result_v2 = evaluate_prompt(prompt_v2, eval_set, exact_match)
print(f"v2 score: {result_v2.score:.2%}")
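Note that the `exact_match` metric used above is a substring check, so an output such as "not positive" would still count as "positive". A stricter alternative, sketched here under the assumption that the task has a fixed label set, normalizes the output and requires it to be exactly one allowed label. `strict_label_match` and `LABELS` are illustrative names.

```python
import string

LABELS = {"positive", "negative", "mixed"}  # assumed label set for this task


def strict_label_match(predicted: str, expected: str) -> bool:
    """True only if the output, minus case and edge punctuation, is the expected label."""
    normalized = predicted.strip().strip(string.punctuation).strip().lower()
    return normalized in LABELS and normalized == expected.strip().lower()


print(strict_label_match("Positive.", "positive"))
print(strict_label_match("not positive", "positive"))
```

A strict metric pairs naturally with a prompt that asks for only the label, as `prompt_v2` does; with a free-form prompt like `prompt_v1` it would score nearly everything as wrong, which is itself useful information about the prompt.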

Prompt Versioning and A/B Testing

import anthropic
import hashlib
import json
from datetime import datetime

client = anthropic.Anthropic()

class PromptRegistry:
    """Version-controlled prompt registry with A/B testing support."""

    def __init__(self):
        self.prompts: dict[str, dict] = {}

    def register(self, name: str, template: str, description: str = "") -> str:
        """Register a prompt version and return its hash."""
        prompt_hash = hashlib.md5(template.encode()).hexdigest()[:8]
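        # Count existing versions by exact name so that prompts whose names merely
        # share a prefix (e.g. "sentiment" and "sentiment_short") are not conflated
        version_key = f"{name}_v{sum(1 for v in self.prompts.values() if v['name'] == name) + 1}"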

        self.prompts[version_key] = {
            "name": name,
            "template": template,
            "hash": prompt_hash,
            "description": description,
            "registered_at": datetime.utcnow().isoformat(),
            "eval_scores": {}
        }

        return version_key

    def record_eval_score(self, version_key: str, dataset: str, score: float):
        """Record evaluation results for a prompt version."""
        if version_key in self.prompts:
            self.prompts[version_key]["eval_scores"][dataset] = {
                "score": score,
                "evaluated_at": datetime.utcnow().isoformat()
            }

    def get_best_version(self, name: str, dataset: str) -> str | None:
        """Get the best-performing version for a given dataset."""
        candidates = {
            k: v for k, v in self.prompts.items()
            if v["name"] == name and dataset in v["eval_scores"]
        }
        if not candidates:
            return None
        return max(candidates, key=lambda k: candidates[k]["eval_scores"][dataset]["score"])

    def ab_test(
        self,
        version_a: str,
        version_b: str,
        test_input: str,
        traffic_split: float = 0.5
    ) -> tuple[str, str]:
        """Route traffic between two prompt versions. Returns (version_used, output)."""
        import random
        version = version_a if random.random() < traffic_split else version_b
        template = self.prompts[version]["template"]
        prompt = template.format(input=test_input)

        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            temperature=0,
            messages=[{"role": "user", "content": prompt}]
        )

        return version, message.content[0].text.strip()


# Usage
registry = PromptRegistry()

v1 = registry.register(
    "sentiment",
    "What is the sentiment? {input}",
    "Initial naive version"
)

v2 = registry.register(
    "sentiment",
    "Classify sentiment as positive/negative/neutral. Return only the word.\n\nText: {input}",
    "Improved with format constraint"
)

registry.record_eval_score(v1, "product_reviews", 0.71)
registry.record_eval_score(v2, "product_reviews", 0.84)

best = registry.get_best_version("sentiment", "product_reviews")
print(f"Best version: {best}")

# A/B test the two versions
for i in range(5):
    version_used, output = registry.ab_test(v1, v2, "This product is fantastic!")
    print(f"Version: {version_used}, Output: {output}")
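Routing traffic is only half of an A/B test; the other half is deciding whether the observed difference is real. One conventional option is a two-proportion z-test, sketched below with the standard library. The sample sizes in the usage example are assumptions (the registry above stores scores but not counts), and `two_proportion_z_test` is an illustrative helper.

```python
import math


def two_proportion_z_test(
    correct_a: int, n_a: int, correct_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both versions all-correct or all-wrong: no evidence of a difference
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Assumed counts: 71/100 correct for v1 and 84/100 for v2
z, p = two_proportion_z_test(71, 100, 84, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With 100 samples per version, this gap clears the usual 0.05 threshold; the same gap measured on a much smaller sample may not, which is a reason to record counts alongside scores.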

When to Use DSPy vs. Manual Prompting

| Situation | Recommendation |
| --- | --- |
| Prototyping, early exploration | Manual prompting - faster iteration |
| Single-step task, stable requirements | Manual prompting with eval set |
| Multi-step pipeline | DSPy - end-to-end optimization is key |
| Frequent model changes | DSPy - re-compile once per model |
| Fewer than 20 labeled examples | APE or OPRO - DSPy needs more data |
| 50+ labeled examples | DSPy - BootstrapFewShot |
| 200+ labeled examples | DSPy - MIPRO or BootstrapFewShotWithRandomSearch (BFRS) for best results |
| Production, high-stakes task | DSPy + eval-driven development |

Production Engineering Notes

1. Treat Your Eval Set as Critical Infrastructure

# Your eval set is as important as your test suite
# Version control it alongside your code

eval_set = [
    {"input": "...", "expected": "...", "category": "edge_case_ambiguous"},
    {"input": "...", "expected": "...", "category": "standard"},
    # ...
]

# Never remove examples from the eval set
# Only add - otherwise you lose regression detection
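Since each example carries a `category` field, one natural use is to break accuracy down per category so that a regression on edge cases is not hidden by the overall average. A minimal sketch, assuming per-example results shaped as dicts with `category` and `correct` keys (a hypothetical format, not produced by the code above):

```python
from collections import defaultdict


def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """Compute accuracy separately for each category label."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    return {cat: hits[cat] / totals[cat] for cat in totals}


results = [
    {"category": "standard", "correct": True},
    {"category": "standard", "correct": True},
    {"category": "edge_case_ambiguous", "correct": False},
    {"category": "edge_case_ambiguous", "correct": True},
]
print(accuracy_by_category(results))
```

Here the overall accuracy is 75%, but the breakdown shows that all of the failures sit in one category.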

2. Monitor Prompt Performance Over Time

Models drift. Even if you don't change your prompt, model updates can change performance:

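from datetime import datetime, timezone
from typing import Callable

def track_prompt_performance(
    prompt: str,
    eval_set: list[dict],
    metric: Callable,
    run_id: str
) -> dict:
    # evaluate_prompt is the helper defined in the evaluation-driven section above
    result = evaluate_prompt(prompt, eval_set, metric)
    return {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "score": result.score,
        "n_total": result.n_total,
    }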

# Run weekly or after any model update
# Alert if score drops more than 5 percentage points

3. Keep Prompt Compilation Offline

DSPy compilation involves many LLM calls: each bootstrap round runs the pipeline on training examples, and every module call in every run is a separate request. Run compilation offline, outside the serving path:

# Run this as a CI/CD job, not in your serving path
optimized = optimizer.compile(student=pipeline, trainset=trainset)
optimized.save("optimized_pipeline_v3.json")

# Serving: just load
serving_pipeline = FactCheckPipeline()
serving_pipeline.load("optimized_pipeline_v3.json")

Common Mistakes

:::danger Mistake 1: No Eval Set
You cannot do prompt engineering without a labeled eval set. "It looks good" is not a metric. Before writing a single prompt, define your success metric and collect 50-200 labeled examples.
:::

:::danger Mistake 2: Optimizing Against Your Eval Set
If you manually inspect failures and tune the prompt to fix them, then evaluate on the same set - you're overfitting. Hold out 20% of examples as a true test set, and tune only on the remaining 80%.
:::

:::warning Mistake 3: Using DSPy for Simple Tasks
DSPy adds complexity and requires a labeled dataset to compile. For a single-step task with stable requirements, a well-crafted manual prompt with an eval set is often better. Use DSPy when the pipeline complexity justifies it.
:::

:::warning Mistake 4: Not Monitoring After Deployment
Model providers update their models. Your prompt's performance can change without you changing anything. Set up periodic eval runs and alert on score degradation.
:::
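The 80/20 hold-out from Mistake 2 can be sketched as a seeded shuffle-and-split, so that the same examples land in the test set on every run. The function name, the default seed, and the placeholder data are assumptions.

```python
import random


def split_eval_set(
    examples: list[dict], test_fraction: float = 0.2, seed: int = 13
) -> tuple[list[dict], list[dict]]:
    """Shuffle deterministically, then split into (dev, test)."""
    shuffled = examples[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data standing in for real labeled examples
examples = [{"input": f"example {i}", "expected": "..."} for i in range(50)]
dev_set, test_set = split_eval_set(examples)
print(len(dev_set), len(test_set))
```

One caveat with this approach: appending new examples changes the shuffle, so items can move between dev and test. Assigning each example by a hash of its input is one way to keep assignments stable as the set grows.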

Interview Q&A

Q1: What is the core problem that automated prompt optimization solves?

Manual prompt engineering is a narrow, human-guided search in a vast space. Humans are limited by their intuitions about what makes a good prompt - which are often wrong. Manual optimization is brittle (changes in model, task, or data break hand-crafted prompts), non-compositional (optimizing each prompt in a pipeline independently doesn't optimize the pipeline globally), and doesn't scale with pipeline complexity. Automated prompt optimization - APE, OPRO, DSPy - replaces intuition-driven search with systematic, metric-driven optimization, searching more broadly and finding prompts that humans wouldn't have written.

Q2: How does APE work and what are its limitations?

APE (Automatic Prompt Engineer, Zhou et al. 2022) uses an LLM to generate candidate instruction prompts from demonstrations, evaluates each on a test set, and selects the best. It can optionally iterate by generating variations of the top candidates. Limitations: (1) It optimizes a single instruction, not the whole pipeline; (2) It requires a test set for evaluation; (3) It doesn't optimize few-shot examples - only the instruction; (4) The quality of generated candidates depends on the quality of demonstrations provided; (5) It's computationally expensive - evaluating N candidates requires N × |test_set| LLM calls.

Q3: What is DSPy's key innovation over APE and manual prompting?

DSPy treats the entire LLM pipeline as a program where prompts and few-shot examples are learnable parameters, not fixed strings. The key innovations: (1) Signatures: declarative input/output contracts that separate task specification from implementation; (2) Modules: composable building blocks (Predict, ChainOfThought, ReAct) that can be individually optimized; (3) Compiling: end-to-end optimization of the whole pipeline - not just individual prompts - guided by a metric evaluated on labeled data; (4) Portability: the program itself stays unchanged across models; when the model changes, you re-run compilation instead of rewriting prompts by hand. This enables systematic improvement of complex pipelines without manual prompt tuning.

Q4: What is DSPy's BootstrapFewShot optimizer and how does it work?

BootstrapFewShot builds few-shot demonstrations for each module in your DSPy program. It runs a teacher copy of the program (by default the unoptimized program itself) on training examples, keeps the runs whose final output passes your metric, and uses each module's recorded inputs and outputs from those runs as demonstrations in that module's prompt. This is "bootstrapping" - using successful pipeline runs to generate examples. Unlike manual few-shot selection, it: filters on the end-to-end outcome (a demonstration is kept only if the whole pipeline succeeded, not just one module), is data-driven (demonstrations come from actual runs rather than hand-written guesses), and needs no labels for intermediate steps, so it scales with more training data.

Q5: How do you set up evaluation-driven prompt engineering for a production task?

The workflow: (1) Define a clear success metric - binary (correct/incorrect), score-based, or LLM-judged; (2) Collect 50-200 labeled examples - split 80/20 into development and test sets; (3) Write an initial baseline prompt; (4) Evaluate on the development set; (5) Analyze failures - categorize error types to understand what's wrong; (6) Improve the prompt based on failure analysis; (7) Repeat until dev set score exceeds threshold; (8) Evaluate on test set to confirm no overfitting; (9) Deploy with ongoing monitoring - run eval weekly and alert on regressions. Never evaluate on the same examples you used to debug the prompt.

Q6: When would you use OPRO instead of DSPy for prompt optimization?

OPRO (Yang et al. 2023) is better when: you have a single-step task (not a pipeline), you have fewer than 50 labeled examples (DSPy works better with more data), you want a simple optimization loop without the full DSPy framework, or you want to understand the optimization process at a mechanistic level. OPRO's key advantage: it's simple - you just show the model its previous attempts and scores and ask it to improve. DSPy is better when: you have a multi-step pipeline, you want to optimize few-shot examples AND instructions jointly (MIPRO), you have enough labeled data (50+), or you want the robustness that comes with formal module composition and compilation.

