
Prompt Design Fundamentals

The Configuration Panel That Nobody Read

In the early days of deploying GPT-3 internally, a financial services firm ran an experiment. They gave the same task to two groups of engineers: write a prompt that extracts key financial metrics from earnings call transcripts. After two weeks, they collected 23 different prompts. They ran all 23 against a standardized test set of 50 transcripts.

The best prompt achieved 94% accuracy. The worst achieved 31%. The task was identical. The data was identical. The model was identical. The only variable was how the engineers wrote their instructions.

When they analyzed the difference, the pattern was stark. The best-performing prompts shared four characteristics: they specified the exact output format (JSON with named fields), they gave context before the task ("You are analyzing earnings calls for institutional investors…"), they told the model what NOT to include (no hedging language, no analysis beyond what's stated), and they included a single concrete example of a correct extraction. The worst-performing prompts were one or two sentences of vague instruction.

Prompts are executable specifications for non-deterministic systems. A specification that works 31% of the time isn't a specification - it's a suggestion. The fundamentals of prompt design are about turning suggestions into reliable, production-grade specifications.

Why This Exists: The Ambiguity Problem

When you write code, the machine executes exactly what you specify. x = 5 + 3 produces 8. Every time. When you write a prompt, the model interprets your natural language and generates a probabilistic response. The same 12-word prompt can produce outputs that look wildly different depending on subtle word choices, framing, and context.

The core problem is that natural language is deeply ambiguous. Consider: "Summarize this document." What length? What format? For what audience? What level of technical detail? From what perspective? None of this is specified. The model fills in these gaps from its training distribution - which may or may not match your intent.
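To make the gap concrete, here is a minimal sketch of a prompt that answers each of those open questions instead of leaving them to the model. The helper name `build_summary_prompt` and its parameters are illustrative assumptions, not part of any library.

```python
def build_summary_prompt(
    document: str,
    audience: str,
    max_sentences: int,
    output_format: str,
    technical_level: str,
) -> str:
    """Hypothetical helper: pin down what "Summarize this document" leaves open."""
    return (
        f"Summarize the document below for {audience}.\n"
        f"Length: at most {max_sentences} sentences.\n"
        f"Format: {output_format}.\n"
        f"Technical detail: {technical_level}.\n\n"
        f"Document:\n{document}"
    )


vague_prompt = "Summarize this document."
specific_prompt = build_summary_prompt(
    document="Q3 revenue rose 12% year over year...",  # placeholder text
    audience="institutional investors",
    max_sentences=3,
    output_format="plain prose, no bullet points",
    technical_level="assume familiarity with financial terms",
)
print(specific_prompt)
```

Every argument corresponds to one of the questions the vague version leaves unanswered, so the model has less to fill in from its training distribution.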

The history of LLM deployment is a history of teams discovering this gap between intent and instruction. Early users of the OpenAI API reported that responses were inconsistent, hallucinated, or weirdly formatted. The models were capable - the instructions just weren't precise enough. Prompt design fundamentals address this systematically: principles that convert ambiguous natural language instructions into precise, reliable specifications that the model can execute consistently.

Historical Context

Early generative models (pre-2020) were primarily fine-tuned on specific tasks - you didn't prompt them, you fine-tuned them. GPT-3 (Brown et al., 2020) changed everything by demonstrating that a single large language model could perform diverse tasks through in-context learning - instructions in the prompt rather than weights from fine-tuning.

This was liberating but also challenging: the same capability that made GPT-3 flexible made it unpredictable. The "prompt engineering" discipline emerged organically as researchers and practitioners discovered that prompt phrasing dramatically affected output quality. Papers like "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022) and "Large Language Models Are Zero-Shot Reasoners" (Kojima et al., 2022) codified specific techniques. But the fundamentals - basic principles of clarity, specificity, and structure - were discovered through practice long before they were written up academically.

By 2023, "prompt engineering" had become standard terminology. By 2024, it was a recognized engineering discipline with associated tooling, testing frameworks, and production practices. The fundamentals covered here are the stable foundation beneath all of this - principles that apply regardless of which model you use or which advanced technique you layer on top.

Principle 1: Specify the Output Format Explicitly

The single highest-leverage change you can make to any prompt is specifying the exact output format. If you need JSON, say so - and show the schema. If you need a bulleted list, say so. If you need a three-paragraph essay with specific sections, say so. The model will default to whatever format seems most natural for the task, which may not be what you need.

Why it matters: Downstream code that parses model outputs is brittle. If you're extracting entities and the model sometimes returns JSON and sometimes returns prose, your parser breaks. Explicit format specification makes output parsing reliable.

import anthropic
import json

client = anthropic.Anthropic()

# ❌ BAD: No format specification - you don't know what you'll get
def extract_financial_metrics_bad(transcript: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Extract the key financial metrics from this earnings call transcript:\n\n{transcript}"
        }]
    )
    return response.content[0].text
    # Returns: "The company reported revenue of $2.3B, up 12%..."
    # Free-form prose with no stable structure - the format can change on every call


# ✅ GOOD: Explicit JSON schema with field names and types
def extract_financial_metrics_good(transcript: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract key financial metrics from this earnings call transcript.

Return a JSON object with exactly these fields:
{{
"revenue_usd_millions": number or null,
"revenue_growth_yoy_pct": number or null,
"gross_margin_pct": number or null,
"operating_income_usd_millions": number or null,
"eps_diluted": number or null,
"guidance_revenue_next_quarter_usd_millions": number or null,
"guidance_provided": boolean
}}

Rules:
- Use null for any metric not mentioned in the transcript
- All monetary values in USD millions
- Growth percentages in percentage points, not fractions (12% = 12.0, not 0.12)
- Do not include fields not in this schema
- Return only the JSON object, no explanation

Transcript:
{transcript}"""
        }]
    )

    raw = response.content[0].text.strip()
    # Strip markdown code blocks if present
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return json.loads(raw.strip())
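The fence-stripping logic above could be factored into one helper so every extraction function handles replies the same way. The sketch below is one possible version; the name `parse_json_response` is an assumption for illustration, and it raises a `ValueError` when the reply is not JSON so callers can retry or log the failure.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to keep the source readable
_FENCED = re.compile(
    rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$",
    flags=re.DOTALL | re.IGNORECASE,
)


def parse_json_response(raw: str):
    """Hypothetical helper: parse JSON from a model reply, tolerating a markdown fence."""
    text = raw.strip()
    fenced = _FENCED.match(text)
    if fenced:
        # Keep only the payload between the opening and closing fence.
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model reply was not valid JSON: {text[:80]!r}") from exc
```

Failing loudly matters here: a prose reply that slips through silently is harder to debug than an exception at the parsing boundary.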

Format Specification Patterns by Use Case

import anthropic
import json

client = anthropic.Anthropic()

# Pattern 1: Classification - constrained vocabulary
def classify_sentiment(text: str) -> str:
    """Tight output constraint - one of four exact values."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": f"""Classify the sentiment of this text.

Output exactly one of: POSITIVE, NEGATIVE, NEUTRAL, MIXED
Do not include any other text. Do not include punctuation.

Text: {text}
Classification:"""
        }]
    )
    result = response.content[0].text.strip().upper()
    # Raise instead of assert: assertions are skipped when Python runs with -O
    if result not in {"POSITIVE", "NEGATIVE", "NEUTRAL", "MIXED"}:
        raise ValueError(f"Unexpected: {result}")
    return result


# Pattern 2: Ranked list with structured fields
def analyze_risks(scenario: str) -> list[dict]:
    """Return a structured list with typed fields for each item."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=800,
        messages=[{
            "role": "user",
            "content": f"""Identify the top 5 risks in this scenario.

Return a JSON array. Each item must have exactly these fields:
{{
"risk_name": "5 words max",
"description": "one sentence, specific",
"severity": "LOW|MEDIUM|HIGH|CRITICAL",
"likelihood": "LOW|MEDIUM|HIGH",
"mitigation": "one concrete action step"
}}

Sort by severity (CRITICAL first).
Return only the JSON array, no explanation.

Scenario: {scenario}"""
        }]
    )
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return json.loads(raw.strip())


# Pattern 3: Multi-section document with strict headers
def generate_technical_summary(paper_abstract: str) -> str:
    """Use exact headers to enforce structure."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=800,
        messages=[{
            "role": "user",
            "content": f"""Summarize this research paper abstract for a technical audience.

Use EXACTLY this structure with these exact headers:

## Core Contribution
[1-2 sentences: what new thing does this paper contribute?]

## Method
[2-3 sentences: how does their approach work at a high level?]

## Key Results
[Bullet list: 3-5 specific results with numbers where available]
-
-

## Limitations
[1-2 sentences: what are the main limitations or caveats?]

## Practitioner Takeaway
[1 sentence: what should an ML engineer take away from this?]

Abstract:
{paper_abstract}"""
        }]
    )
    return response.content[0].text


# Pattern 4: Table output
def compare_options(options: list[str], criteria: list[str]) -> str:
    """Force table output for comparison tasks."""
    options_text = "\n".join(f"- {o}" for o in options)
    criteria_text = "\n".join(f"- {c}" for c in criteria)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Compare these options across all criteria.

Options:
{options_text}

Criteria to compare across:
{criteria_text}

Format as a markdown table with options as columns and criteria as rows.
Use ✅ = Good, ⚠️ = Average, ❌ = Poor for each cell.
Add a "Winner" row at the bottom identifying the best option for each criterion.
After the table, add a 1-sentence recommendation."""
        }]
    )
    return response.content[0].text
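A format specification is only half of the contract; the other half is checking that the reply honoured it. As a hedged sketch, the validator below checks a parsed reply from the risk-list prompt (Pattern 2) against the rules that prompt states. The function name `validate_risks` and the wording of the problem messages are assumptions for illustration.

```python
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]  # most to least severe
LIKELIHOOD_VALUES = {"LOW", "MEDIUM", "HIGH"}
REQUIRED_FIELDS = {"risk_name", "description", "severity", "likelihood", "mitigation"}


def validate_risks(risks: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the reply matched the format."""
    problems = []
    for i, risk in enumerate(risks):
        if set(risk) != REQUIRED_FIELDS:
            problems.append(f"item {i}: fields do not match the schema")
            continue
        if risk["severity"] not in SEVERITY_ORDER:
            problems.append(f"item {i}: unknown severity {risk['severity']!r}")
        if risk["likelihood"] not in LIKELIHOOD_VALUES:
            problems.append(f"item {i}: unknown likelihood {risk['likelihood']!r}")
        if len(str(risk["risk_name"]).split()) > 5:
            problems.append(f"item {i}: risk_name is longer than 5 words")
    # The prompt asks for CRITICAL first, so severity ranks must be non-decreasing.
    ranks = [
        SEVERITY_ORDER.index(r["severity"])
        for r in risks
        if r.get("severity") in SEVERITY_ORDER
    ]
    if ranks != sorted(ranks):
        problems.append("items are not sorted by severity (CRITICAL first)")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every deviation, or to feed them back to the model in a retry prompt.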

Principle 2: Context Before Task

Tell the model who it is, what it knows, and what situation it's in BEFORE telling it what to do. Models process left-to-right - context at the start of the prompt shapes how all subsequent instructions are interpreted.

Why it matters: The same task instruction produces different outputs depending on the context that precedes it. "Answer this question" means something very different to "You are a 1st-grade teacher" versus "You are a quantum physicist."

import anthropic

client = anthropic.Anthropic()

# ❌ BAD: Task first, context trailing and weak
def explain_concept_bad(concept: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"Explain {concept}. Keep it simple for a non-technical audience."
        }]
    )
    return response.content[0].text
    # "Non-technical" is vague. Model picks its own interpretation of "simple."


# ✅ GOOD: Full context block before the task
def explain_concept_good(concept: str, audience_type: str) -> str:

    AUDIENCE_CONTEXTS = {
        "elementary": """You are a primary school teacher who specializes in making science concepts
accessible to 8-10 year olds. Your style:
- Vocabulary: only words a 10-year-old knows
- Analogies: toys, food, sports, animals, everyday home objects
- Sentence length: short (under 15 words each)
- Structure: hook → explanation → "so what?"
- No jargon. If you must use a technical word, define it immediately.""",

        "undergraduate": """You are a computer science professor teaching undergraduates.
Your style:
- Assume basic programming knowledge (loops, functions, data structures)
- Use technical terms but define new ones on first use
- Include a short pseudocode or code example when it helps
- Mathematical notation is fine when helpful
- Expect the reader to re-read if needed""",

        "expert": """You are writing a technical reference document for domain experts.
Your style:
- Use precise technical terminology without definition (experts know it)
- Prioritize accuracy over accessibility
- Include nuance, edge cases, and caveats
- Citations or references to foundational papers are appropriate
- Assume the reader has a PhD or equivalent practical depth""",
    }

    # Fall back to the undergraduate profile for unknown audience types
    system_context = AUDIENCE_CONTEXTS.get(audience_type, AUDIENCE_CONTEXTS["undergraduate"])

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=600,
        system=system_context,
        messages=[{
            "role": "user",
            "content": f"Explain: {concept}"
        }]
    )
    return response.content[0].text


# Pattern: Full context block structure - the "6-part context header"
def build_context_header(
    role: str,
    domain_knowledge: str,
    current_situation: str,
    task: str,
    output_requirements: str,
    constraints: str,
) -> str:
    """
    The 6-part context header ensures the model has everything it needs
    before it sees the actual task instruction.
    """
    return f"""## Role
{role}

## Domain Knowledge
{domain_knowledge}

## Current Situation
{current_situation}

## Task
{task}

## Output Requirements
{output_requirements}

## Constraints
{constraints}"""


# Real-world example: customer support agent context
support_system = build_context_header(
role="""You are a tier-2 technical support specialist for Acme Cloud Platform.
You have 5+ years of experience with distributed systems, API integrations, and enterprise customer issues.
You are empathetic, precise, and solutions-oriented.""",

domain_knowledge="""Acme Platform: cloud-based API management and integration platform.
Current version: 3.2.1 (released Feb 2026).
Known bugs in 3.2.0: webhook delivery delays under high load (FIXED in 3.2.1).
Common issues: rate limiting (429 errors), token expiry (401), webhook misconfiguration.
Escalation path: tier-2 → engineering (for reproduction bugs) → customer success (for contract/billing).""",

current_situation="A customer has submitted a support ticket and needs help.",

task="""Read the customer's message carefully.
Identify the exact issue they're experiencing.
Provide a step-by-step resolution or clear escalation path.""",

output_requirements="""Structure your response as:
**Issue Identified**: [what you think the problem is]
**Likely Cause**: [why this is happening, be specific]
**Resolution Steps**: [numbered, specific, actionable steps]
**If That Doesn't Work**: [escalation or alternative]""",

constraints="""- Never guess at root causes without stating your uncertainty
- Don't promise timelines you can't guarantee
- If the issue requires engineering access, say so and open a ticket
- Don't share internal Jira ticket IDs or engineering plans
- Don't discuss competitors"""
)

Principle 3: Task Decomposition - Explicit Steps for Complex Tasks

When a task requires multiple distinct cognitive operations (analyze, then classify, then generate), spell out each step explicitly. Don't ask the model to do everything in one leap.

Why it matters: LLMs process tokens left-to-right. Each token is generated based on what came before. For multi-step tasks, spelling out the steps forces the model to generate intermediate reasoning that each subsequent step can attend to. Without explicit steps, the model jumps to a conclusion that may skip critical intermediate reasoning.

import anthropic
import json

client = anthropic.Anthropic()


# Pattern 1: Inline step-by-step (single call, explicit reasoning trace)
def analyze_business_case(business_data: str, question: str) -> str:
    """Force the model to show each reasoning step inline."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=800,
        messages=[{
            "role": "user",
            "content": f"""Analyze this business scenario to answer the question.
Follow these steps in order, labeling each:

Step 1 - Identify the key facts: List only the facts explicitly stated in the data.
Step 2 - Identify what's missing: What information would we need that we don't have?
Step 3 - Identify assumptions: What assumptions must we make to proceed?
Step 4 - Analyze: Apply the facts and assumptions to reason through the question.
Step 5 - Conclude: State your answer clearly. Rate your confidence: HIGH/MEDIUM/LOW and explain why.

Business data: {business_data}
Question: {question}"""
        }]
    )
    return response.content[0].text


# Pattern 2: Sequential calls (chained - each step's output feeds the next)
def write_marketing_copy_staged(product: str, target_audience: str, channel: str) -> dict:
    """Multi-call pipeline: research → angle → copy → polish."""

    # Stage 1: Research the audience
    audience_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"""Describe the target audience for marketing purposes.

Product: {product}
Target audience: {target_audience}
Channel: {channel}

Provide:
1. Primary pain points (3-5 bullet points)
2. Primary desires/aspirations (3-5 bullet points)
3. Language they use to describe their problems (actual phrases, 5-8 examples)
4. What makes them skeptical of solutions like this product
5. What would make them trust a solution"""
        }]
    )
    audience_profile = audience_response.content[0].text

    # Stage 2: Identify the best angle
    angle_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Based on this audience profile, identify the most compelling marketing angle.

Product: {product}
Channel: {channel}
Audience profile:
{audience_profile}

Propose 3 distinct angles:
1. [Angle name]: [one sentence summary] - [why this would resonate]
2. [Angle name]: [one sentence summary] - [why this would resonate]
3. [Angle name]: [one sentence summary] - [why this would resonate]

Then recommend one and explain your reasoning in 2-3 sentences."""
        }]
    )
    angle = angle_response.content[0].text

    # Stage 3: Write the copy
    copy_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Write marketing copy for this product.

Product: {product}
Channel: {channel}
Audience profile: {audience_profile[:500]}...
Chosen angle: {angle}

Write complete {channel} copy. Match the length and format appropriate for {channel}.
Use the language this audience actually uses. Be specific, not generic."""
        }]
    )
    copy = copy_response.content[0].text

    # Stage 4: Polish and critique
    final_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheaper model for polishing
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Polish and strengthen this marketing copy.

Original copy:
{copy}

Product: {product}
Target audience: {target_audience}

Critique: What's weak, generic, or unconvincing? What's strong?
Then rewrite with those improvements.
Format:
CRITIQUE: [2-3 sentences]
POLISHED COPY:
[rewritten copy]"""
        }]
    )
    final = final_response.content[0].text

    return {
        "audience_profile": audience_profile,
        "angle_selection": angle,
        "initial_copy": copy,
        "final_copy": final,
    }
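The staged pipeline above repeats one shape four times: fill a template, call the model, store the result under a name. A minimal sketch of that shape as a reusable loop follows. `run_chain` and the `call_model` parameter are assumptions introduced for illustration; passing the model call in as a function lets the chain be exercised with a stub before spending API calls.

```python
from typing import Callable


def run_chain(
    stages: list[tuple[str, str]],
    inputs: dict[str, str],
    call_model: Callable[[str], str],
) -> dict[str, str]:
    """Run (name, template) stages in order; each output is visible to later stages."""
    results = dict(inputs)
    for name, template in stages:
        # Templates use str.format placeholders such as {product} or {profile}.
        prompt = template.format(**results)
        results[name] = call_model(prompt)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call, used only to show the data flow."""
    return "REPLY TO: " + prompt


stages = [
    ("profile", "Describe buyers of {product}."),
    ("angle", "Pick an angle for {product} given: {profile}"),
]
results = run_chain(stages, {"product": "a standing desk"}, fake_model)
print(results["angle"])
```

Because templates go through `str.format`, literal braces in a template (for example a JSON schema) would need to be doubled, as in the f-strings used elsewhere on this page.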

Principle 4: Negative Constraints - Tell It What NOT to Do

Positive instructions tell the model what to do. Negative constraints tell it what to avoid. Both are necessary. Without explicit prohibitions, the model falls back on its default habits, and many of those defaults are not what you want.

Why it matters: Models have strong learned patterns from training. If you ask for a product recommendation without constraints, you'll get hedging language, caveats, "it depends on your needs," and suggestions to consult professionals - because that's what helpful responses look like in the training data. Negative constraints override these defaults.

import anthropic

client = anthropic.Anthropic()

# Pattern: Comprehensive DO / DO NOT structure
def generate_customer_email(
    situation: str,
    customer_name: str,
    product: str,
    resolution: str,
) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Write a customer service email for this situation.

Situation: {situation}
Customer name: {customer_name}
Product: {product}
Resolution offered: {resolution}

DO:
- Be warm, empathetic, and professional
- Clearly state what happened and what we're doing about it
- Include a specific next step the customer should take
- Acknowledge how this affected them before explaining what we're doing
- End with a genuine expression of appreciation for their patience

DO NOT:
- Use corporate jargon ("per our conversation", "as per", "going forward", "circle back")
- Use hollow apologies without substance ("We apologize for any inconvenience caused")
- Use exclamation points (they read as insincere in serious situations)
- Offer explanations that implicitly shift blame to the customer
- Use passive voice to avoid accountability ("mistakes were made", "an error occurred")
- Promise specific timelines you can't guarantee
- Start with "I hope this email finds you well" or any filler opener
- Use the word "unfortunately" - it signals bad news before the customer is ready

Write the email now:
Subject line: [write a clear, specific subject]

Body:
[write the full email]"""
        }]
    )
    return response.content[0].text


# Pattern: Reusable constraint sets
CONSTRAINT_SETS = {
"technical_writing": [
"Do not use passive voice - always use active voice",
"Do not start sentences with 'The' more than once per paragraph",
"Do not use adverbs ending in -ly (quickly, easily, simply) - show through specifics instead",
"Do not hedge with 'might', 'could potentially', 'in some cases' unless genuinely uncertain",
"Do not repeat the same point in different words - say it once, clearly",
"Do not use 'very', 'quite', or 'rather' as qualifiers",
],
"customer_facing": [
"Do not use acronyms without defining them on first use",
"Do not mention competitor products by name",
"Do not discuss internal processes, team names, or organizational structure",
"Do not make promises about future features or product timelines",
"Do not use technical jargon the customer hasn't used themselves",
"Do not be dismissive of the customer's pain even if the issue is minor",
],
"financial_analysis": [
"Do not state projections as facts - always qualify with probability language",
"Do not omit uncertainty or risk from any forward-looking statement",
"Do not compare different time periods without explicitly noting the basis of comparison",
"Do not use round numbers without acknowledging they are approximations",
"Do not attribute causality where only correlation is shown",
],
"code_review": [
"Do not nitpick style over substance - focus on correctness, security, and performance",
"Do not suggest refactors unless they fix a real problem",
"Do not leave comments without explaining why something is a problem",
"Do not flag issues without proposing a concrete alternative",
],
}


def apply_constraint_sets(task: str, constraint_set_names: list[str]) -> str:
    all_constraints = []
    for name in constraint_set_names:
        # Unknown set names are skipped silently
        if name in CONSTRAINT_SETS:
            all_constraints.extend(CONSTRAINT_SETS[name])

    constraint_text = "\n".join(f"- {c}" for c in all_constraints)
    return f"""{task}

CONSTRAINTS (violating any of these is a failure):
{constraint_text}"""
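Stating a constraint does not guarantee the model follows it, so some prohibitions are worth checking mechanically after generation. The sketch below covers only the constraints that reduce to string matching (banned phrases and exclamation points from the email example); `find_violations` and `BANNED_PHRASES` are hypothetical names, and constraints about tone or blame would still need human or model review.

```python
import re

# Phrases taken from the DO NOT list in the customer email prompt above
BANNED_PHRASES = (
    "per our conversation",
    "as per",
    "going forward",
    "circle back",
    "unfortunately",
    "I hope this email finds you well",
)


def find_violations(
    text: str,
    banned_phrases: tuple[str, ...] = BANNED_PHRASES,
    allow_exclamation: bool = False,
) -> list[str]:
    """Return the mechanically checkable constraints that the text violates."""
    violations = []
    lowered = text.lower()
    for phrase in banned_phrases:
        # Word boundaries avoid matching a phrase inside a longer word.
        if re.search(rf"\b{re.escape(phrase.lower())}\b", lowered):
            violations.append(f"banned phrase: {phrase}")
    if not allow_exclamation and "!" in text:
        violations.append("exclamation point")
    return violations


print(find_violations("Unfortunately, we will circle back!"))
```

A non-empty result can trigger a retry with the violations appended to the prompt, or simply be logged to measure how often each constraint is ignored.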

Principle 5: The Escape Hatch - Handle the Unknown Gracefully

Always give the model an explicit path for when it doesn't know something. Without an escape hatch, models tend to hallucinate rather than admit uncertainty. With one, they can acknowledge gaps honestly.

Why it matters: LLMs have a trained disposition to produce complete, confident-sounding answers. If there's no explicit option to say "I don't know" or return null, the model fills in the gap - sometimes confidently and incorrectly. An escape hatch makes uncertainty visible in the output instead of hidden inside a confident-sounding hallucination.

import anthropic
import json

client = anthropic.Anthropic()

# ❌ BAD: No escape hatch - model invents when uncertain
def extract_contract_dates_bad(contract_text: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Extract: start date, end date, renewal date, termination notice period from this contract:\n{contract_text}"
        }]
    )
    return response.content[0].text
    # If renewal date isn't in the contract, model may invent a plausible date


# ✅ GOOD: Explicit null escape hatch + confidence scoring
def extract_contract_dates_good(contract_text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"""Extract dates from this contract.

Return JSON with exactly these fields:
{{
"start_date": "YYYY-MM-DD or null if not explicitly stated",
"end_date": "YYYY-MM-DD or null if not explicitly stated",
"renewal_date": "YYYY-MM-DD or null if not explicitly stated",
"termination_notice_days": integer or null if not stated,
"auto_renews": boolean or null if not stated,
"confidence": "HIGH|MEDIUM|LOW",
"notes": "any ambiguity or important caveats"
}}

CRITICAL RULES:
- Use null for any field not explicitly stated in the contract
- Do NOT calculate or infer dates that aren't written in the text
- If dates are ambiguous (e.g., "one year from signing" without a signing date), use null and explain in notes
- If the document is not a contract, return {{"error": "not_a_contract"}}

Contract:
{contract_text}"""
        }]
    )
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return json.loads(raw.strip())


# Pattern: Multi-level confidence with explicit uncertainty expression
def answer_technical_question(question: str, context: str = "") -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Answer this technical question.

Question: {question}
{f"Context: {context}" if context else ""}

Return JSON:
{{
"answer": "your answer",
"confidence": "CERTAIN|HIGH|MEDIUM|LOW|UNCERTAIN",
"reasoning": "how you arrived at this answer",
"assumptions": ["list of assumptions you made, if any"],
"what_i_dont_know": "what information would change this answer",
"sources": ["any specific knowledge bases you drew from"]
}}

If you genuinely don't know the answer, set confidence to UNCERTAIN and explain what you do know and what you don't."""
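        }]
    )
    # Strip a markdown code fence if present, as in the extraction functions above
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return json.loads(raw.strip())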
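An escape hatch only helps if downstream code treats null as a first-class result. As a sketch under that assumption, the function below sorts the date fields from the contract schema above into parsed, missing, and malformed; `check_contract_dates` is a hypothetical name, not part of the original pipeline.

```python
from datetime import date

DATE_FIELDS = ("start_date", "end_date", "renewal_date")


def check_contract_dates(extracted: dict) -> dict:
    """Split the date fields into parsed values, honest nulls, and malformed strings."""
    report = {"parsed": {}, "missing": [], "malformed": []}
    if "error" in extracted:
        # The prompt asks for {"error": "not_a_contract"} on non-contract input.
        report["error"] = extracted["error"]
        return report
    for field in DATE_FIELDS:
        value = extracted.get(field)
        if value is None:
            report["missing"].append(field)  # the escape hatch was used
            continue
        try:
            report["parsed"][field] = date.fromisoformat(value)
        except (TypeError, ValueError):
            report["malformed"].append(field)  # not a YYYY-MM-DD string
    return report


print(check_contract_dates(
    {"start_date": "2025-01-01", "end_date": None, "renewal_date": "next spring"}
))
```

Keeping "missing" separate from "malformed" distinguishes a model that correctly declined to answer from one that ignored the format, which are different failures with different fixes.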

Principle 6: Structural Markers - XML Tags for Complex Prompts

For prompts with multiple distinct sections (instructions, context, examples, input data), use XML tags to create clear structural boundaries. Models are trained extensively on HTML and XML, making them highly sensitive to these markers.

Why it matters: In long prompts, the model can lose track of what's instruction vs. data vs. example. XML tags create unambiguous structure that the model reliably recognizes and respects. This is especially critical for RAG prompts (separating retrieved documents from instructions) and few-shot prompts (separating examples from actual input).

import anthropic
import json

client = anthropic.Anthropic()

# Pattern: Full XML-structured prompt
def analyze_customer_feedback(feedback: str, product_context: str) -> dict:
    """
    XML tags separate: instructions, product context, examples, schema, and actual input.
    The model reliably treats each section according to its tag.
    """
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""<instructions>
Analyze customer feedback about our product. Classify the sentiment, identify the main topic, extract actionable issues, and assess urgency. Return a JSON object matching the provided schema.
</instructions>

<product_context>
{product_context}
</product_context>

<examples>
<example>
<input>The export feature constantly crashes my browser when I have more than 200 rows. I've lost work 3 times this week.</input>
<output>{{
"sentiment": "NEGATIVE",
"topic": "export_functionality",
"urgency": "HIGH",
"bug_report": true,
"actionable_issues": ["Browser crash on export with 200+ rows", "Risk of data loss"],
"requires_engineering": true
}}</output>
</example>
<example>
<input>Love the new dashboard layout! Much cleaner than before. The dark mode is a nice touch too.</input>
<output>{{
"sentiment": "POSITIVE",
"topic": "ui_design",
"urgency": "LOW",
"bug_report": false,
"actionable_issues": [],
"requires_engineering": false
}}</output>
</example>
<example>
<input>The API documentation says the rate limit is 100 req/min but I'm getting 429 errors at 60 req/min. Is the docs wrong or is there a different limit for my plan?</input>
<output>{{
"sentiment": "NEGATIVE",
"topic": "api_rate_limiting",
"urgency": "MEDIUM",
"bug_report": true,
"actionable_issues": ["Rate limit discrepancy between docs and actual behavior", "Customer blocked from using their full allocation"],
"requires_engineering": true
}}</output>
</example>
</examples>

<output_schema>
{{
"sentiment": "POSITIVE|NEGATIVE|NEUTRAL|MIXED",
"topic": "export|import|ui_design|performance|billing|authentication|api|documentation|other",
"urgency": "LOW|MEDIUM|HIGH|CRITICAL",
"bug_report": boolean,
"actionable_issues": list of strings (empty list if none),
"requires_engineering": boolean
}}
</output_schema>

<customer_feedback>
{feedback}
</customer_feedback>

Return only the JSON object. No explanation."""
        }]
    )
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return json.loads(raw.strip())


# Pattern: XML tags in RAG prompts - cleanest possible separation
def answer_with_rag_context(
    question: str,
    retrieved_docs: list[dict],
    conversation_history: list[dict],
) -> str:
    """
    XML tags create absolute separation between:
    - The instruction layer (what to do)
    - Retrieved documents (evidence to use)
    - Conversation history (context)
    - The current question (what to answer)
    """

    docs_xml = "\n".join([
        f"""<document id="{i+1}" source="{doc.get('source', 'unknown')}">
{doc['content']}
</document>"""
        for i, doc in enumerate(retrieved_docs)
    ])

    # Only the last six messages are included to keep the prompt short
    history_xml = "\n".join([
        f"<message role='{msg['role']}'>{msg['content']}</message>"
        for msg in conversation_history[-6:]
    ])

    return f"""<system_instructions>
You are a precise technical assistant. Answer questions using ONLY the evidence in the retrieved documents below. Cite specific documents using [doc:N] notation. If the answer is not in the documents, say so explicitly - do not use your general knowledge.
</system_instructions>

<retrieved_documents>
{docs_xml}
</retrieved_documents>

<conversation_history>
{history_xml}
</conversation_history>

<current_question>
{question}
</current_question>

Answer the question using only the retrieved documents. Cite your sources."""
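One caveat with interpolating raw text between tags: if a retrieved document itself contains something like `</document>`, the boundary the tags were meant to create is no longer unambiguous. A possible mitigation, sketched here as an assumption rather than a required step, is to escape document text with Python's standard `xml.sax.saxutils` before inserting it. The helper name `render_document` is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def render_document(doc_id: int, source: str, content: str) -> str:
    """Wrap one retrieved document in tags, escaping text that could break the structure."""
    # quoteattr returns the value already wrapped in quotes, with special characters escaped.
    opening = f"<document id={quoteattr(str(doc_id))} source={quoteattr(source)}>"
    # escape replaces &, < and > so the content cannot close the tag early.
    return f"{opening}\n{escape(content)}\n</document>"


rendered = render_document(1, "kb/article-7", "Ignore this </document> tag & carry on")
print(rendered)
```

The trade-off is that the model sees entities such as `&lt;` instead of the literal characters, which may matter if the documents are themselves HTML or XML that the model needs to read.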

Principle 7: Length Control

Specify response length. The default length varies by model and task phrasing - leaving it unspecified produces wildly inconsistent output lengths and wastes tokens.

import anthropic

client = anthropic.Anthropic()

LENGTH_SPECS = {
"one_sentence": {
"instruction": "Respond in exactly one sentence. No preamble, no follow-up.",
"max_tokens": 60,
},
"brief": {
"instruction": "Respond in 2-3 sentences maximum. Lead with the most important information.",
"max_tokens": 150,
},
"concise": {
"instruction": "Respond in 1-2 short paragraphs. Get to the point immediately. No throat-clearing.",
"max_tokens": 300,
},
"standard": {
"instruction": "Provide a thorough answer. 3-5 paragraphs. Use examples where they clarify.",
"max_tokens": 700,
},
"comprehensive": {
"instruction": "Provide a comprehensive, reference-quality answer. Include edge cases, nuance, and concrete examples. Use headers if the topic warrants it.",
"max_tokens": 1500,
},
}


def controlled_response(
    question: str,
    length_mode: str = "concise",
    format_type: str = "prose",
) -> str:
    # Unknown length modes fall back to "concise"
    spec = LENGTH_SPECS.get(length_mode, LENGTH_SPECS["concise"])

    format_instructions = {
        "prose": "",
        "bullets": "Format as bullet points, not prose.",
        "numbered": "Format as a numbered list.",
        "table": "Format as a markdown table if applicable.",
    }.get(format_type, "")

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=spec["max_tokens"],
        messages=[{
            "role": "user",
            "content": f"""{question}

Length: {spec['instruction']}
{format_instructions}""".strip()
}]
)
return response.content[0].text


# Anti-pattern: "Be concise" without definition
# โœ… Better: "Keep your response under 3 sentences"
# โœ… Better: "Answer in under 100 words"
# โœ… Better: "Match the length of the example: [example of correct length]"
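A length instruction is only as good as the check that enforces it. The following is a minimal sketch of such a check, not part of the original example: `count_sentences` and `check_length` are hypothetical helpers, the sentence splitter is deliberately naive (abbreviations such as "e.g." will be miscounted), and the limits are illustrative.

```python
import re
from typing import List, Optional


def count_sentences(text: str) -> int:
    """Naive sentence count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def check_length(
    text: str,
    max_words: Optional[int] = None,
    max_sentences: Optional[int] = None,
) -> List[str]:
    """Return a list of length violations; an empty list means the response fits."""
    violations = []

    word_count = len(text.split())
    if max_words is not None and word_count > max_words:
        violations.append(f"{word_count} words exceeds limit of {max_words}")

    sentence_count = count_sentences(text)
    if max_sentences is not None and sentence_count > max_sentences:
        violations.append(f"{sentence_count} sentences exceeds limit of {max_sentences}")

    return violations
```

Running a check like this over a sample of real outputs shows whether the instruction is actually being followed. With the Anthropic client, it is also worth inspecting `response.stop_reason`: a value of `"max_tokens"` means the token cap cut the output short rather than the model finishing on its own.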

Principle 8: Examples Beat Instructions

When in doubt, show rather than tell. A single well-chosen example is worth more than a paragraph of instructions. Examples do two things: they demonstrate the correct output format AND they calibrate the model's interpretation of what "correct" means.

import anthropic

client = anthropic.Anthropic()

# ❌ BAD: Instructions only - "simple" is undefined
def simplify_text_bad(technical: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Simplify this technical text for a non-technical audience: {technical}"
        }]
    )
    return response.content[0].text
# "Simple" means different things to different people/models


# ✅ GOOD: Example defines the quality bar exactly
def simplify_text_good(technical: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Rewrite technical text so anyone can understand it. Match the quality shown in this example.

Example:
Original: "The microservices architecture implements horizontal scaling through containerized deployments orchestrated by Kubernetes, enabling elastic resource allocation based on real-time demand signals."
Rewrite: "Instead of one big program, our system uses dozens of smaller programs that work together. When things get busy, we automatically start up more copies of whichever parts need help - like opening more checkout lanes at a grocery store when lines get long."

Notice in the example: no jargon, everyday analogies, same core meaning, shorter.

Now rewrite this:
Original: {technical}
Rewrite:"""
        }]
    )
    return response.content[0].text


# Pattern: Calibration examples - show the quality bar explicitly
def classify_with_examples(text: str, label_type: str) -> str:
    EXAMPLE_SETS = {
        "urgency": [
            ("Server is completely down. 500 customers can't log in. We're losing $50k/hour.", "CRITICAL"),
            ("The export button doesn't work in Firefox. Works fine in Chrome.", "MEDIUM"),
            ("Could you add a dark mode option? Would be a nice quality of life improvement.", "LOW"),
            ("Getting intermittent 504 timeouts on the API, maybe 1 in 20 requests. Affecting some workflows.", "HIGH"),
        ],
        "sentiment": [
            ("This is exactly what I needed. Saved me hours of work.", "POSITIVE"),
            ("It's fine. Does what it says. Nothing special.", "NEUTRAL"),
            ("Been waiting 3 weeks for this bug to be fixed. Completely unacceptable.", "NEGATIVE"),
            ("Love the new features but the performance got way worse. Mixed feelings.", "MIXED"),
        ],
    }

    examples = EXAMPLE_SETS.get(label_type, [])
    examples_text = "\n".join([f'Text: "{ex[0]}"\nLabel: {ex[1]}' for ex in examples])

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": f"""Classify the text using these examples as a reference:

{examples_text}

Text: "{text}"
Label:"""
        }]
    )
    return response.content[0].text.strip()

The Complete Prompt Quality Checklist

from dataclasses import dataclass


@dataclass
class PromptChecklist:
    """
    Systematic quality checklist for any production prompt.
    Score 0.0-1.0. Don't deploy below 0.8 (the DEPLOY-READY threshold in report()).
    """
    # Format & Structure
    output_format_explicit: bool = False      # JSON schema, exact headers, etc.
    context_precedes_task: bool = False       # Role/background BEFORE instructions
    uses_structural_markers: bool = False     # XML tags for complex multi-section prompts
    steps_explicit_for_complex: bool = False  # Multi-step tasks have numbered steps

    # Constraint Quality
    has_do_not_list: bool = False             # Explicit prohibitions
    has_escape_hatch: bool = False            # null/not_found for unknowns
    length_specified: bool = False            # Concrete length instruction

    # Example Quality
    has_at_least_one_example: bool = False    # Show, don't just tell
    example_matches_instruction: bool = False # Example format = instructed format

    # Production Readiness
    tested_on_10_real_inputs: bool = False    # Not just happy path
    tested_on_edge_cases: bool = False        # Unusual, ambiguous, empty inputs
    has_regression_suite: bool = False        # At least 10 test cases with expected output
    max_tokens_verified: bool = False         # Not too low (truncation) or too high (waste)
    cost_estimated: bool = False              # Know your token cost per call

    def score(self) -> float:
        # Note: uses_structural_markers, steps_explicit_for_complex, and
        # cost_estimated are tracked on the checklist but not counted here.
        all_checks = [
            self.output_format_explicit, self.context_precedes_task,
            self.has_do_not_list, self.has_escape_hatch, self.length_specified,
            self.has_at_least_one_example, self.example_matches_instruction,
            self.tested_on_10_real_inputs, self.tested_on_edge_cases,
            self.has_regression_suite, self.max_tokens_verified,
        ]
        return sum(all_checks) / len(all_checks)

    def report(self) -> str:
        issues = []
        if not self.output_format_explicit:
            issues.append("🔴 CRITICAL: No output format specified - add JSON schema or explicit format")
        if not self.context_precedes_task:
            issues.append("🔴 CRITICAL: No context before task - model lacks role/background framing")
        if not self.has_do_not_list:
            issues.append("🟡 WARNING: No negative constraints - model may do unexpected things")
        if not self.has_escape_hatch:
            issues.append("🟡 WARNING: No escape hatch - model will hallucinate when uncertain")
        if not self.length_specified:
            issues.append("🟡 WARNING: Length not specified - inconsistent response lengths")
        if not self.has_at_least_one_example:
            issues.append("🟡 WARNING: No examples - ambiguity not anchored by demonstration")
        # Only report a format mismatch when there is an example to mismatch.
        if self.has_at_least_one_example and not self.example_matches_instruction:
            issues.append("🔴 CRITICAL: Example format doesn't match instruction format")
        if not self.tested_on_10_real_inputs:
            issues.append("🔴 CRITICAL: Not tested on real inputs - unknown failure modes")
        # This check is counted in score(), so it is reported here too.
        if not self.tested_on_edge_cases:
            issues.append("🟡 WARNING: Not tested on edge cases - unusual, ambiguous, or empty inputs may fail")
        if not self.has_regression_suite:
            issues.append("🟡 WARNING: No regression tests - quality regressions won't be caught")
        if not self.max_tokens_verified:
            issues.append("🟡 WARNING: max_tokens not verified - may truncate or waste budget")

        score = self.score()
        status = "✅ DEPLOY-READY" if score >= 0.8 else "⚠️ REVIEW NEEDED" if score >= 0.6 else "🚫 NOT READY"
        return f"Score: {score:.0%} {status}\n\n" + "\n".join(issues)

:::danger The Overcrowded Prompt Anti-Pattern Prompts with more than 10-12 distinct rules are unreliable. Models can't hold 30 constraints in "working memory" simultaneously - they'll follow the ones that are salient and forget the rest. When you find yourself writing a 40-item constraint list, it's a signal to decompose the task into a template system, pipeline, or multiple smaller prompts. :::

:::danger Ambiguous Qualifiers "Be brief" is meaningless without a reference. One sentence? Three sentences? One paragraph? Always anchor qualitative requirements to concrete examples or token counts. "Keep your response under 100 words" is better. "Match the length of this example: [example]" is best. :::

:::warning Contradictory Instructions "Write a casual but professional response" and "Be brief but thorough" are contradictions. The model will pick one interpretation and ignore the other - inconsistently. When you notice contradictory constraints, decide which is actually more important and delete the other. Better to be consistently one thing than incoherently two things. :::

:::tip The Minimal Viable Prompt First Principle Start with the shortest prompt that produces acceptable output. Add complexity only when you find specific failures. A short prompt with an 85% success rate is easier to improve than a long prompt with a 60% success rate and unknown failure modes. Every instruction you add to a prompt is a hypothesis - test it. :::

Interview Q&A

Q: What are the most impactful changes you can make to improve a prompt's reliability?

A: Three highest-leverage changes, in order of impact. First: specify the output format explicitly with a schema. If you need JSON, show the exact field names and types. Format ambiguity is the leading cause of downstream parsing failures and the easiest to fix. Second: add context before the task. The role, domain expertise, audience, purpose, and operating constraints should all come before the actual instruction - they frame how the model interprets everything that follows. Third: add explicit negative constraints - a "DO NOT" list for the most common wrong behaviors you've seen. Models default to common patterns from training data; negative constraints override those defaults. These three changes move most prompts from 60-70% reliability to 90%+.

Q: When should you use XML tags in prompts?

A: Use XML tags when a prompt has multiple distinct sections that the model needs to treat differently. The canonical use cases: (1) RAG prompts with instructions, retrieved documents, and user query - tag each section so the model knows what's an instruction vs. reference material vs. input. (2) Few-shot prompts with multiple examples - <example>, <input>, <output> tags make the structure clear and prevent the model from confusing examples with actual input. (3) Agent prompts with conversation history, tool outputs, and current task - tags prevent the model from confusing historical context with current instructions. For simple single-section prompts, XML tags add complexity without benefit.
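As an illustration of the few-shot case, here is a minimal sketch of a prompt builder that wraps each section in tags. The function name `build_few_shot_prompt` and the outer `<instructions>` and `<examples>` tag names are assumptions made for this example. Escaping the inserted text is a design choice added here, so that stray tags in an input cannot close a section early.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape


def build_few_shot_prompt(
    instructions: str,
    examples: List[Tuple[str, str]],
    user_input: str,
) -> str:
    """Assemble a tagged prompt: instructions, then examples, then the real input."""
    example_blocks = "\n".join(
        "<example>\n"
        f"<input>{escape(inp)}</input>\n"
        f"<output>{escape(out)}</output>\n"
        "</example>"
        for inp, out in examples
    )
    return (
        f"<instructions>\n{instructions}\n</instructions>\n\n"
        f"<examples>\n{example_blocks}\n</examples>\n\n"
        # The actual input is escaped so it cannot masquerade as prompt structure.
        f"<input>\n{escape(user_input)}\n</input>"
    )
```

Because the examples and the real input live in separate, clearly named sections, the model has an unambiguous signal about which text to imitate and which text to act on.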

Q: How do you handle uncertainty in model outputs to prevent hallucination?

A: Four complementary techniques. First, the escape hatch: explicitly allow null/not_found outputs for anything the model might not know. "If not stated in the document, return null" removes the pressure to hallucinate. Second, confidence levels: ask the model to score its certainty (HIGH/MEDIUM/LOW/UNCERTAIN) alongside each answer - this makes uncertainty visible instead of hidden inside a confident-sounding response. Third, grounding: for factual questions, provide source documents and instruct the model to cite them and only use facts from those sources. Fourth, two-pass verification: generate an answer, then run a second prompt that specifically asks "Does this answer contain any claims not supported by the provided sources?" This catches hallucinations that slipped through the first pass.
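The escape hatch, confidence levels, and two-pass verification can be sketched together as plain prompt builders plus a parser. This is an illustrative sketch under stated assumptions: the function names, the JSON shape, and the `needs_review` flag are hypothetical, and the model call itself is left out so the code runs without an API key.

```python
import json

CONFIDENCE_LEVELS = {"HIGH", "MEDIUM", "LOW", "UNCERTAIN"}


def build_grounded_prompt(question: str, source: str) -> str:
    """First pass: grounded answer with an explicit null escape hatch."""
    return (
        f"<source>\n{source}\n</source>\n\n"
        "Answer the question using only the source above.\n"
        'Return JSON: {"answer": string or null, '
        '"confidence": "HIGH" | "MEDIUM" | "LOW" | "UNCERTAIN"}.\n'
        'If the answer is not stated in the source, return null for "answer".\n\n'
        f"Question: {question}"
    )


def build_verification_prompt(answer: str, source: str) -> str:
    """Second pass: ask whether the answer makes unsupported claims."""
    return (
        f"<source>\n{source}\n</source>\n\n"
        f"<answer>\n{answer}\n</answer>\n\n"
        "Does this answer contain any claims not supported by the provided "
        "source? Reply SUPPORTED or UNSUPPORTED, then list any unsupported claims."
    )


def parse_grounded_answer(raw: str) -> dict:
    """Parse the first-pass JSON and flag results that need human review."""
    data = json.loads(raw)
    confidence = str(data.get("confidence", "")).upper()
    if confidence not in CONFIDENCE_LEVELS:
        # Treat a missing or unrecognized level as the least confident one.
        confidence = "UNCERTAIN"
    answer = data.get("answer")
    needs_review = answer is None or confidence in {"LOW", "UNCERTAIN"}
    return {"answer": answer, "confidence": confidence, "needs_review": needs_review}
```

Treating a missing confidence value as `UNCERTAIN` is a conservative default: a malformed response gets routed to review instead of being trusted.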

Q: What's the difference between system prompts and user message prompts, and when do you use each?

A: System prompts set the AI's role, capabilities, constraints, and operating context - they're the "configuration" layer that persists across the entire conversation. User messages contain the specific task or input for the current turn. In practice: put everything that's true for all possible interactions in the system prompt (persona, format rules, constraints, domain knowledge, negative constraints). Put task-specific content in the user message (the document to analyze, the specific question, the current customer's situation). This separation enables prompt caching (on providers that support it, such as Anthropic, a stable system prompt is cached and reread on later calls at roughly 10% of the normal input-token price), cleaner conversation flow, and easier testing.
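A minimal sketch of this split, assuming the Anthropic Messages API used elsewhere in this lesson: the stable configuration goes in `system` with a `cache_control` marker, and only the per-turn content goes in `messages`. The helper name `build_request` and the `max_tokens` value are illustrative assumptions. The function only builds the request arguments, so it can be inspected without making a call.

```python
def build_request(
    system_prompt: str,
    user_content: str,
    model: str = "claude-opus-4-6",  # model name taken from the examples above
) -> dict:
    """Build keyword arguments for client.messages.create()."""
    return {
        "model": model,
        "max_tokens": 300,
        # Stable, interaction-independent configuration: a candidate for caching.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Task-specific content for the current turn only.
        "messages": [{"role": "user", "content": user_content}],
    }
```

A call would then look like `client.messages.create(**build_request(system_prompt, question))`. Providers typically impose a minimum length before a prefix is cached, so a very short system prompt may not benefit.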

Q: How do you systematically improve a prompt that produces inconsistent results?

A: The scientific method applied to prompts. Step 1: collect 10-20 real failing inputs from production logs - not recreations, the exact verbatim inputs. Step 2: run them against the current prompt to confirm the failure is systematic (more than 50% failure rate means it's a prompt issue, not just noise). Step 3: isolate by removing sections one at a time until the failure disappears - this tells you which section is causing the problem. Step 4: diagnose the root cause type: ambiguous instruction (add an example), missing constraint (add a DO NOT), conflicting rules (rewrite to resolve), format mismatch (align example and instruction format). Step 5: apply the fix, run all failing inputs again. Step 6: add them to a regression test suite so future prompt changes don't reintroduce the same failures.
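Steps 5 and 6 can be sketched as a small regression harness. This is a minimal sketch, not a prescribed tool: `RegressionCase`, `run_regression`, and the `run_prompt` callable (any function that sends one input through your prompt and returns the model's text) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RegressionCase:
    name: str
    input_text: str                 # verbatim input from production logs
    check: Callable[[str], bool]    # returns True if the output is acceptable


def run_regression(
    run_prompt: Callable[[str], str],
    cases: List[RegressionCase],
) -> dict:
    """Run every case through the prompt and report which ones fail."""
    failed = []
    for case in cases:
        try:
            passed = case.check(run_prompt(case.input_text))
        except Exception:
            # An exception in the call or the check counts as a failure.
            passed = False
        if not passed:
            failed.append(case.name)

    total = len(cases)
    pass_rate = (total - len(failed)) / total if total else 0.0
    return {"total": total, "failed": failed, "pass_rate": pass_rate}
```

Rerunning the same cases after every prompt change turns "it seems better" into a pass rate you can compare, and the list of failed case names points directly at what regressed.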

Q: How do you decide how many examples to include in a few-shot prompt?

A: Returns typically diminish beyond 7-10 examples for most classification tasks - accuracy plateaus and you're just paying for more tokens. Start with 3-4 examples that cover: the most common case, the hardest edge case, and 1-2 contrast cases (similar input, different output). Measure accuracy on a test set. Add more examples only if specific failure patterns emerge that better examples would address. For production systems, use prompt caching to amortize the cost of examples - with caching, the examples are billed at the reduced cached-read rate after the first call, so a 10-example prompt costs only modestly more than a 2-example prompt. Dynamic example selection (retrieving the most similar examples to each input) often outperforms static example sets for tasks with high input diversity.
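Dynamic example selection can be sketched in a few lines. Production systems usually rank examples by embedding similarity; this sketch swaps in Jaccard token overlap so that it runs with the standard library alone. The function names are hypothetical.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_examples(
    query: str,
    pool: List[Tuple[str, str]],
    k: int = 3,
) -> List[Tuple[str, str]]:
    """Return the k (text, label) examples whose text is most similar to the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        pool,
        key=lambda example: jaccard(query_tokens, tokenize(example[0])),
        reverse=True,
    )
    return ranked[:k]
```

The selected pairs can then be formatted into the prompt in place of a fixed example list. Token overlap misses paraphrases, which is the main reason to move to embeddings once the example pool grows.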

© 2026 EngineersOfAI. All rights reserved.