Prompt Versioning

The Regression No One Saw Coming

The year is early 2024. A B2B SaaS company's flagship product is an AI contract intelligence feature - customers upload legal contracts, and the AI extracts key clauses, important dates, payment terms, and termination conditions. The feature launched in Q4 2023 to strong reviews. Extraction accuracy was 94% on their internal test set. Enterprise customers were signing up. Revenue was growing.

In January, a product manager files a Jira ticket: "The AI sometimes uses jargon that customers find confusing. Can we make the language friendlier?" A junior engineer opens the system prompt file, adds three sentences about using plain, accessible language, and deploys directly to production. It takes forty minutes. No review, no testing - after all, it is just wording. The engineer mentions it in standup as "a small UX improvement." Nobody thinks twice.

Three weeks later, a Fortune 500 customer files a support ticket: the AI is no longer extracting termination clauses correctly. The contract says "either party may terminate with 30 days written notice" and the AI returns "no termination clause found." The support team investigates. Then another ticket comes in. Then five more. Extraction accuracy for termination clauses has dropped from 96% to 61% across all customers.

The root cause, eventually identified: the "plain language" instruction subtly changed how the model weighted formal legal boilerplate. The model now deprioritizes verbose legal language, which is exactly what termination clauses contain.

The engineer who made the change has since moved to a different team. The original system prompt file has been edited four more times since January. There is no usable version history because the prompts were inlined into Python code: Git log shows that the file changed, but nothing records which prompt text was live at each point or how it scored. The team could not roll back because they could not say what to roll back to. They had to reconstruct the original prompt from memory and Slack messages.

This is what the absence of prompt versioning costs in production.

Why This Exists

When software engineers write code, they have fifty years of accumulated tooling for managing change: version control (git), branching strategies (feature branches, gitflow), code review (pull requests), CI pipelines (automated tests block bad merges), staged deployments (dev → staging → production), and instant rollback (revert commit, redeploy). None of this infrastructure exists by default for prompts.

The default state of most LLM applications: prompts live as string literals inside Python files, modified in-place, often duplicated across multiple files, with no structured versioning, no review gate, and no rollback mechanism. This is the equivalent of deploying application code by editing files directly on the production server - a practice the industry abandoned long ago.

Prompt versioning matters more than most engineers initially appreciate because a prompt change is a behavior change. It is not documentation. It is not configuration. It is the primary program logic of an LLM feature, and any change to it can alter outputs for every user, in ways that are subtle and non-obvious until they compound.

The discipline of prompt versioning borrows from software version control, applies it to prompt artifacts, and adds the LLM-specific layer of evaluation: because you cannot test a prompt with a == assertion, every version must be scored against a golden eval dataset before it can be promoted. The key insight is that a prompt without an eval score is an unvalidated claim - like deploying code with no tests and no benchmarks.

Core Concepts

Prompts Are Versioned Artifacts, Not Strings

A prompt version is not just the text of the system prompt. It is a complete specification of model behavior that includes every parameter affecting output:

Prompt Version: customer-support/1.3.2
    system_prompt: "You are a helpful customer support agent..."
    user_template: "Customer question: {input}\nContext: {context}"
    model: claude-3-5-sonnet-20241022
    temperature: 0.3
    max_tokens: 512
    created_at: 2024-03-01T09:00:00Z
    eval_score: 0.91
    eval_dataset: customer-support-golden-v3
    status: production
    change_summary: "Added wait time acknowledgment before solution offering"

Every field matters. The model version matters because the same system prompt can behave very differently on a small model such as Claude Haiku versus a large one such as Claude Opus. Temperature matters because 0.0 and 0.7 produce different output distributions. Max tokens matters because a model that runs out of tokens mid-response produces truncated, incorrect outputs. Two prompts with identical text but different temperatures or models are different prompt versions and must be evaluated and scored separately.

Semantic Versioning for Prompts

Borrow semver (MAJOR.MINOR.PATCH) with LLM-specific semantics:

| Bump | When to use | Eval requirement |
| --- | --- | --- |
| MAJOR | Fundamental behavior change - new task, new output format, new persona | Full re-evaluation + team sign-off |
| MINOR | New capability - added few-shot examples, new instruction clause, new edge case handler | Must pass eval gate |
| PATCH | Wording fix, typo correction, minor clarification that should not change behavior | Must still pass eval gate |

The rule is absolute: every change gets a new version number, no exceptions. Even a one-character change is at minimum a PATCH bump. The eval gate runs on every version. There is no "too small to test" change.
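The bump rules above are mechanical enough to sketch in a few lines. The helper below is a hypothetical illustration (the name `bump_version` is not part of any library), and it assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffixes.

```python
def bump_version(version: str, level: str) -> str:
    """Return the next version string for a major, minor, or patch bump."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        # New task, output format, or persona: reset minor and patch
        return f"{major + 1}.0.0"
    if level == "minor":
        # New capability: reset patch
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        # Wording fix that should not change behavior
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown bump level: {level!r}")


print(bump_version("1.3.2", "patch"))  # 1.3.3
print(bump_version("1.3.2", "minor"))  # 1.4.0
print(bump_version("1.3.2", "major"))  # 2.0.0
```

Which level applies is still a human judgment; the helper only guarantees that the number always moves forward.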

The Prompt Registry

A prompt registry is the authoritative versioned store for prompt artifacts. It provides:

  • Version history: every version of every prompt, immutable once written
  • Metadata: author, eval score, eval dataset, promotion date, status
  • Retrieval: load by name + version, or by status label (production, staging)
  • A/B routing: send X% of traffic to version A, Y% to version B
  • Promotion workflow: draft → staging → production, with eval gates at each transition

Git-Based Prompt Versioning

The simplest and most practical approach for most teams is to version prompts as YAML files in Git alongside application code. This gives you history, diff, blame, and pull request workflows for free - no additional infrastructure required.

Directory Structure

prompts/
    customer-support/
        v1.0.0.yaml               # initial version - immutable
        v1.1.0.yaml               # minor improvement - immutable
        v1.2.0.yaml               # added empathy instruction - immutable
        v1.3.0.yaml               # current production - immutable
        current -> v1.3.0.yaml    # symlink or config pointing to active version
    contract-extraction/
        v1.0.0.yaml
        v2.0.0.yaml               # MAJOR: new JSON output format
        current -> v2.0.0.yaml
    email-summarization/
        v1.0.0.yaml
        current -> v1.0.0.yaml
    CHANGELOG.md                  # human-readable log of promotions and rollbacks

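Critical rule: prompt YAML files are immutable once written. You never edit the prompt text or parameters in v1.2.0.yaml; you write v1.3.0.yaml. (The registry implementation later in this section does rewrite lifecycle metadata such as status and promoted_at in place, but never the fields that affect model behavior.) Immutability is what makes rollback work - you always have a known-good version to return to.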
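One way to make the immutability rule hard to break by accident is to create version files in exclusive mode, so a second write to the same version fails instead of overwriting. This is a minimal sketch; the function name `write_new_version` and the temporary directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def write_new_version(prompt_dir: Path, version: str, yaml_text: str) -> Path:
    """Create v<version>.yaml, refusing to overwrite an existing version file."""
    prompt_dir.mkdir(parents=True, exist_ok=True)
    path = prompt_dir / f"v{version}.yaml"
    # Mode "x" creates the file and raises FileExistsError if it already exists
    with open(path, "x", encoding="utf-8") as f:
        f.write(yaml_text)
    return path


# Demo in a throwaway directory so nothing touches a real prompts/ tree
with tempfile.TemporaryDirectory() as tmp:
    prompt_dir = Path(tmp) / "customer-support"
    write_new_version(prompt_dir, "1.4.0", "name: customer-support\n")
    try:
        write_new_version(prompt_dir, "1.4.0", "name: edited-in-place\n")
    except FileExistsError:
        print("v1.4.0 already exists; bump the version instead")
```

A file-system guard like this complements, but does not replace, a review rule that rejects diffs touching existing version files.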

Prompt YAML Format

# prompts/customer-support/v1.3.0.yaml
name: customer-support
version: "1.3.0"
status: production
model: claude-3-5-sonnet-20241022
temperature: 0.3
max_tokens: 512
created_at: "2024-03-01T09:00:00Z"
promoted_at: "2024-03-03T14:22:00Z"
eval_dataset: customer-support-golden-v3
eval_score: 0.91
previous_version: "1.2.0"
change_summary: >
  Added explicit instruction to acknowledge wait time before offering solutions.
  Fixes CSAT regression identified in Feb A/B test for complaint-type tickets.

system: |
  You are a helpful, empathetic customer support agent for Acme Corp.
  Your job is to resolve customer issues efficiently and leave customers
  feeling heard and valued.

  Guidelines:
  - Always acknowledge the customer's situation before offering solutions.
    If they mention a wait, acknowledge it. If they are frustrated, name that.
  - Be concise - aim for responses under 150 words unless the issue is complex
  - If you do not know something, say so clearly and offer to escalate
  - Never invent policies, prices, or timelines - accuracy over completeness
  - Use plain, friendly language - avoid jargon and corporate speak

template: |
  Customer question: {input}

  Relevant knowledge base context:
  {context}

  Respond helpfully and concisely.

few_shot_examples:
  - input: "My order has not arrived after 10 days"
    output: |
      Ten days is too long to wait - I am sorry about that. Let me look
      into this right away.

      Could you share your order number? Once I have that, I can check the
      current status and figure out the best next step for you.
  - input: "I was charged twice for my subscription"
    output: |
      Being charged twice is never okay - I apologize for that error and
      I want to get this fixed for you today.

      Can you confirm the last four digits of the card that was charged?
      I will pull up your account and process the duplicate charge refund
      immediately.
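Because every field in this file affects behavior or auditability, it can help to validate a parsed prompt file before it enters the registry. The sketch below works on a plain dict (for example, the result of `yaml.safe_load`); the `REQUIRED_FIELDS` table and the `validate_spec` name are assumptions for illustration, not a fixed schema.

```python
import re

# Hypothetical schema: field name -> accepted Python type(s)
REQUIRED_FIELDS = {
    "name": str,
    "version": str,
    "model": str,
    "temperature": (int, float),
    "max_tokens": int,
    "system": str,
    "template": str,
}


def validate_spec(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field_name, expected in REQUIRED_FIELDS.items():
        if field_name not in data:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(data[field_name], expected):
            actual = type(data[field_name]).__name__
            problems.append(f"wrong type for {field_name}: {actual}")
    version = data.get("version")
    # An unquoted YAML value such as 1.3 parses as a float, so insist on a string
    if isinstance(version, str) and not re.fullmatch(r"\d+\.\d+\.\d+", version):
        problems.append(f"version is not MAJOR.MINOR.PATCH: {version}")
    return problems


example = {
    "name": "customer-support",
    "version": "1.3.0",
    "model": "claude-3-5-sonnet-20241022",
    "temperature": 0.3,
    "max_tokens": 512,
    "system": "You are a helpful, empathetic customer support agent.",
    "template": "Customer question: {input}",
}
print(validate_spec(example))  # []
```

The version check is the reason the YAML above quotes `"1.3.0"`: quoting keeps the value a string regardless of how many dots it contains.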

Full Prompt Registry Implementation

Here is a complete, production-ready prompt registry with version tracking, promotion workflow, A/B testing, rollback, and diff comparison:

# registry/prompt_registry.py
"""
Production prompt registry with:
- Immutable version storage (YAML files)
- Promotion lifecycle: draft → staging → production → deprecated
- Eval score gates at each promotion
- A/B traffic routing with consistent hashing
- One-command rollback
- Diff comparison between versions
"""
import hashlib
import json
import random
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Optional

import anthropic
import yaml


@dataclass
class PromptSpec:
    """
    Complete specification of a prompt version.
    Every field that affects model behavior is captured here.
    """
    name: str
    version: str
    system: str
    template: str
    model: str = "claude-3-5-sonnet-20241022"
    temperature: float = 0.3
    max_tokens: int = 1024
    status: str = "draft"  # draft | staging | production | deprecated
    eval_score: Optional[float] = None
    eval_dataset: Optional[str] = None
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    promoted_at: Optional[str] = None
    author: Optional[str] = None
    change_summary: Optional[str] = None
    previous_version: Optional[str] = None
    few_shot_examples: list = field(default_factory=list)

    @property
    def hash(self) -> str:
        """
        Stable content hash. If this changes, something that affects
        model behavior changed. Used to detect silent file modifications.
        Lifecycle metadata (status, scores, timestamps) is deliberately excluded.
        """
        content = (
            f"{self.system}|{self.template}|{self.model}"
            f"|{self.temperature}|{self.max_tokens}"
            # Few-shot examples also shape outputs, so they are part of the hash
            f"|{json.dumps(self.few_shot_examples, sort_keys=True)}"
        )
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    def format_user_message(self, **kwargs: str) -> str:
        """
        Fill template variables by literal replacement of {name} placeholders.
        Placeholders with no matching keyword argument are left unchanged.
        """
        msg = self.template
        for k, v in kwargs.items():
            msg = msg.replace(f"{{{k}}}", str(v))
        return msg


class PromptRegistry:
    """
    File-backed prompt registry. For production at scale, replace the YAML
    file backend with DynamoDB, PostgreSQL, or a managed service (Langfuse).
    The interface is the same regardless of backend.
    """

    # Keyed by the CURRENT status: where a version may go next, and the
    # minimum eval score it needs to leave its current tier.
    PROMOTION_RULES = {
        "draft": {"allowed_next": ["staging"], "min_eval_score": 0.75},
        "staging": {"allowed_next": ["production"], "min_eval_score": 0.85},
        "production": {"allowed_next": ["deprecated"], "min_eval_score": None},
        "deprecated": {"allowed_next": [], "min_eval_score": None},
    }

    def __init__(self, store_dir: str = "prompt_registry"):
        self.store = Path(store_dir)
        self.store.mkdir(parents=True, exist_ok=True)
        self._routing: dict[str, dict[str, float]] = {}
        self._load_routing()

    # ── Storage ───────────────────────────────────────────────────────────

    def _prompt_dir(self, name: str) -> Path:
        d = self.store / name
        d.mkdir(exist_ok=True)
        return d

    def save(self, spec: PromptSpec) -> None:
        """
        Save a new prompt version. Enforces immutability - raises if the version
        already exists with different content.
        You must bump the version number to save any change.
        """
        path = self._prompt_dir(spec.name) / f"v{spec.version}.yaml"
        if path.exists():
            existing = self._load_from_path(path)
            if existing.hash != spec.hash:
                raise ValueError(
                    f"Version {spec.version} already exists for '{spec.name}' "
                    f"with different content. Bump the version number."
                )
            return  # Same content, idempotent

        with open(path, "w") as f:
            yaml.dump(spec.__dict__, f, default_flow_style=False, allow_unicode=True)
        print(f"Saved {spec.name}@{spec.version} (hash={spec.hash})")

    def load(self, name: str, version: str) -> PromptSpec:
        """Load a specific prompt version."""
        path = self._prompt_dir(name) / f"v{version}.yaml"
        if not path.exists():
            raise FileNotFoundError(f"Prompt {name}@{version} not found at {path}")
        return self._load_from_path(path)

    def _load_from_path(self, path: Path) -> PromptSpec:
        with open(path) as f:
            data = yaml.safe_load(f)
        # Ignore unknown keys so older or hand-written files still load
        return PromptSpec(**{k: v for k, v in data.items()
                             if k in PromptSpec.__dataclass_fields__})

    def list_versions(self, name: str) -> list[dict]:
        """List all versions for a prompt, newest first."""
        def version_key(p: Path) -> tuple:
            # Sort numerically so v1.10.0 ranks above v1.2.0.
            # Assumes plain MAJOR.MINOR.PATCH file names.
            return tuple(int(part) for part in p.stem[1:].split("."))

        versions = []
        for f in sorted(self._prompt_dir(name).glob("v*.yaml"), key=version_key):
            spec = self._load_from_path(f)
            versions.append({
                "version": spec.version,
                "status": spec.status,
                "eval_score": spec.eval_score,
                "created_at": spec.created_at,
                "hash": spec.hash,
                "change_summary": spec.change_summary or "",
                "author": spec.author or "",
            })
        return list(reversed(versions))

    def get_production_version(self, name: str) -> Optional[PromptSpec]:
        """Get the newest version marked production for a prompt, if any."""
        for v_info in self.list_versions(name):
            if v_info["status"] == "production":
                return self.load(name, v_info["version"])
        return None

    # ── Lifecycle Management ──────────────────────────────────────────────

    def promote(
        self,
        name: str,
        version: str,
        to_status: str,
        eval_score: float,
    ) -> None:
        """
        Promote a prompt version to the next status tier.
        Enforces:
        - Valid promotion path (draft → staging → production)
        - Minimum eval score threshold for each tier
        """
        spec = self.load(name, version)
        rules = self.PROMOTION_RULES.get(spec.status, {})

        # Validate promotion path
        allowed_next = rules.get("allowed_next", [])
        if to_status not in allowed_next:
            raise ValueError(
                f"Cannot promote {name}@{version} from '{spec.status}' to '{to_status}'. "
                f"Allowed transitions: {allowed_next}"
            )

        # Enforce eval score threshold
        min_score = rules.get("min_eval_score")
        if min_score is not None and eval_score < min_score:
            raise ValueError(
                f"Eval score {eval_score:.3f} is below minimum {min_score:.3f} "
                f"required for promotion to '{to_status}'. "
                f"Improve the prompt and re-run the eval."
            )

        spec.status = to_status
        spec.eval_score = eval_score
        spec.promoted_at = datetime.utcnow().isoformat()

        # Only lifecycle metadata changes here; the content hash stays the same
        path = self._prompt_dir(name) / f"v{version}.yaml"
        with open(path, "w") as f:
            yaml.dump(spec.__dict__, f, default_flow_style=False, allow_unicode=True)

        print(f"Promoted {name}@{version} to {to_status} (eval_score={eval_score:.3f})")

    def rollback(self, name: str, to_version: str, reason: str = "") -> None:
        """
        Emergency rollback: restore a previous version to production.
        Deprecates all current production versions and restores the target.
        Routes 100% traffic to the target version immediately.
        """
        # Validate target version can be restored
        target = self.load(name, to_version)
        if target.status == "draft":
            raise ValueError(
                f"Cannot roll back to a draft version ({to_version}). "
                f"Target must be staging, production, or deprecated."
            )

        # Deprecate current production versions
        for v_info in self.list_versions(name):
            if v_info["status"] == "production" and v_info["version"] != to_version:
                self._set_status(name, v_info["version"], "deprecated")
                print(f"  Deprecated {name}@{v_info['version']}")

        # Restore target to production
        self._set_status(name, to_version, "production")

        # Update routing to 100% target
        self.set_routing(name, {to_version: 1.0})

        print(
            f"ROLLBACK COMPLETE: {name} → v{to_version}"
            + (f" (reason: {reason})" if reason else "")
        )

    def _set_status(self, name: str, version: str, status: str) -> None:
        spec = self.load(name, version)
        spec.status = status
        path = self._prompt_dir(name) / f"v{version}.yaml"
        with open(path, "w") as f:
            yaml.dump(spec.__dict__, f, default_flow_style=False, allow_unicode=True)

    # ── A/B Traffic Routing ───────────────────────────────────────────────

    def set_routing(self, name: str, routing: dict[str, float]) -> None:
        """
        Configure traffic split between prompt versions.
        routing = {"1.2.0": 0.90, "1.3.0": 0.10} -- must sum to 1.0

        Use consistent hashing in resolve() to ensure the same user
        always receives the same prompt version within a session.
        """
        total = sum(routing.values())
        if abs(total - 1.0) > 0.001:
            raise ValueError(f"Traffic fractions must sum to 1.0, got {total:.4f}")
        self._routing[name] = routing
        self._save_routing()
        print(f"Set routing for {name}: {routing}")

    def resolve(self, name: str, user_id: Optional[str] = None) -> PromptSpec:
        """
        Resolve the active prompt for a request, respecting A/B routing.

        If user_id is provided, uses consistent hashing so the same user
        always receives the same prompt version across requests in a session.
        If no user_id, uses random sampling.
        """
        routing = self._routing.get(name)
        if routing:
            if user_id:
                # Consistent hashing: same user → same version.
                # MD5 is used only for bucketing here, not for security.
                bucket = int(hashlib.md5(
                    f"{user_id}:{name}".encode()
                ).hexdigest(), 16) % 10000 / 10000.0
            else:
                bucket = random.random()

            # Walk the cumulative distribution until the bucket falls inside a slice
            cumulative = 0.0
            for version, fraction in routing.items():
                cumulative += fraction
                if bucket < cumulative:
                    return self.load(name, version)

        # Fallback: load the latest production version
        prod = self.get_production_version(name)
        if prod:
            return prod

        raise ValueError(f"No production version found for prompt '{name}'")

    def _save_routing(self) -> None:
        path = self.store / "_routing.json"
        path.write_text(json.dumps(self._routing, indent=2))

    def _load_routing(self) -> None:
        path = self.store / "_routing.json"
        if path.exists():
            self._routing = json.loads(path.read_text())

    # ── Diff ──────────────────────────────────────────────────────────────

    def diff(self, name: str, version_a: str, version_b: str) -> dict:
        """
        Show what changed between two prompt versions.
        Used in PR reviews to understand what a prompt change actually changes.
        """
        a = self.load(name, version_a)
        b = self.load(name, version_b)

        changes = {}
        for field_name in ["system", "template", "model", "temperature", "max_tokens"]:
            val_a = getattr(a, field_name)
            val_b = getattr(b, field_name)
            if val_a != val_b:
                changes[field_name] = {
                    "from": val_a,
                    "to": val_b,
                    "type": "behavioral" if field_name in ("system", "template") else "parameter",
                }

        # Compare against None explicitly so a legitimate score of 0.0 still counts
        both_scored = a.eval_score is not None and b.eval_score is not None
        return {
            "name": name,
            "from_version": version_a,
            "to_version": version_b,
            "hash_changed": a.hash != b.hash,
            "n_changes": len(changes),
            "changes": changes,
            "eval_score_delta": (b.eval_score - a.eval_score) if both_scored else None,
        }


# ── Full Workflow Demo ────────────────────────────────────────────────────

def demo_full_workflow():
    """
    Demonstrates the complete prompt versioning workflow:
    save → evaluate → promote to staging → canary → promote to production → rollback
    """
    registry = PromptRegistry("demo_registry")
    client = anthropic.Anthropic()

    # ── 1. Initial version ────────────────────────────────────────────────

    v1 = PromptSpec(
        name="support",
        version="1.0.0",
        system="You are a helpful customer support agent for Acme Corp.",
        template="Customer: {input}",
        change_summary="Initial version",
    )
    registry.save(v1)

    # Simulate eval run → 0.88 score
    registry.promote("support", "1.0.0", "staging", eval_score=0.88)
    registry.promote("support", "1.0.0", "production", eval_score=0.88)
    registry.set_routing("support", {"1.0.0": 1.0})

    # ── 2. Improved version ───────────────────────────────────────────────

    v2 = PromptSpec(
        name="support",
        version="1.1.0",
        system=(
            "You are a helpful, empathetic customer support agent for Acme Corp. "
            "Always acknowledge the customer's frustration before offering a solution. "
            "Be concise - aim for responses under 150 words."
        ),
        template="Customer: {input}",
        change_summary=(
            "Added empathy instruction and conciseness guideline. "
            "Addresses CSAT complaints about robotic-sounding responses."
        ),
        previous_version="1.0.0",
    )
    registry.save(v2)

    # Eval run → 0.93 score - better than baseline
    registry.promote("support", "1.1.0", "staging", eval_score=0.93)

    # ── 3. Canary deployment: 10% traffic to new version ─────────────────

    registry.set_routing("support", {"1.0.0": 0.90, "1.1.0": 0.10})
    print("\nCanary deployment: 90% v1.0.0, 10% v1.1.0")

    # After 48h monitoring: quality holds. Promote to production.
    registry.promote("support", "1.1.0", "production", eval_score=0.93)
    registry.set_routing("support", {"1.1.0": 1.0})
    print("Full rollout: 100% v1.1.0")

    # ── 4. Show diff ─────────────────────────────────────────────────────

    diff = registry.diff("support", "1.0.0", "1.1.0")
    print("\nDiff from v1.0.0 to v1.1.0:")
    print(f"  Hash changed: {diff['hash_changed']}")
    print(f"  Changes: {diff['n_changes']} field(s)")
    for field_name, change in diff["changes"].items():
        print(f"\n  {field_name} ({change['type']}):")
        print(f"    FROM: {str(change['from'])[:120]}")
        print(f"    TO:   {str(change['to'])[:120]}")

    # ── 5. Version history ────────────────────────────────────────────────

    print("\nVersion history for 'support':")
    for v in registry.list_versions("support"):
        score_str = f"{v['eval_score']:.2f}" if v["eval_score"] else "-"
        print(
            f"  v{v['version']:8s} | {v['status']:11s} | score: {score_str:4s} | "
            f"{v['change_summary'][:60]}"
        )

    # ── 6. Use the registry to resolve and call ───────────────────────────

    spec = registry.resolve("support", user_id="user-42")
    message = client.messages.create(
        model=spec.model,
        max_tokens=spec.max_tokens,
        # Pass the versioned temperature too; it is part of the prompt spec
        temperature=spec.temperature,
        system=spec.system,
        messages=[{
            "role": "user",
            "content": spec.format_user_message(input="My order is late")
        }],
    )
    print(f"\nResponse (using {spec.name}@{spec.version}):")
    print(message.content[0].text)

    # ── 7. Emergency rollback ─────────────────────────────────────────────

    # If v1.1.0 caused an incident:
    # registry.rollback("support", "1.0.0", reason="v1.1.0 regression on foreign currency cases")


if __name__ == "__main__":
    demo_full_workflow()
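The `format_user_message` method above fills placeholders by plain string replacement, so a forgotten variable passes through silently as literal `{name}` text. A stricter variant can fail fast instead. The sketch below is one possible approach using the standard library's `string.Formatter`; the name `render_strict` is hypothetical, and it assumes templates use `{name}` placeholders and escape any literal braces as `{{` and `}}`.

```python
from string import Formatter


def render_strict(template: str, **values: str) -> str:
    """Fill {name} placeholders, raising KeyError if any variable is missing."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion)
    needed = {name for _, name, _, _ in Formatter().parse(template) if name}
    missing = needed - values.keys()
    if missing:
        raise KeyError(f"Missing template variables: {sorted(missing)}")
    return template.format(**values)


template = "Customer question: {input}\nContext: {context}"
print(render_strict(template, input="My order is late", context="Order shipped"))

try:
    render_strict(template, input="My order is late")
except KeyError as exc:
    print(f"Rejected: {exc}")
```

Failing at render time keeps a half-filled template from ever reaching the model, which is usually cheaper to debug than a strange response.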

CI Gate for Prompt Changes

The CI gate is the mechanical enforcement of the eval requirement. Without it, the process degrades into a suggestion that busy engineers bypass under deadline pressure. The gate must be automatic, must block merges on failure, and must post results as PR comments so reviewers can see the score context.

# ci/prompt_eval_gate.py
"""
Runs on every PR that touches files under prompts/.
Blocks merge if eval score drops below threshold.
Writes eval_gate_results.json, which the workflow posts as a PR comment.
"""
import sys
import json
import time
from pathlib import Path
import anthropic
import yaml


EVAL_THRESHOLD = 0.85


def load_eval_dataset(path: str) -> list[dict]:
    with open(path) as f:
        return json.load(f)


def get_changed_prompt_files(changed_files: list[str]) -> list[str]:
    """Filter for prompt YAML files changed in this PR."""
    return [
        f for f in changed_files
        if f.startswith("prompts/") and f.endswith(".yaml") and "/v" in f
        # A diff also lists deleted files; skip paths that no longer exist
        and Path(f).exists()
    ]


def run_single_eval(spec: dict, example: dict) -> tuple[float, str]:
    """Run one eval case and return (score, reasoning)."""
    client = anthropic.Anthropic()

    # Format the user message
    user_msg = spec["template"]
    for k, v in example.items():
        if k not in ("expected_behavior", "weight"):
            user_msg = user_msg.replace(f"{{{k}}}", str(v))

    # Get model response
    try:
        response = client.messages.create(
            model=spec["model"],
            max_tokens=spec["max_tokens"],
            temperature=spec.get("temperature", 0.3),
            system=spec["system"],
            messages=[{"role": "user", "content": user_msg}],
        )
        output = response.content[0].text
    except Exception as e:
        return 0.0, f"API error: {e}"

    # Judge the output
    judge_prompt = f"""You are evaluating an AI assistant response for quality.

User input: {example.get('input', user_msg)[:400]}
Expected behavior: {example.get('expected_behavior', 'Helpful, accurate, appropriate response')}
Actual response: {output[:600]}

Rate the response quality from 0.0 to 1.0:
1.0 - Fully satisfies expected behavior, accurate and complete
0.75 - Mostly satisfies with minor gaps
0.5 - Partially satisfies, missing key elements
0.25 - Significant problems
0.0 - Fails entirely or is harmful

Write one brief reason, then on a new line: SCORE: [decimal]"""

    judge_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=100,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    text = judge_response.content[0].text.strip()
    score = 0.5  # Neutral fallback if the judge output cannot be parsed
    reasoning = text
    for line in text.split("\n"):
        if "SCORE:" in line:
            try:
                # Tolerate a bracketed answer such as "SCORE: [0.75]"
                raw = line.split("SCORE:")[-1].strip().strip("[]")
                # Clamp to the documented 0.0-1.0 range
                score = min(1.0, max(0.0, float(raw)))
                break
            except ValueError:
                pass

    return score, reasoning


def evaluate_prompt(spec_data: dict, dataset_path: str) -> dict:
    """Run the full eval for one prompt file. Returns a result dict."""
    dataset = load_eval_dataset(dataset_path)
    print(f"  Evaluating {spec_data['name']}@{spec_data['version']} "
          f"on {len(dataset)} examples...")

    weighted_sum = 0.0
    total_weight = 0.0
    failures = []
    for i, example in enumerate(dataset):
        score, reasoning = run_single_eval(spec_data, example)
        weight = example.get("weight", 1.0)
        weighted_sum += score * weight
        total_weight += weight
        if score < 0.60:
            failures.append({
                "input": str(example.get("input", ""))[:80],
                "score": score,
                "reasoning": reasoning[:120],
            })
        print(f"  [{i+1:3d}/{len(dataset)}] score={score:.2f}")
        time.sleep(0.2)  # Rate limiting

    # Weighted mean: divide by the total weight, not the number of cases,
    # so the result stays in the 0.0-1.0 range when weights differ from 1.0
    mean_score = weighted_sum / total_weight if total_weight else 0.0
    return {
        "prompt_name": spec_data["name"],
        "prompt_version": spec_data["version"],
        "n_cases": len(dataset),
        "mean_score": mean_score,
        "passed": mean_score >= EVAL_THRESHOLD,
        "threshold": EVAL_THRESHOLD,
        "failures": failures[:3],  # First 3 failures for PR comment
    }


def main(changed_files_path: str) -> int:
    """Return exit code: 0 = all passed, 1 = at least one failed."""
    with open(changed_files_path) as f:
        changed_files = [line.strip() for line in f if line.strip()]

    changed_prompts = get_changed_prompt_files(changed_files)

    if not changed_prompts:
        print("No prompt files changed. Skipping prompt eval gate.")
        return 0

    print(f"Found {len(changed_prompts)} changed prompt file(s). Running evals...")
    all_results = []

    for prompt_file in changed_prompts:
        with open(prompt_file) as f:
            spec_data = yaml.safe_load(f)

        dataset_path = f"eval_datasets/{spec_data['name']}-golden.json"
        if not Path(dataset_path).exists():
            # A prompt with no eval dataset cannot be validated, so it fails
            # the gate rather than slipping through as a skipped check
            print(f"ERROR: No eval dataset at {dataset_path}.")
            all_results.append({
                "prompt_name": spec_data["name"],
                "prompt_version": spec_data["version"],
                "n_cases": 0,
                "mean_score": 0.0,
                "passed": False,
                "threshold": EVAL_THRESHOLD,
                "failures": [],
            })
            continue

        result = evaluate_prompt(spec_data, dataset_path)
        all_results.append(result)

        status = "PASS" if result["passed"] else "FAIL"
        print(
            f"\n  [{status}] {result['prompt_name']}@{result['prompt_version']}: "
            f"score={result['mean_score']:.3f} (threshold={result['threshold']})"
        )
        if result["failures"]:
            print("  Failed cases:")
            for failure in result["failures"]:
                print(f"    score={failure['score']:.2f}: {failure['input']}...")

    # Write machine-readable results for CI step
    with open("eval_gate_results.json", "w") as f:
        json.dump(all_results, f, indent=2)

    all_passed = all(r["passed"] for r in all_results)
    if not all_passed:
        print(
            f"\nEVAL GATE FAILED. "
            f"{sum(1 for r in all_results if not r['passed'])} prompt(s) below threshold. "
            f"Improve the prompt and re-run."
        )
    else:
        print(f"\nEVAL GATE PASSED. All {len(all_results)} prompt(s) above threshold.")

    return 0 if all_passed else 1


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "changed_files.txt"
    sys.exit(main(path))

And the corresponding GitHub Actions workflow:

# .github/workflows/prompt-ci.yml
name: Prompt CI Gate

on:
  pull_request:
    paths:
      - 'prompts/**'

# The comment step needs write access to the pull request
permissions:
  contents: read
  pull-requests: write

jobs:
  prompt-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for accurate diff

      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - run: pip install anthropic pyyaml

      - name: Detect changed prompt files
        run: |
          git diff --name-only origin/${{ github.base_ref }}...HEAD \
            | grep '^prompts/' > changed_files.txt || true
          echo "Changed files:"
          cat changed_files.txt

      - name: Run prompt eval gate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python ci/prompt_eval_gate.py changed_files.txt

      - name: Post results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            let body = '## Prompt Eval Gate\n\n';
            try {
              const results = JSON.parse(fs.readFileSync('eval_gate_results.json'));
              for (const r of results) {
                const icon = r.passed ? '✅' : '❌';
                body += `${icon} **${r.prompt_name}@${r.prompt_version}**\n`;
                body += `- Score: \`${r.mean_score.toFixed(3)}\` (threshold: ${r.threshold})\n`;
                body += `- Cases: ${r.n_cases}\n`;
                if (r.failures.length > 0) {
                  body += `- Failed cases:\n`;
                  for (const f of r.failures) {
                    body += `  - score=${f.score.toFixed(2)}: ${f.input}\n`;
                  }
                }
                body += '\n';
              }
            } catch (e) {
              body += '_No eval results found._\n';
            }
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });
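The gate script asks the judge to end with `SCORE: [decimal]`, so the judge may answer with or without brackets, and occasionally with a number outside the 0.0 to 1.0 range. A small, separately testable parser makes that step less fragile. This is a sketch under those assumptions; the name `parse_judge_score` is hypothetical, and returning `None` on failure (rather than a default score) is a design choice that lets the caller decide how to treat an unparseable verdict.

```python
import re
from typing import Optional

# Accepts "SCORE: 0.75", "SCORE: [0.75]", and "SCORE:1"
_SCORE_RE = re.compile(r"SCORE:\s*\[?\s*([0-9]*\.?[0-9]+)\s*\]?")


def parse_judge_score(text: str) -> Optional[float]:
    """Extract the judge's score, clamped to [0.0, 1.0], or None if absent."""
    match = _SCORE_RE.search(text)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group(1))))


print(parse_judge_score("Mostly correct.\nSCORE: 0.75"))   # 0.75
print(parse_judge_score("Complete answer.\nSCORE: [1.0]"))  # 1.0
print(parse_judge_score("The response was fine."))          # None
```

Counting how often the parser returns `None` across an eval run is also a cheap health check on the judge prompt itself.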

Prompt Review Checklist

Every prompt PR review should answer these questions before approving:

PROMPT CHANGE REVIEW CHECKLIST
================================

Content Review:
[ ] Does the change_summary accurately describe what changed and why?
[ ] Does the change affect any behavior covered by existing eval cases?
[ ] Are there new behaviors (new instructions, new edge case handling)
    that should have new eval cases written?
[ ] Does the system prompt contain any internally contradictory instructions?
[ ] Has the template format changed in a way that breaks downstream parsers?
[ ] Are there any PII risks in new few-shot examples?

Parameter Review:
[ ] If the model changed, is the new model version pinned explicitly?
[ ] If temperature changed, is this intentional? What behavior does it change?
[ ] If max_tokens changed, will this cause truncation or unexpected behavior?

Eval Review:
[ ] Was the eval suite run with this new version?
[ ] Is the eval score above the gate threshold (0.85 in the CI gate)?
[ ] Is the eval score at least as high as the previous production version?
[ ] Are there subset scores that dropped (edge_case, adversarial, etc.)?
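The subset-score question on the checklist is easy to miss by eye, because an overall score can rise while one category falls. The sketch below compares per-subset scores between two versions. The function name, the `tolerance` default, and all score values are hypothetical illustrations, not results from a real eval run.

```python
def find_subset_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {subset: delta} for subsets that dropped by more than tolerance."""
    regressions = {}
    for subset, old_score in baseline.items():
        new_score = candidate.get(subset)
        if new_score is None:
            continue  # Subset not scored for the candidate; handle separately
        if new_score < old_score - tolerance:
            regressions[subset] = round(new_score - old_score, 3)
    return regressions


# Hypothetical scores: the overall number improves while edge cases regress
baseline = {"overall": 0.91, "edge_case": 0.88, "adversarial": 0.80}
candidate = {"overall": 0.93, "edge_case": 0.79, "adversarial": 0.81}
print(find_subset_regressions(baseline, candidate))  # {'edge_case': -0.09}
```

In the contract-extraction story that opens this page, a per-clause breakdown of this kind is what would have surfaced the termination-clause drop before release.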

External Prompt Registries

For larger teams that need collaboration UI, A/B testing dashboards, and deeper evaluation integration, managed prompt registries provide significant value:

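| Platform | Prompt Storage | Version History | A/B Testing | Eval Integration | Self-Host |
| --- | --- | --- | --- | --- | --- |
| Langfuse | Yes | Yes | Yes | Yes (Score API) | Yes |
| LangSmith Hub | Yes | Yes | No | Yes (Datasets) | No |
| Humanloop | Yes | Yes | Yes | Yes | Partial |
| PromptLayer | Yes | Yes | Basic | Webhook | No |
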
# Using Langfuse prompt registry - production pattern
from langfuse import Langfuse
import anthropic

langfuse = Langfuse()
client = anthropic.Anthropic()


def run_with_langfuse_prompt(user_input: str, user_id: str) -> str:
    """
    Resolve the current production prompt from the Langfuse registry.
    Langfuse caches the prompt client-side with a 60-second TTL by default.
    Note: user_id is accepted for tracing but unused in this minimal example,
    and linking the trace to the prompt version needs additional
    instrumentation that is not shown here.
    """
    # Fetch the current production prompt - cached for 60s
    prompt = langfuse.get_prompt("customer-support", label="production")

    # Fill the template variables defined in the stored prompt
    compiled = prompt.compile(
        input=user_input,
        context="No additional context",
    )

    response = client.messages.create(
        # Hard-coded for brevity; ideally the model and parameters are stored
        # with the prompt version so they are versioned together
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": compiled}],
    )
    return response.content[0].text

Production Engineering Notes

Never Edit Prompts In-Place

The most dangerous pattern in LLM engineering: editing a prompt string directly in application code without creating a new version file. Even if the file is in Git, the lack of a formal version bump and eval gate means the change bypasses all safeguards. Always write a new version YAML; never modify an existing one. Treat existing prompt files as immutable - if you need to change it, you create a new file.

A/B Testing Is Not Optional at Scale

Once you have more than 5,000 requests per day, you typically have enough volume to reach statistical significance on an A/B test within about 48 hours, depending on the size of the effect you are trying to detect. Use A/B routing on every non-trivial prompt change instead of doing a direct cutover. The canary pattern (5–10% of traffic to the new version) is the safest default. Never route 100% of traffic to a new prompt version without at least 24 hours of monitoring on a canary cohort.

Prompt Change Log Is an Operational Asset

Maintain a human-readable CHANGELOG.md in your prompts directory. For every promotion, record: the date, the version, who promoted it, why, what eval score it achieved, and what the previous version's score was. When an incident happens at 3 AM, this log is what helps you understand which prompt was live and why it was chosen. Automate this log as part of your promotion script so it is never skipped.
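
One way to automate the log is to have the promotion script format and append an entry itself. This is a minimal sketch; the entry layout and the function names `format_changelog_entry` and `append_changelog` are assumptions, not an established format.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def format_changelog_entry(
    version: str,
    promoted_by: str,
    reason: str,
    eval_score: float,
    previous_version: str,
    previous_score: float,
    on: Optional[date] = None,
) -> str:
    """Render one promotion as a markdown entry (layout is illustrative)."""
    on = on or date.today()
    return (
        f"## {on.isoformat()} - {version}\n"
        f"- Promoted by: {promoted_by}\n"
        f"- Reason: {reason}\n"
        f"- Eval score: {eval_score:.2f} (previous {previous_version}: {previous_score:.2f})\n"
    )


def append_changelog(changelog: Path, entry: str) -> None:
    """Append the entry to CHANGELOG.md; called by the promotion script."""
    with open(changelog, "a", encoding="utf-8") as f:
        f.write(entry + "\n")
```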

:::warning Prompt Changes Need the Same Review as Code Changes A prompt that gains one sentence can break behavior in edge cases you did not test. Every prompt change requires a PR review by at least one engineer who understands the feature. The reviewer must look at the diff, read the change_summary, and verify the eval score. "It's just wording" is not an acceptable review posture for production LLM prompts. :::

:::danger Never Store Sensitive Data in Prompt Templates Prompt YAML files are tracked in Git. Never include API keys, customer data samples, PII, passwords, or confidential business logic in prompt templates. Use template variables ({customer_name}, {context}) for any dynamic data. Treat prompt files with the same security discipline as application configuration files. :::

:::tip Prompt Files as Living Documentation A well-maintained prompt YAML file is self-documenting: it tells you what the feature does (system prompt), what model and parameters it uses, who wrote it, why each version changed, and what its quality score is. Engineers joining the team can understand the entire history of an AI feature by reading the prompt files. Invest in good change summaries - they pay dividends at 3 AM. :::

Interview Q&A

Q1: Why do prompts need versioning if they are just strings?

Prompts are not just strings - they are the primary program logic for an LLM feature. A one-word change can alter model behavior for all users in ways that are subtle, non-obvious, and sometimes catastrophic. Without versioning, you lose the ability to audit what was running at any point in time, roll back to a known-good state, or understand which change caused a regression. Just as you would not edit application code directly on a production server without version control, you should not edit prompts without a versioning system and an eval gate. The contract intelligence story at the start of this module is the canonical example: an unreviewed prompt edit caused a silent regression that went unnoticed for three weeks and could not be rolled back, because nobody could reconstruct which prompt was live when the problem started.

Q2: How would you implement A/B testing for prompt versions in production?

The key is traffic routing at the point of prompt resolution, not at the application level. Maintain a routing configuration - for example, {"1.2.0": 0.90, "1.3.0": 0.10} - and resolve the prompt version probabilistically before making the LLM call. For consistency within a user session, use consistent hashing on the user ID so the same user always receives the same prompt version. Record the version used in every trace. After sufficient traffic - typically 48–72 hours for a moderate-traffic product - compare quality scores, user feedback rates, and task completion rates between versions using statistical significance testing (Mann-Whitney U or t-test). Only promote the new version to 100% if it is statistically equivalent to or better than the incumbent, and never if it is worse on any critical subset (edge cases, adversarial inputs).
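
The routing step described above can be sketched as follows. This uses deterministic hash bucketing (hash the user ID to a point between 0 and 1, then walk the cumulative weights), which gives the per-user stickiness the answer calls for; it is not a consistent-hashing ring. The function name `resolve_version` and the `salt` parameter are hypothetical.

```python
import hashlib
from typing import Dict


def resolve_version(user_id: str, routing: Dict[str, float], salt: str = "customer-support") -> str:
    """Pick a prompt version for a user; the same user always gets the same version."""
    if abs(sum(routing.values()) - 1.0) > 1e-9:
        raise ValueError("routing weights must sum to 1.0")

    # Hash salt + user ID and map the first 8 bytes to a roughly uniform point in [0, 1].
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64

    # Walk versions in a fixed order so bucket boundaries are stable across calls.
    versions = sorted(routing)
    cumulative = 0.0
    for version in versions:
        cumulative += routing[version]
        if point < cumulative:
            return version
    return versions[-1]  # guard against floating-point rounding at the top edge


# Example routing config from the text: 90% incumbent, 10% canary.
ROUTING = {"1.2.0": 0.90, "1.3.0": 0.10}
```

Salting with the prompt name keeps a user's bucket for one prompt independent of their bucket for another, so the same users are not always the canary cohort. The resolved version should be recorded in the trace for every request.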

Q3: What fields belong in a prompt version artifact, beyond the prompt text?

At minimum: prompt name, semantic version string, model name with explicit version (not just model family), temperature, max_tokens, system message text, user message template, author, creation timestamp, eval dataset name, eval score, promotion status (draft/staging/production/deprecated), previous version pointer, and change summary. The eval score and dataset reference are the most commonly omitted and the most important for later debugging. If you also store few-shot examples separately from the template, you can version them independently, which is useful when examples need updating more frequently than core instructions. The content hash of the behavioral fields (system + template + model + temperature + max_tokens) is also valuable - it detects accidental file modifications that do not bump the version number.
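
A content hash over the behavioral fields might be computed as in the sketch below. The key names in `BEHAVIORAL_FIELDS` are assumed to mirror the fields listed above; adjust them to whatever your prompt files actually use.

```python
import hashlib
import json

# Assumed key names for the behavioral fields named in the text.
BEHAVIORAL_FIELDS = ("system", "template", "model", "temperature", "max_tokens")


def content_hash(prompt: dict) -> str:
    """SHA-256 over the behavioral fields only, using a canonical JSON encoding."""
    behavioral = {key: prompt[key] for key in BEHAVIORAL_FIELDS}
    # sort_keys and fixed separators make the encoding independent of dict order
    # and whitespace, so equal content always produces the same hash.
    canonical = json.dumps(behavioral, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing this hash alongside the version lets CI recompute it and fail when the file content changed but the version number did not. Metadata such as author or change summary is deliberately excluded, so editing those does not alter the hash.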

Q4: How do you handle prompt versioning when multiple teams share the same prompt?

Treat it like a shared library with formal ownership. One team owns each prompt; other teams are consumers. Owners review all changes. Consumers pin to a specific version (for stability) or to the production label (to automatically receive the latest approved version). Semver signaling guides consumers: MAJOR bumps are breaking changes that require consumer review before adoption; MINOR and PATCH bumps are backward-compatible and can be absorbed automatically. Store prompts in a central registry (Langfuse or a shared Git repo). Other teams never copy prompt text into their codebases - they always reference it by name and version through the registry API. This prevents the "prompt as duplicated string" anti-pattern, where the same prompt exists in ten places with slight variations that none of the teams fully understand.
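
The semver rule for consumers can be expressed as a small check. This sketch assumes plain `MAJOR.MINOR.PATCH` version strings with no pre-release suffixes; `parse_semver` and `can_auto_adopt` are hypothetical helper names.

```python
from typing import Tuple


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into integers (no pre-release tags supported)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_auto_adopt(pinned: str, candidate: str) -> bool:
    """True if the candidate is a same-MAJOR upgrade (or equal) relative to the pin."""
    pinned_v, candidate_v = parse_semver(pinned), parse_semver(candidate)
    # Tuples compare numerically field by field, so 1.10.0 > 1.9.0 as expected.
    return candidate_v[0] == pinned_v[0] and candidate_v >= pinned_v
```

A consumer pinned to `1.2.0` would absorb `1.3.0` automatically under this rule, while `2.0.0` would be held back for review.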

Q5: What is the minimum viable prompt versioning setup for a two-person startup?

Three things. First, store all prompts in YAML files in a prompts/ directory in your Git repository - never as inline strings in Python or TypeScript files. One file per version, immutable. Second, require a pull request for every prompt change, no exceptions - even one-line tweaks. The PR review does not need to be elaborate, but it creates an audit trail and forces at least one other person to see the change. Third, write a 20-example golden eval dataset for each LLM feature and run a scoring script (LLM judge) as part of every PR. This takes one day to set up and is the single highest-ROI investment in LLMOps you can make. That is the complete minimum viable setup. Everything else - managed registries, A/B routing infrastructure, automated canary deployments - adds value but is not required for early-stage teams.

© 2026 EngineersOfAI. All rights reserved.