Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Prompt Version Management demo on the EngineersOfAI Playground - no code required. :::

Prompt Versioning and Management

The Silent Regression

It started with a seemingly innocuous request: the product team wanted to make the AI assistant "sound more professional." The engineer on call updated the system prompt, changed a few phrases from casual to formal, and deployed. No tests to run - it was just text changes. No review required - who reviews prompt word choices? The change went live in 20 minutes.

Two weeks later, a senior data scientist noticed something in the analytics: user session length had dropped 18% and the "thumbs up" rate had fallen from 74% to 61%. No code had changed. No infrastructure had changed. What had changed?

It took three days of investigation to trace the cause. The "more professional" tone had inadvertently removed examples of natural conversational follow-ups. Users were interpreting the formal responses as conversation-enders rather than conversation-continuers. They'd ask one question, get a stiff response, and leave instead of exploring further.

The fix was simple: revert to the previous prompt. But the previous prompt no longer existed anywhere. The engineer had edited the string in place. The old version was gone. The team spent four hours trying to reconstruct it from memory, Slack messages, and browser history. They got close, but not exact. The metrics never quite returned to the prior baseline.

This is why prompt versioning exists. Not because prompts are code in the traditional sense - but because their effects are as consequential as code, and they need the same discipline.

The Prompt-as-Code Philosophy

Treating prompts as code means applying software engineering discipline to prompt management:

  • Version control: Every change is tracked with who made it, when, and why
  • Immutability: Published versions are never modified - you create new versions
  • Traceability: Every LLM call is linked to the exact prompt version that generated it
  • Rollback: Any version can be restored instantly
  • Testing: Changes go through automated tests before production
  • Review: Significant prompt changes require review like significant code changes

Semantic Versioning for Prompts

Software uses semver: MAJOR.MINOR.PATCH. Prompt versioning adapts this:

PartSoftwarePrompts
MAJORBreaking API changeFundamentally different behavior, changed output schema
MINORNew capabilityAdded new capability, improved quality without behavior change
PATCHBug fixFixed typo, clarified ambiguous instruction, minor wording
import re
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
import hashlib
import json
from pathlib import Path
from datetime import datetime


class VersionBumpType(Enum):
PATCH = "patch"
MINOR = "minor"
MAJOR = "major"


@dataclass
class PromptVersion:
"""Immutable prompt version record."""
id: str # Template identifier: "support/main"
version: str # Semver: "2.1.3"
template: str # The actual prompt text
description: str # What this version does
variables: list[str] # Required variables
change_log: str # What changed from previous version
created_at: str # ISO timestamp
author: str # Who created this version
tags: list[str] = field(default_factory=list)
deprecated: bool = False
deprecation_reason: str = ""

@property
def content_hash(self) -> str:
return hashlib.sha256(self.template.encode()).hexdigest()[:16]

@property
def semver_tuple(self) -> tuple[int, int, int]:
parts = self.version.split(".")
return (int(parts[0]), int(parts[1]), int(parts[2]))

def render(self, **variables) -> str:
missing = [v for v in self.variables if v not in variables]
if missing:
raise ValueError(f"Missing variables for {self.id} v{self.version}: {missing}")
return self.template.format(**variables)

def bump_version(self, bump_type: VersionBumpType) -> str:
major, minor, patch = self.semver_tuple
if bump_type == VersionBumpType.MAJOR:
return f"{major + 1}.0.0"
elif bump_type == VersionBumpType.MINOR:
return f"{major}.{minor + 1}.0"
else:
return f"{major}.{minor}.{patch + 1}"


class PromptRegistry:
"""
Central registry for prompt versions.
Backed by the filesystem with git for history.
"""

def __init__(self, registry_dir: str = "prompt_registry"):
self.registry_dir = Path(registry_dir)
self.registry_dir.mkdir(parents=True, exist_ok=True)
self._cache: dict[str, PromptVersion] = {}

def _version_path(self, prompt_id: str, version: str) -> Path:
"""File path for a specific version."""
safe_id = prompt_id.replace("/", "__")
return self.registry_dir / safe_id / f"v{version}.json"

def _latest_pointer_path(self, prompt_id: str) -> Path:
safe_id = prompt_id.replace("/", "__")
return self.registry_dir / safe_id / "latest.json"

def publish(
self,
prompt_id: str,
template: str,
description: str,
variables: list[str],
change_log: str,
author: str,
bump_type: VersionBumpType = VersionBumpType.MINOR,
tags: list[str] = None,
) -> PromptVersion:
"""Publish a new version of a prompt."""

# Determine version number
try:
current = self.load(prompt_id, "latest")
new_version = current.bump_version(bump_type)
except FileNotFoundError:
new_version = "1.0.0" # First version

version = PromptVersion(
id=prompt_id,
version=new_version,
template=template,
description=description,
variables=variables,
change_log=change_log,
created_at=datetime.utcnow().isoformat(),
author=author,
tags=tags or [],
)

# Save version file
path = self._version_path(prompt_id, new_version)
path.parent.mkdir(parents=True, exist_ok=True)

with open(path, "w") as f:
json.dump({
"id": version.id,
"version": version.version,
"template": version.template,
"description": version.description,
"variables": version.variables,
"change_log": version.change_log,
"created_at": version.created_at,
"author": version.author,
"tags": version.tags,
"content_hash": version.content_hash,
"deprecated": version.deprecated,
"deprecation_reason": version.deprecation_reason,
}, f, indent=2)

# Update latest pointer
with open(self._latest_pointer_path(prompt_id), "w") as f:
json.dump({"version": new_version}, f)

# Invalidate cache
self._cache.pop(f"{prompt_id}:latest", None)

print(f"Published {prompt_id} v{new_version}")
return version

def load(self, prompt_id: str, version: str = "latest") -> PromptVersion:
"""Load a specific version."""
cache_key = f"{prompt_id}:{version}"

if cache_key in self._cache:
return self._cache[cache_key]

if version == "latest":
ptr_path = self._latest_pointer_path(prompt_id)
if not ptr_path.exists():
raise FileNotFoundError(f"Prompt '{prompt_id}' not found")
with open(ptr_path) as f:
version = json.load(f)["version"]

path = self._version_path(prompt_id, version)
if not path.exists():
raise FileNotFoundError(f"Prompt '{prompt_id}' v{version} not found")

with open(path) as f:
data = json.load(f)

pv = PromptVersion(
id=data["id"],
version=data["version"],
template=data["template"],
description=data["description"],
variables=data["variables"],
change_log=data["change_log"],
created_at=data["created_at"],
author=data["author"],
tags=data.get("tags", []),
deprecated=data.get("deprecated", False),
deprecation_reason=data.get("deprecation_reason", ""),
)

self._cache[cache_key] = pv
return pv

def list_versions(self, prompt_id: str) -> list[str]:
safe_id = prompt_id.replace("/", "__")
prompt_dir = self.registry_dir / safe_id
if not prompt_dir.exists():
return []
versions = []
for f in prompt_dir.glob("v*.json"):
v = f.stem[1:] # strip 'v' prefix
if re.match(r'^\d+\.\d+\.\d+$', v):
versions.append(v)
return sorted(versions, key=lambda v: tuple(int(x) for x in v.split('.')))

def deprecate(self, prompt_id: str, version: str, reason: str) -> None:
"""Mark a version as deprecated."""
path = self._version_path(prompt_id, version)
with open(path) as f:
data = json.load(f)
data["deprecated"] = True
data["deprecation_reason"] = reason
with open(path, "w") as f:
json.dump(data, f, indent=2)
self._cache.pop(f"{prompt_id}:{version}", None)

A/B Testing Prompt Versions

A/B testing prompts requires careful experiment design: control group (current version), treatment group (new version), and a metric you're optimizing for.

import hashlib
import anthropic
from dataclasses import dataclass
from typing import Callable
import random


client = anthropic.Anthropic()


@dataclass
class PromptExperiment:
"""An A/B test between two prompt versions."""
experiment_id: str
control_version: str # e.g., "1.2.0"
treatment_version: str # e.g., "2.0.0"
traffic_split: float # 0.0 to 1.0 - fraction seeing treatment
prompt_id: str


@dataclass
class ExperimentResult:
experiment_id: str
version_shown: str
user_id: str
session_id: str
# Metrics to track
response_accepted: Optional[bool] = None
follow_up_count: int = 0
session_duration_seconds: int = 0
explicit_feedback: Optional[str] = None # "thumbs_up", "thumbs_down"


from typing import Optional


class PromptExperimenter:
"""
Manages A/B experiments on prompt versions.
Uses deterministic bucket assignment so the same user
always sees the same variant within an experiment.
"""

def __init__(self, registry: PromptRegistry):
self.registry = registry
self.results: list[ExperimentResult] = []

def _assign_variant(
self,
experiment: PromptExperiment,
user_id: str,
) -> str:
"""
Deterministically assign a user to a variant.
Same user_id + experiment_id always → same variant.
"""
hash_input = f"{experiment.experiment_id}:{user_id}"
hash_val = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
bucket = (hash_val % 10000) / 10000.0 # 0.0 to 1.0

if bucket < experiment.traffic_split:
return experiment.treatment_version
return experiment.control_version

def get_prompt_for_user(
self,
experiment: PromptExperiment,
user_id: str,
variables: dict,
) -> tuple[str, str]: # (rendered_prompt, version)
"""Get the right prompt version for this user in this experiment."""
version = self._assign_variant(experiment, user_id)
prompt = self.registry.load(experiment.prompt_id, version)

if prompt.deprecated:
# Safety: don't serve deprecated prompts
version = experiment.control_version
prompt = self.registry.load(experiment.prompt_id, version)

return prompt.render(**variables), version

def record_result(self, result: ExperimentResult) -> None:
self.results.append(result)

def analyze(self, experiment_id: str) -> dict:
"""Compute per-variant metrics."""
exp_results = [r for r in self.results if r.experiment_id == experiment_id]

if not exp_results:
return {"error": "No results for experiment"}

# Group by version
by_version: dict[str, list[ExperimentResult]] = {}
for r in exp_results:
by_version.setdefault(r.version_shown, []).append(r)

analysis = {}
for version, results in by_version.items():
n = len(results)
accepted = [r for r in results if r.response_accepted is True]
thumbs_up = [r for r in results if r.explicit_feedback == "thumbs_up"]
thumbs_down = [r for r in results if r.explicit_feedback == "thumbs_down"]

analysis[version] = {
"n": n,
"acceptance_rate": len(accepted) / n if n > 0 else 0,
"thumbs_up_rate": len(thumbs_up) / n if n > 0 else 0,
"thumbs_down_rate": len(thumbs_down) / n if n > 0 else 0,
"avg_follow_ups": sum(r.follow_up_count for r in results) / n if n > 0 else 0,
"avg_session_duration": sum(r.session_duration_seconds for r in results) / n if n > 0 else 0,
}

return analysis

def statistical_significance(
self,
experiment_id: str,
metric: str = "acceptance_rate",
control_version: str = None,
) -> dict:
"""
Two-proportion z-test for statistical significance.
p < 0.05 = statistically significant difference.
"""
import math

analysis = self.analyze(experiment_id)
versions = list(analysis.keys())

if len(versions) < 2:
return {"error": "Need at least 2 variants"}

exp_results = [r for r in self.results if r.experiment_id == experiment_id]
by_version = {}
for r in exp_results:
by_version.setdefault(r.version_shown, []).append(r)

# Get control and treatment
ctrl_v = control_version or versions[0]
treat_v = versions[1] if versions[0] == ctrl_v else versions[0]

ctrl = analysis[ctrl_v]
treat = analysis[treat_v]

p1 = ctrl[metric]
p2 = treat[metric]
n1 = ctrl["n"]
n2 = treat["n"]

if n1 == 0 or n2 == 0:
return {"error": "Insufficient data"}

# Pooled proportion
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

if se == 0:
return {"error": "Zero standard error (no variance)"}

z = (p2 - p1) / se
# Approximate two-tailed p-value
p_value = 2 * (1 - _normal_cdf(abs(z)))

return {
"control": ctrl_v,
"treatment": treat_v,
"control_metric": p1,
"treatment_metric": p2,
"relative_lift": (p2 - p1) / p1 if p1 > 0 else 0,
"z_score": z,
"p_value": p_value,
"statistically_significant": p_value < 0.05,
"metric": metric,
}


def _normal_cdf(z: float) -> float:
"""Approximate normal CDF using error function."""
import math
return 0.5 * (1 + math.erf(z / math.sqrt(2)))

Deployment Pinning and Rollback

In production, services should pin to specific prompt versions - never latest. This ensures reproducibility and enables instant rollback.

from dataclasses import dataclass
from typing import Optional
import json
from pathlib import Path
import anthropic

client = anthropic.Anthropic()


@dataclass
class DeploymentConfig:
"""
Maps services to pinned prompt versions.
This file is committed to git and reviewed like code.
"""
service: str
environment: str # "production", "staging", "development"
prompt_pins: dict # prompt_id → pinned version


class DeploymentManager:
"""Manages prompt version deployment across environments."""

def __init__(self, registry: PromptRegistry, config_path: str = "prompt_deployment.json"):
self.registry = registry
self.config_path = Path(config_path)
self._config: dict = self._load_config()

def _load_config(self) -> dict:
if self.config_path.exists():
with open(self.config_path) as f:
return json.load(f)
return {"environments": {}}

def _save_config(self) -> None:
with open(self.config_path, "w") as f:
json.dump(self._config, f, indent=2)

def pin(
self,
environment: str,
prompt_id: str,
version: str,
author: str,
reason: str,
) -> None:
"""Pin a prompt version for an environment."""
# Verify the version exists
self.registry.load(prompt_id, version)

env_config = self._config["environments"].setdefault(environment, {})
env_config.setdefault("pins", {})[prompt_id] = {
"version": version,
"pinned_at": datetime.utcnow().isoformat(),
"pinned_by": author,
"reason": reason,
}
self._save_config()
print(f"Pinned {prompt_id} to v{version} in {environment}")

def get_pinned_version(self, environment: str, prompt_id: str) -> str:
"""Get the pinned version for a prompt in an environment."""
try:
return self._config["environments"][environment]["pins"][prompt_id]["version"]
except KeyError:
raise ValueError(f"No pin for {prompt_id} in {environment}")

def get_prompt(self, environment: str, prompt_id: str, **variables) -> str:
"""Get the rendered, pinned prompt for a given environment."""
version = self.get_pinned_version(environment, prompt_id)
prompt = self.registry.load(prompt_id, version)
return prompt.render(**variables)

def rollback(
self,
environment: str,
prompt_id: str,
author: str,
reason: str = "rollback",
) -> str:
"""Roll back to the previous pinned version."""
env_pins = self._config["environments"].get(environment, {}).get("pins", {})

if prompt_id not in env_pins:
raise ValueError(f"No deployment history for {prompt_id} in {environment}")

# Find available versions
available = self.registry.list_versions(prompt_id)
current = env_pins[prompt_id]["version"]

current_idx = available.index(current) if current in available else -1
if current_idx <= 0:
raise ValueError(f"No previous version to roll back to for {prompt_id}")

previous = available[current_idx - 1]
self.pin(environment, prompt_id, previous, author, f"Rollback: {reason}")
print(f"Rolled back {prompt_id} from v{current} to v{previous} in {environment}")
return previous

def diff(self, prompt_id: str, version_a: str, version_b: str) -> dict:
"""Show differences between two versions."""
import difflib

a = self.registry.load(prompt_id, version_a)
b = self.registry.load(prompt_id, version_b)

diff = list(difflib.unified_diff(
a.template.splitlines(keepends=True),
b.template.splitlines(keepends=True),
fromfile=f"v{version_a}",
tofile=f"v{version_b}",
))

return {
"prompt_id": prompt_id,
"from_version": version_a,
"to_version": version_b,
"diff": "".join(diff),
"description_change": f"{a.description}{b.description}",
"variables_added": list(set(b.variables) - set(a.variables)),
"variables_removed": list(set(a.variables) - set(b.variables)),
}

Prompt Change Tracking

Every LLM call should record which prompt version was used. This enables post-hoc analysis: "Why did quality drop on March 14th?" → "Prompt v2.3.1 was deployed that morning."

import anthropic
from dataclasses import dataclass
from typing import Optional
import uuid
from datetime import datetime

client = anthropic.Anthropic()


@dataclass
class TrackedLLMCall:
"""Record of a single LLM call with prompt provenance."""
call_id: str
prompt_id: str
prompt_version: str
prompt_hash: str # Hash of rendered prompt
model: str
user_id: Optional[str]
session_id: str
request_tokens: int
response_tokens: int
latency_ms: float
response_text: str
timestamp: str
experiment_id: Optional[str] = None
quality_score: Optional[float] = None


class TrackedLLMClient:
"""
Wraps anthropic.Anthropic() with automatic call tracking.
Every call records prompt version, tokens, latency.
"""

def __init__(
self,
registry: PromptRegistry,
deployment: DeploymentManager,
environment: str = "production",
):
self.client = anthropic.Anthropic()
self.registry = registry
self.deployment = deployment
self.environment = environment
self.call_log: list[TrackedLLMCall] = []

def call(
self,
prompt_id: str,
user_message: str,
variables: dict,
user_id: Optional[str] = None,
session_id: Optional[str] = None,
model: str = "claude-opus-4-6",
max_tokens: int = 500,
experiment_id: Optional[str] = None,
) -> tuple[str, str]: # (response, call_id)
"""Make a tracked LLM call using a versioned prompt."""

import time
import hashlib

# Get pinned prompt
version = self.deployment.get_pinned_version(self.environment, prompt_id)
prompt_obj = self.registry.load(prompt_id, version)
system = prompt_obj.render(**variables)

# Track the call
call_id = str(uuid.uuid4())
session_id = session_id or str(uuid.uuid4())

start = time.time()
response = self.client.messages.create(
model=model,
max_tokens=max_tokens,
system=system,
messages=[{"role": "user", "content": user_message}]
)
latency_ms = (time.time() - start) * 1000

response_text = response.content[0].text

record = TrackedLLMCall(
call_id=call_id,
prompt_id=prompt_id,
prompt_version=version,
prompt_hash=hashlib.sha256(system.encode()).hexdigest()[:12],
model=model,
user_id=user_id,
session_id=session_id,
request_tokens=response.usage.input_tokens,
response_tokens=response.usage.output_tokens,
latency_ms=latency_ms,
response_text=response_text,
timestamp=datetime.utcnow().isoformat(),
experiment_id=experiment_id,
)
self.call_log.append(record)

return response_text, call_id

def analyze_by_version(self, prompt_id: str) -> dict:
"""Analyze call metrics grouped by prompt version."""
calls = [c for c in self.call_log if c.prompt_id == prompt_id]
by_version: dict[str, list[TrackedLLMCall]] = {}

for call in calls:
by_version.setdefault(call.prompt_version, []).append(call)

analysis = {}
for version, version_calls in by_version.items():
n = len(version_calls)
total_tokens = sum(c.request_tokens + c.response_tokens for c in version_calls)
scored = [c for c in version_calls if c.quality_score is not None]

analysis[version] = {
"n_calls": n,
"avg_latency_ms": sum(c.latency_ms for c in version_calls) / n,
"avg_total_tokens": total_tokens / n,
"avg_quality": sum(c.quality_score for c in scored) / len(scored) if scored else None,
}

return analysis

Prompt Review Workflow

Regression Testing

Before promoting a new version, verify it doesn't regress on known-good cases.

import anthropic
from dataclasses import dataclass
from typing import Callable


client = anthropic.Anthropic()


@dataclass
class RegressionTestCase:
"""A test case that must pass for any version of a prompt."""
case_id: str
input_variables: dict
user_message: str
quality_check: Callable[[str], bool] # Returns True if response is acceptable
quality_description: str # Human-readable description of the check


class PromptRegressionSuite:
"""Suite of regression tests that must pass before any deployment."""

def __init__(self, prompt_id: str, test_cases: list[RegressionTestCase]):
self.prompt_id = prompt_id
self.test_cases = test_cases
self.client = anthropic.Anthropic()

def run_against_version(
self,
registry: PromptRegistry,
version: str,
) -> dict:
"""Run all regression tests against a specific version."""
prompt = registry.load(self.prompt_id, version)
results = []

for tc in self.test_cases:
system = prompt.render(**tc.input_variables)

response = self.client.messages.create(
model="claude-opus-4-6",
max_tokens=400,
system=system,
messages=[{"role": "user", "content": tc.user_message}]
)
response_text = response.content[0].text

passed = tc.quality_check(response_text)
results.append({
"case_id": tc.case_id,
"passed": passed,
"check": tc.quality_description,
"response": response_text[:200] + "..." if len(response_text) > 200 else response_text,
})

total = len(results)
passed = sum(1 for r in results if r["passed"])

return {
"prompt_id": self.prompt_id,
"version": version,
"total": total,
"passed": passed,
"failed": total - passed,
"pass_rate": passed / total if total > 0 else 0,
"deployable": passed == total, # Require 100% pass rate
"results": results,
}

def compare_versions(
self,
registry: PromptRegistry,
version_a: str,
version_b: str,
) -> dict:
"""Compare regression results between two versions."""
results_a = self.run_against_version(registry, version_a)
results_b = self.run_against_version(registry, version_b)

# Find regressions: cases that passed in A but fail in B
pass_a = {r["case_id"] for r in results_a["results"] if r["passed"]}
pass_b = {r["case_id"] for r in results_b["results"] if r["passed"]}

regressions = pass_a - pass_b
improvements = pass_b - pass_a

return {
"control": version_a,
"treatment": version_b,
"control_pass_rate": results_a["pass_rate"],
"treatment_pass_rate": results_b["pass_rate"],
"regressions": list(regressions),
"improvements": list(improvements),
"safe_to_deploy": len(regressions) == 0,
}


# Example regression suite for a support prompt
support_regression = PromptRegressionSuite(
prompt_id="support/main",
test_cases=[
RegressionTestCase(
case_id="free_user_upgrade_mention",
input_variables={"assistant_name": "Nova", "product_name": "Acme", "user_tier": "free", "tone": "friendly"},
user_message="How do I export my data in CSV format?",
quality_check=lambda r: "pro" in r.lower() or "upgrade" in r.lower() or "advanced" in r.lower(),
quality_description="Free user should be mentioned upgrade path for advanced export",
),
RegressionTestCase(
case_id="no_legal_advice",
input_variables={"assistant_name": "Nova", "product_name": "Acme", "user_tier": "pro", "tone": "professional"},
user_message="Is it legal to use your platform for HIPAA-covered data?",
quality_check=lambda r: any(w in r.lower() for w in ["attorney", "lawyer", "legal counsel", "consult"]),
quality_description="Should recommend consulting legal counsel for HIPAA questions",
),
]
)

Common Mistakes

:::danger Editing Prompts In-Place Never edit a published prompt version. Create a new version. The old version must remain accessible for rollback and for understanding why historical outputs look the way they do. "Editing in place" is the equivalent of deploying code with no git history. :::

:::danger Deploying to Production Without Staging The prompt that looks perfect in your notebook can behave differently in production due to real-world input diversity. Always test against real traffic in staging (shadow mode) before touching production traffic. Silent regressions on edge cases are invisible without real traffic exposure. :::

:::warning Not Tracking Prompt Versions in Call Logs If you don't log which prompt version generated each response, you can't do post-hoc analysis. When quality drops, you need to know: "Did this start when we deployed v2.3.1?" Without prompt version in your call logs, you're debugging in the dark. :::

:::tip Prompt Changes Require Review Like Code Changes Significant prompt changes - especially MAJOR version bumps - should require peer review. Prompts are instructions that affect every user interaction. A poorly reviewed change can have as much impact as a poorly reviewed code change. Build this into your engineering culture. :::

Interview Q&A

Q: What does "prompt-as-code" mean and why does it matter?

A: Prompt-as-code means applying the same engineering discipline to prompts that we apply to software: version control with full history, immutable published versions, automated testing, code review for significant changes, deployment pipelines with staging environments, and rollback capabilities. It matters because prompts have the same consequence profile as code: a bad prompt deployed to production affects every user interaction until it's fixed. The asymmetry with traditional code: prompts look harmless (they're just text), so teams often skip the engineering rigor - and then are surprised when they can't roll back or explain why quality changed.

Q: How would you version prompts and what version bump criteria would you use?

A: Use semantic versioning adapted for prompts: MAJOR for changes that fundamentally alter behavior or output schema (a response that used to be JSON now returns markdown - this is a breaking change for consumers), MINOR for improvements that don't break existing behavior (added a new capability, improved quality without changing format), PATCH for tiny fixes (corrected a typo, clarified an ambiguous instruction). Every version is immutable - once published, a version file is never modified. A latest pointer file is updated on each publish and can be overridden for rollback.

Q: How do you A/B test two prompt versions safely?

A: Deterministic bucket assignment: hash(user_id + experiment_id) to assign users to variants. The same user always sees the same variant - this avoids showing inconsistent behavior within a session. Traffic split: start with 5-10% treatment, never jump to 50/50 immediately. Metric selection: pick one primary metric (e.g., session engagement rate) and measure for statistically significant sample sizes (typically 48-96 hours minimum). Two-proportion z-test tells you if the difference is real. Key safety: monitor for extreme negative signals (rejection rate spike, feedback volume surge) that would trigger automatic rollback before the experiment completes.

Q: What should you log for every LLM call to enable prompt performance analysis?

A: Minimum: prompt_id, prompt_version, content_hash (hash of the rendered prompt for exact reproduction), model, request_tokens, response_tokens, latency_ms, user_id, session_id, timestamp. With these fields, you can group calls by prompt version and compare quality metrics over time. When a regression is detected, you can find exactly when it started, which version caused it, and roll back. The content_hash is important for detecting A/B experiment contamination - if two calls have the same prompt_version but different content_hashes, a variable changed, and they're not truly comparable.

Q: How do you handle prompt rollback in a production emergency?

A: Three-step process: First, the deployment config (a git-tracked JSON file) maps environments to pinned prompt versions. Rolling back means updating the pin to the previous version and reloading the config - this takes under 60 seconds and requires no code deployment. Second, if the rollback target is unknown, the call logs tell you which version was in production before the problematic one - you look up the timeline. Third, after the immediate rollback, do a post-mortem: what did the bad version change, why did tests miss it, what regression test should be added to prevent recurrence.

© 2026 EngineersOfAI. All rights reserved.