What is prompt versioning?

Treat prompts as code - semantic versioning, A/B testing, rollback strategies, and prompt registries for production LLM systems.

How does prompt management work in practice?

Prompt Versioning and Management covers prompt versioning, prompt management, prompt as code from first principles with code examples. Free lesson at https://engineersofai.com/docs/ai-engineering/prompt-engineering/prompt-versioning-and-management

What is the difference between prompt versioning and prompt as code?

See the full breakdown at https://engineersofai.com/docs/ai-engineering/prompt-engineering/prompt-versioning-and-management

:::tip 🎮 Interactive Playground Visualize this concept: Try the Prompt Version Management demo on the EngineersOfAI Playground - no code required. :::

Prompt Versioning and Management

The Silent Regression

It started with a seemingly innocuous request: the product team wanted to make the AI assistant "sound more professional." The engineer on call updated the system prompt, changed a few phrases from casual to formal, and deployed. No tests to run - it was just text changes. No review required - who reviews prompt word choices? The change went live in 20 minutes.

Two weeks later, a senior data scientist noticed something in the analytics: user session length had dropped 18% and the "thumbs up" rate had fallen from 74% to 61%. No code had changed. No infrastructure had changed. What had changed?

It took three days of investigation to trace the cause. The "more professional" tone had inadvertently removed examples of natural conversational follow-ups. Users were interpreting the formal responses as conversation-enders rather than conversation-continuers. They'd ask one question, get a stiff response, and leave instead of exploring further.

The fix was simple: revert to the previous prompt. But the previous prompt no longer existed anywhere. The engineer had edited the string in place. The old version was gone. The team spent four hours trying to reconstruct it from memory, Slack messages, and browser history. They got close, but not exact. The metrics never quite returned to the prior baseline.

This is why prompt versioning exists. Not because prompts are code in the traditional sense - but because their effects are as consequential as code, and they need the same discipline.

The Prompt-as-Code Philosophy

Treating prompts as code means applying software engineering discipline to prompt management:

Version control: Every change is tracked with who made it, when, and why
Immutability: Published versions are never modified - you create new versions
Traceability: Every LLM call is linked to the exact prompt version that generated it
Rollback: Any version can be restored instantly
Testing: Changes go through automated tests before production
Review: Significant prompt changes require review like significant code changes

Semantic Versioning for Prompts

Software uses semver: MAJOR.MINOR.PATCH. Prompt versioning adapts this:

Part	Software	Prompts
MAJOR	Breaking API change	Fundamentally different behavior, changed output schema
MINOR	New capability	Added new capability, improved quality without behavior change
PATCH	Bug fix	Fixed typo, clarified ambiguous instruction, minor wording

import re
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
import hashlib
import json
from pathlib import Path
from datetime import datetime


class VersionBumpType(Enum):
    PATCH = "patch"
    MINOR = "minor"
    MAJOR = "major"


@dataclass
class PromptVersion:
    """Immutable prompt version record."""
    id: str                  # Template identifier: "support/main"
    version: str             # Semver: "2.1.3"
    template: str            # The actual prompt text
    description: str         # What this version does
    variables: list[str]     # Required variables
    change_log: str          # What changed from previous version
    created_at: str          # ISO timestamp
    author: str              # Who created this version
    tags: list[str] = field(default_factory=list)
    deprecated: bool = False
    deprecation_reason: str = ""

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.template.encode()).hexdigest()[:16]

    @property
    def semver_tuple(self) -> tuple[int, int, int]:
        parts = self.version.split(".")
        return (int(parts[0]), int(parts[1]), int(parts[2]))

    def render(self, **variables) -> str:
        missing = [v for v in self.variables if v not in variables]
        if missing:
            raise ValueError(f"Missing variables for {self.id} v{self.version}: {missing}")
        return self.template.format(**variables)

    def bump_version(self, bump_type: VersionBumpType) -> str:
        major, minor, patch = self.semver_tuple
        if bump_type == VersionBumpType.MAJOR:
            return f"{major + 1}.0.0"
        elif bump_type == VersionBumpType.MINOR:
            return f"{major}.{minor + 1}.0"
        else:
            return f"{major}.{minor}.{patch + 1}"


class PromptRegistry:
    """
    Central registry for prompt versions.
    Backed by the filesystem with git for history.
    """

    def __init__(self, registry_dir: str = "prompt_registry"):
        self.registry_dir = Path(registry_dir)
        self.registry_dir.mkdir(parents=True, exist_ok=True)
        self._cache: dict[str, PromptVersion] = {}

    def _version_path(self, prompt_id: str, version: str) -> Path:
        """File path for a specific version."""
        safe_id = prompt_id.replace("/", "__")
        return self.registry_dir / safe_id / f"v{version}.json"

    def _latest_pointer_path(self, prompt_id: str) -> Path:
        safe_id = prompt_id.replace("/", "__")
        return self.registry_dir / safe_id / "latest.json"

    def publish(
        self,
        prompt_id: str,
        template: str,
        description: str,
        variables: list[str],
        change_log: str,
        author: str,
        bump_type: VersionBumpType = VersionBumpType.MINOR,
        tags: list[str] = None,
    ) -> PromptVersion:
        """Publish a new version of a prompt."""

        # Determine version number
        try:
            current = self.load(prompt_id, "latest")
            new_version = current.bump_version(bump_type)
        except FileNotFoundError:
            new_version = "1.0.0"  # First version

        version = PromptVersion(
            id=prompt_id,
            version=new_version,
            template=template,
            description=description,
            variables=variables,
            change_log=change_log,
            created_at=datetime.utcnow().isoformat(),
            author=author,
            tags=tags or [],
        )

        # Save version file
        path = self._version_path(prompt_id, new_version)
        path.parent.mkdir(parents=True, exist_ok=True)

        with open(path, "w") as f:
            json.dump({
                "id": version.id,
                "version": version.version,
                "template": version.template,
                "description": version.description,
                "variables": version.variables,
                "change_log": version.change_log,
                "created_at": version.created_at,
                "author": version.author,
                "tags": version.tags,
                "content_hash": version.content_hash,
                "deprecated": version.deprecated,
                "deprecation_reason": version.deprecation_reason,
            }, f, indent=2)

        # Update latest pointer
        with open(self._latest_pointer_path(prompt_id), "w") as f:
            json.dump({"version": new_version}, f)

        # Invalidate cache
        self._cache.pop(f"{prompt_id}:latest", None)

        print(f"Published {prompt_id} v{new_version}")
        return version

    def load(self, prompt_id: str, version: str = "latest") -> PromptVersion:
        """Load a specific version."""
        cache_key = f"{prompt_id}:{version}"

        if cache_key in self._cache:
            return self._cache[cache_key]

        if version == "latest":
            ptr_path = self._latest_pointer_path(prompt_id)
            if not ptr_path.exists():
                raise FileNotFoundError(f"Prompt '{prompt_id}' not found")
            with open(ptr_path) as f:
                version = json.load(f)["version"]

        path = self._version_path(prompt_id, version)
        if not path.exists():
            raise FileNotFoundError(f"Prompt '{prompt_id}' v{version} not found")

        with open(path) as f:
            data = json.load(f)

        pv = PromptVersion(
            id=data["id"],
            version=data["version"],
            template=data["template"],
            description=data["description"],
            variables=data["variables"],
            change_log=data["change_log"],
            created_at=data["created_at"],
            author=data["author"],
            tags=data.get("tags", []),
            deprecated=data.get("deprecated", False),
            deprecation_reason=data.get("deprecation_reason", ""),
        )

        self._cache[cache_key] = pv
        return pv

    def list_versions(self, prompt_id: str) -> list[str]:
        safe_id = prompt_id.replace("/", "__")
        prompt_dir = self.registry_dir / safe_id
        if not prompt_dir.exists():
            return []
        versions = []
        for f in prompt_dir.glob("v*.json"):
            v = f.stem[1:]  # strip 'v' prefix
            if re.match(r'^\d+\.\d+\.\d+$', v):
                versions.append(v)
        return sorted(versions, key=lambda v: tuple(int(x) for x in v.split('.')))

    def deprecate(self, prompt_id: str, version: str, reason: str) -> None:
        """Mark a version as deprecated."""
        path = self._version_path(prompt_id, version)
        with open(path) as f:
            data = json.load(f)
        data["deprecated"] = True
        data["deprecation_reason"] = reason
        with open(path, "w") as f:
            json.dump(data, f, indent=2)
        self._cache.pop(f"{prompt_id}:{version}", None)

A/B Testing Prompt Versions

A/B testing prompts requires careful experiment design: control group (current version), treatment group (new version), and a metric you're optimizing for.

import hashlib
import anthropic
from dataclasses import dataclass
from typing import Callable
import random


client = anthropic.Anthropic()


@dataclass
class PromptExperiment:
    """An A/B test between two prompt versions."""
    experiment_id: str
    control_version: str     # e.g., "1.2.0"
    treatment_version: str   # e.g., "2.0.0"
    traffic_split: float     # 0.0 to 1.0 - fraction seeing treatment
    prompt_id: str


@dataclass
class ExperimentResult:
    experiment_id: str
    version_shown: str
    user_id: str
    session_id: str
    # Metrics to track
    response_accepted: Optional[bool] = None
    follow_up_count: int = 0
    session_duration_seconds: int = 0
    explicit_feedback: Optional[str] = None  # "thumbs_up", "thumbs_down"


from typing import Optional


class PromptExperimenter:
    """
    Manages A/B experiments on prompt versions.
    Uses deterministic bucket assignment so the same user
    always sees the same variant within an experiment.
    """

    def __init__(self, registry: PromptRegistry):
        self.registry = registry
        self.results: list[ExperimentResult] = []

    def _assign_variant(
        self,
        experiment: PromptExperiment,
        user_id: str,
    ) -> str:
        """
        Deterministically assign a user to a variant.
        Same user_id + experiment_id always → same variant.
        """
        hash_input = f"{experiment.experiment_id}:{user_id}"
        hash_val = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_val % 10000) / 10000.0  # 0.0 to 1.0

        if bucket < experiment.traffic_split:
            return experiment.treatment_version
        return experiment.control_version

    def get_prompt_for_user(
        self,
        experiment: PromptExperiment,
        user_id: str,
        variables: dict,
    ) -> tuple[str, str]:  # (rendered_prompt, version)
        """Get the right prompt version for this user in this experiment."""
        version = self._assign_variant(experiment, user_id)
        prompt = self.registry.load(experiment.prompt_id, version)

        if prompt.deprecated:
            # Safety: don't serve deprecated prompts
            version = experiment.control_version
            prompt = self.registry.load(experiment.prompt_id, version)

        return prompt.render(**variables), version

    def record_result(self, result: ExperimentResult) -> None:
        self.results.append(result)

    def analyze(self, experiment_id: str) -> dict:
        """Compute per-variant metrics."""
        exp_results = [r for r in self.results if r.experiment_id == experiment_id]

        if not exp_results:
            return {"error": "No results for experiment"}

        # Group by version
        by_version: dict[str, list[ExperimentResult]] = {}
        for r in exp_results:
            by_version.setdefault(r.version_shown, []).append(r)

        analysis = {}
        for version, results in by_version.items():
            n = len(results)
            accepted = [r for r in results if r.response_accepted is True]
            thumbs_up = [r for r in results if r.explicit_feedback == "thumbs_up"]
            thumbs_down = [r for r in results if r.explicit_feedback == "thumbs_down"]

            analysis[version] = {
                "n": n,
                "acceptance_rate": len(accepted) / n if n > 0 else 0,
                "thumbs_up_rate": len(thumbs_up) / n if n > 0 else 0,
                "thumbs_down_rate": len(thumbs_down) / n if n > 0 else 0,
                "avg_follow_ups": sum(r.follow_up_count for r in results) / n if n > 0 else 0,
                "avg_session_duration": sum(r.session_duration_seconds for r in results) / n if n > 0 else 0,
            }

        return analysis

    def statistical_significance(
        self,
        experiment_id: str,
        metric: str = "acceptance_rate",
        control_version: str = None,
    ) -> dict:
        """
        Two-proportion z-test for statistical significance.
        p < 0.05 = statistically significant difference.
        """
        import math

        analysis = self.analyze(experiment_id)
        versions = list(analysis.keys())

        if len(versions) < 2:
            return {"error": "Need at least 2 variants"}

        exp_results = [r for r in self.results if r.experiment_id == experiment_id]
        by_version = {}
        for r in exp_results:
            by_version.setdefault(r.version_shown, []).append(r)

        # Get control and treatment
        ctrl_v = control_version or versions[0]
        treat_v = versions[1] if versions[0] == ctrl_v else versions[0]

        ctrl = analysis[ctrl_v]
        treat = analysis[treat_v]

        p1 = ctrl[metric]
        p2 = treat[metric]
        n1 = ctrl["n"]
        n2 = treat["n"]

        if n1 == 0 or n2 == 0:
            return {"error": "Insufficient data"}

        # Pooled proportion
        p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

        if se == 0:
            return {"error": "Zero standard error (no variance)"}

        z = (p2 - p1) / se
        # Approximate two-tailed p-value
        p_value = 2 * (1 - _normal_cdf(abs(z)))

        return {
            "control": ctrl_v,
            "treatment": treat_v,
            "control_metric": p1,
            "treatment_metric": p2,
            "relative_lift": (p2 - p1) / p1 if p1 > 0 else 0,
            "z_score": z,
            "p_value": p_value,
            "statistically_significant": p_value < 0.05,
            "metric": metric,
        }


def _normal_cdf(z: float) -> float:
    """Approximate normal CDF using error function."""
    import math
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

Deployment Pinning and Rollback

In production, services should pin to specific prompt versions - never latest. This ensures reproducibility and enables instant rollback.

from dataclasses import dataclass
from typing import Optional
import json
from pathlib import Path
import anthropic

client = anthropic.Anthropic()


@dataclass
class DeploymentConfig:
    """
    Maps services to pinned prompt versions.
    This file is committed to git and reviewed like code.
    """
    service: str
    environment: str     # "production", "staging", "development"
    prompt_pins: dict    # prompt_id → pinned version


class DeploymentManager:
    """Manages prompt version deployment across environments."""

    def __init__(self, registry: PromptRegistry, config_path: str = "prompt_deployment.json"):
        self.registry = registry
        self.config_path = Path(config_path)
        self._config: dict = self._load_config()

    def _load_config(self) -> dict:
        if self.config_path.exists():
            with open(self.config_path) as f:
                return json.load(f)
        return {"environments": {}}

    def _save_config(self) -> None:
        with open(self.config_path, "w") as f:
            json.dump(self._config, f, indent=2)

    def pin(
        self,
        environment: str,
        prompt_id: str,
        version: str,
        author: str,
        reason: str,
    ) -> None:
        """Pin a prompt version for an environment."""
        # Verify the version exists
        self.registry.load(prompt_id, version)

        env_config = self._config["environments"].setdefault(environment, {})
        env_config.setdefault("pins", {})[prompt_id] = {
            "version": version,
            "pinned_at": datetime.utcnow().isoformat(),
            "pinned_by": author,
            "reason": reason,
        }
        self._save_config()
        print(f"Pinned {prompt_id} to v{version} in {environment}")

    def get_pinned_version(self, environment: str, prompt_id: str) -> str:
        """Get the pinned version for a prompt in an environment."""
        try:
            return self._config["environments"][environment]["pins"][prompt_id]["version"]
        except KeyError:
            raise ValueError(f"No pin for {prompt_id} in {environment}")

    def get_prompt(self, environment: str, prompt_id: str, **variables) -> str:
        """Get the rendered, pinned prompt for a given environment."""
        version = self.get_pinned_version(environment, prompt_id)
        prompt = self.registry.load(prompt_id, version)
        return prompt.render(**variables)

    def rollback(
        self,
        environment: str,
        prompt_id: str,
        author: str,
        reason: str = "rollback",
    ) -> str:
        """Roll back to the previous pinned version."""
        env_pins = self._config["environments"].get(environment, {}).get("pins", {})

        if prompt_id not in env_pins:
            raise ValueError(f"No deployment history for {prompt_id} in {environment}")

        # Find available versions
        available = self.registry.list_versions(prompt_id)
        current = env_pins[prompt_id]["version"]

        current_idx = available.index(current) if current in available else -1
        if current_idx <= 0:
            raise ValueError(f"No previous version to roll back to for {prompt_id}")

        previous = available[current_idx - 1]
        self.pin(environment, prompt_id, previous, author, f"Rollback: {reason}")
        print(f"Rolled back {prompt_id} from v{current} to v{previous} in {environment}")
        return previous

    def diff(self, prompt_id: str, version_a: str, version_b: str) -> dict:
        """Show differences between two versions."""
        import difflib

        a = self.registry.load(prompt_id, version_a)
        b = self.registry.load(prompt_id, version_b)

        diff = list(difflib.unified_diff(
            a.template.splitlines(keepends=True),
            b.template.splitlines(keepends=True),
            fromfile=f"v{version_a}",
            tofile=f"v{version_b}",
        ))

        return {
            "prompt_id": prompt_id,
            "from_version": version_a,
            "to_version": version_b,
            "diff": "".join(diff),
            "description_change": f"{a.description} → {b.description}",
            "variables_added": list(set(b.variables) - set(a.variables)),
            "variables_removed": list(set(a.variables) - set(b.variables)),
        }

Prompt Change Tracking

Every LLM call should record which prompt version was used. This enables post-hoc analysis: "Why did quality drop on March 14th?" → "Prompt v2.3.1 was deployed that morning."

import anthropic
from dataclasses import dataclass
from typing import Optional
import uuid
from datetime import datetime

client = anthropic.Anthropic()


@dataclass
class TrackedLLMCall:
    """Record of a single LLM call with prompt provenance."""
    call_id: str
    prompt_id: str
    prompt_version: str
    prompt_hash: str       # Hash of rendered prompt
    model: str
    user_id: Optional[str]
    session_id: str
    request_tokens: int
    response_tokens: int
    latency_ms: float
    response_text: str
    timestamp: str
    experiment_id: Optional[str] = None
    quality_score: Optional[float] = None


class TrackedLLMClient:
    """
    Wraps anthropic.Anthropic() with automatic call tracking.
    Every call records prompt version, tokens, latency.
    """

    def __init__(
        self,
        registry: PromptRegistry,
        deployment: DeploymentManager,
        environment: str = "production",
    ):
        self.client = anthropic.Anthropic()
        self.registry = registry
        self.deployment = deployment
        self.environment = environment
        self.call_log: list[TrackedLLMCall] = []

    def call(
        self,
        prompt_id: str,
        user_message: str,
        variables: dict,
        user_id: Optional[str] = None,
        session_id: Optional[str] = None,
        model: str = "claude-opus-4-6",
        max_tokens: int = 500,
        experiment_id: Optional[str] = None,
    ) -> tuple[str, str]:  # (response, call_id)
        """Make a tracked LLM call using a versioned prompt."""

        import time
        import hashlib

        # Get pinned prompt
        version = self.deployment.get_pinned_version(self.environment, prompt_id)
        prompt_obj = self.registry.load(prompt_id, version)
        system = prompt_obj.render(**variables)

        # Track the call
        call_id = str(uuid.uuid4())
        session_id = session_id or str(uuid.uuid4())

        start = time.time()
        response = self.client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user_message}]
        )
        latency_ms = (time.time() - start) * 1000

        response_text = response.content[0].text

        record = TrackedLLMCall(
            call_id=call_id,
            prompt_id=prompt_id,
            prompt_version=version,
            prompt_hash=hashlib.sha256(system.encode()).hexdigest()[:12],
            model=model,
            user_id=user_id,
            session_id=session_id,
            request_tokens=response.usage.input_tokens,
            response_tokens=response.usage.output_tokens,
            latency_ms=latency_ms,
            response_text=response_text,
            timestamp=datetime.utcnow().isoformat(),
            experiment_id=experiment_id,
        )
        self.call_log.append(record)

        return response_text, call_id

    def analyze_by_version(self, prompt_id: str) -> dict:
        """Analyze call metrics grouped by prompt version."""
        calls = [c for c in self.call_log if c.prompt_id == prompt_id]
        by_version: dict[str, list[TrackedLLMCall]] = {}

        for call in calls:
            by_version.setdefault(call.prompt_version, []).append(call)

        analysis = {}
        for version, version_calls in by_version.items():
            n = len(version_calls)
            total_tokens = sum(c.request_tokens + c.response_tokens for c in version_calls)
            scored = [c for c in version_calls if c.quality_score is not None]

            analysis[version] = {
                "n_calls": n,
                "avg_latency_ms": sum(c.latency_ms for c in version_calls) / n,
                "avg_total_tokens": total_tokens / n,
                "avg_quality": sum(c.quality_score for c in scored) / len(scored) if scored else None,
            }

        return analysis

Prompt Review Workflow

Regression Testing

Before promoting a new version, verify it doesn't regress on known-good cases.

import anthropic
from dataclasses import dataclass
from typing import Callable


client = anthropic.Anthropic()


@dataclass
class RegressionTestCase:
    """A test case that must pass for any version of a prompt."""
    case_id: str
    input_variables: dict
    user_message: str
    quality_check: Callable[[str], bool]   # Returns True if response is acceptable
    quality_description: str               # Human-readable description of the check


class PromptRegressionSuite:
    """Suite of regression tests that must pass before any deployment."""

    def __init__(self, prompt_id: str, test_cases: list[RegressionTestCase]):
        self.prompt_id = prompt_id
        self.test_cases = test_cases
        self.client = anthropic.Anthropic()

    def run_against_version(
        self,
        registry: PromptRegistry,
        version: str,
    ) -> dict:
        """Run all regression tests against a specific version."""
        prompt = registry.load(self.prompt_id, version)
        results = []

        for tc in self.test_cases:
            system = prompt.render(**tc.input_variables)

            response = self.client.messages.create(
                model="claude-opus-4-6",
                max_tokens=400,
                system=system,
                messages=[{"role": "user", "content": tc.user_message}]
            )
            response_text = response.content[0].text

            passed = tc.quality_check(response_text)
            results.append({
                "case_id": tc.case_id,
                "passed": passed,
                "check": tc.quality_description,
                "response": response_text[:200] + "..." if len(response_text) > 200 else response_text,
            })

        total = len(results)
        passed = sum(1 for r in results if r["passed"])

        return {
            "prompt_id": self.prompt_id,
            "version": version,
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": passed / total if total > 0 else 0,
            "deployable": passed == total,  # Require 100% pass rate
            "results": results,
        }

    def compare_versions(
        self,
        registry: PromptRegistry,
        version_a: str,
        version_b: str,
    ) -> dict:
        """Compare regression results between two versions."""
        results_a = self.run_against_version(registry, version_a)
        results_b = self.run_against_version(registry, version_b)

        # Find regressions: cases that passed in A but fail in B
        pass_a = {r["case_id"] for r in results_a["results"] if r["passed"]}
        pass_b = {r["case_id"] for r in results_b["results"] if r["passed"]}

        regressions = pass_a - pass_b
        improvements = pass_b - pass_a

        return {
            "control": version_a,
            "treatment": version_b,
            "control_pass_rate": results_a["pass_rate"],
            "treatment_pass_rate": results_b["pass_rate"],
            "regressions": list(regressions),
            "improvements": list(improvements),
            "safe_to_deploy": len(regressions) == 0,
        }


# Example regression suite for a support prompt
support_regression = PromptRegressionSuite(
    prompt_id="support/main",
    test_cases=[
        RegressionTestCase(
            case_id="free_user_upgrade_mention",
            input_variables={"assistant_name": "Nova", "product_name": "Acme", "user_tier": "free", "tone": "friendly"},
            user_message="How do I export my data in CSV format?",
            quality_check=lambda r: "pro" in r.lower() or "upgrade" in r.lower() or "advanced" in r.lower(),
            quality_description="Free user should be mentioned upgrade path for advanced export",
        ),
        RegressionTestCase(
            case_id="no_legal_advice",
            input_variables={"assistant_name": "Nova", "product_name": "Acme", "user_tier": "pro", "tone": "professional"},
            user_message="Is it legal to use your platform for HIPAA-covered data?",
            quality_check=lambda r: any(w in r.lower() for w in ["attorney", "lawyer", "legal counsel", "consult"]),
            quality_description="Should recommend consulting legal counsel for HIPAA questions",
        ),
    ]
)

Common Mistakes

:::danger Editing Prompts In-Place Never edit a published prompt version. Create a new version. The old version must remain accessible for rollback and for understanding why historical outputs look the way they do. "Editing in place" is the equivalent of deploying code with no git history. :::

:::danger Deploying to Production Without Staging The prompt that looks perfect in your notebook can behave differently in production due to real-world input diversity. Always test against real traffic in staging (shadow mode) before touching production traffic. Silent regressions on edge cases are invisible without real traffic exposure. :::

:::warning Not Tracking Prompt Versions in Call Logs If you don't log which prompt version generated each response, you can't do post-hoc analysis. When quality drops, you need to know: "Did this start when we deployed v2.3.1?" Without prompt version in your call logs, you're debugging in the dark. :::

:::tip Prompt Changes Require Review Like Code Changes Significant prompt changes - especially MAJOR version bumps - should require peer review. Prompts are instructions that affect every user interaction. A poorly reviewed change can have as much impact as a poorly reviewed code change. Build this into your engineering culture. :::

Interview Q&A

Q: What does "prompt-as-code" mean and why does it matter?

A: Prompt-as-code means applying the same engineering discipline to prompts that we apply to software: version control with full history, immutable published versions, automated testing, code review for significant changes, deployment pipelines with staging environments, and rollback capabilities. It matters because prompts have the same consequence profile as code: a bad prompt deployed to production affects every user interaction until it's fixed. The asymmetry with traditional code: prompts look harmless (they're just text), so teams often skip the engineering rigor - and then are surprised when they can't roll back or explain why quality changed.

Q: How would you version prompts and what version bump criteria would you use?

A: Use semantic versioning adapted for prompts: MAJOR for changes that fundamentally alter behavior or output schema (a response that used to be JSON now returns markdown - this is a breaking change for consumers), MINOR for improvements that don't break existing behavior (added a new capability, improved quality without changing format), PATCH for tiny fixes (corrected a typo, clarified an ambiguous instruction). Every version is immutable - once published, a version file is never modified. A latest pointer file is updated on each publish and can be overridden for rollback.

Q: How do you A/B test two prompt versions safely?

A: Deterministic bucket assignment: hash(user_id + experiment_id) to assign users to variants. The same user always sees the same variant - this avoids showing inconsistent behavior within a session. Traffic split: start with 5-10% treatment, never jump to 50/50 immediately. Metric selection: pick one primary metric (e.g., session engagement rate) and measure for statistically significant sample sizes (typically 48-96 hours minimum). Two-proportion z-test tells you if the difference is real. Key safety: monitor for extreme negative signals (rejection rate spike, feedback volume surge) that would trigger automatic rollback before the experiment completes.

Q: What should you log for every LLM call to enable prompt performance analysis?

A: Minimum: prompt_id, prompt_version, content_hash (hash of the rendered prompt for exact reproduction), model, request_tokens, response_tokens, latency_ms, user_id, session_id, timestamp. With these fields, you can group calls by prompt version and compare quality metrics over time. When a regression is detected, you can find exactly when it started, which version caused it, and roll back. The content_hash is important for detecting A/B experiment contamination - if two calls have the same prompt_version but different content_hashes, a variable changed, and they're not truly comparable.

Q: How do you handle prompt rollback in a production emergency?

A: Three-step process: First, the deployment config (a git-tracked JSON file) maps environments to pinned prompt versions. Rolling back means updating the pin to the previous version and reloading the config - this takes under 60 seconds and requires no code deployment. Second, if the rollback target is unknown, the call logs tell you which version was in production before the problematic one - you look up the timeline. Third, after the immediate rollback, do a post-mortem: what did the bad version change, why did tests miss it, what regression test should be added to prevent recurrence.

The Silent Regression​

The Prompt-as-Code Philosophy​

Semantic Versioning for Prompts​

A/B Testing Prompt Versions​

Deployment Pinning and Rollback​

Prompt Change Tracking​

Prompt Review Workflow​

Regression Testing​

Common Mistakes​

Interview Q&A​