CI/CD pipelines for LLM applications - handling non-deterministic outputs with LLM-judge gates, canary deployments with quality monitoring, automated rollback triggers, and a full GitHub Actions implementation.

LLM CI/CD

The Deploy That Broke Support

It is a Thursday afternoon in Q2 2024 at a customer success platform. An engineer merges a PR that upgrades the AI email summarization feature from one model version to a newer, cheaper alternative - supposedly faster, smarter, better on benchmarks. The unit tests pass. There are three of them: one checks that the output is a non-empty string, one checks the output is under 500 words, one checks that it contains the substring "summary." All three pass. The engineer deploys to production. Deployment takes eight minutes. The Grafana dashboard stays green. The engineer closes the laptop and goes to lunch.

Twenty-four hours later, the customer success Slack channel is on fire. The AI summaries are wrong. Not wrong in a way that triggers an alert - not empty, not too long, they definitely contain the word "summary." But wrong in the way that matters operationally: they are omitting action items. The new model, without any prompt changes, has different default behavior for structured extraction. It summarizes narrative text beautifully but silently drops the structured sections where action items live. Customer success managers have been sending incomplete handoff notes to clients for twenty-four hours. Clients are escalating. The support team has been re-reading emails manually to reconstruct what the AI missed.

The rollback is trivial - change one string in a config file, redeploy. Nine minutes total. But the damage to customer trust is done, and fixing it means reaching out to every affected client. The root cause of the incident was not the model upgrade - it was that the CI pipeline had three tests that were functionally useless for an LLM application. They tested the shape of the output, not the semantics. A pipeline that cannot catch "the model is now omitting action items" is not a CI pipeline for an LLM application. It is a pipeline-shaped object that provides false confidence.
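To make the gap concrete, here is a minimal sketch of the three shape checks from the story next to one semantic check. The sample summary and the `required` action items are invented for illustration and are not taken from the incident; `shape_checks` and `missing_elements` are hypothetical helper names.

```python
# Illustrative only: the summary text and required action items are made up.
summary = (
    "Summary: The client is happy with onboarding and praised the support team. "
    "They are looking forward to the next quarterly review."
)


def shape_checks(output: str) -> bool:
    """The three tests from the story: they inspect shape, not meaning."""
    return (
        isinstance(output, str) and len(output) > 0   # non-empty string
        and len(output.split()) < 500                 # under 500 words
        and "summary" in output.lower()               # contains "summary"
    )


def missing_elements(output: str, required_elements: list[str]) -> list[str]:
    """A crude semantic check: which required items are absent from the output?"""
    out_lower = output.lower()
    return [el for el in required_elements if el.lower() not in out_lower]


required = ["send revised contract", "schedule security review"]

print(shape_checks(summary))                # True: the old pipeline stays green
print(missing_elements(summary, required))  # both action items are reported missing
```

A substring check like this is still brittle, since a paraphrased action item would be flagged as missing, which is one reason an LLM judge is used for cases where wording varies.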

Why This Exists

Traditional CI/CD is built on a simple contract: given the same input, a deterministic system produces the same output. assert output == expected_string either passes or fails. You can automate this check in milliseconds. LLMs violate this contract in every dimension that matters for testing:

Stochastic outputs. The same prompt at temperature 0.7 can return different text on every call. There is no single expected string to assert against.

Semantic quality. "Is this a good summary?" has no single computable answer. The test that matters - "does this response capture the action items?" - requires semantic understanding, not string matching.

Model version sensitivity. Upgrading from claude-3-5-sonnet-20241022 to a different version may change default verbosity, formatting behavior, or handling of specific constructs. These changes are invisible to traditional tests.

Prompt change sensitivity. Adding one sentence to a system prompt can break behavior in unrelated edge cases. The failure rarely shows up at the sentence that changed - it usually appears somewhere in the interaction between the new sentence and existing instructions.

The industry response to this challenge has produced a new CI/CD paradigm for LLM applications, built on three components that replace traditional assertion-based testing:

LLM-as-judge gates - instead of assert output == expected, use a language model to evaluate output quality against a rubric
Golden dataset regression - run every change against a curated set of representative inputs, compare scores statistically to a baseline
Canary deployments with quality monitoring - route a fraction of production traffic to the new version, monitor quality signals from real users before full rollout
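The phrase "compare scores statistically to a baseline" can be read in several ways. One possible reading, sketched below, is a paired bootstrap over per-case score differences. The function names, the 0.02 tolerance, and the 95% interval are assumptions chosen for illustration, not values from this lesson.

```python
import random
import statistics


def bootstrap_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Paired bootstrap CI for mean(candidate - baseline) over the same cases.

    Both lists must hold scores for the same golden cases, in the same order.
    Returns (mean_delta, ci_low, ci_high).
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("Need equal-length, non-empty score lists.")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(len(diffs))] for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return statistics.fmean(diffs), low, high


def is_regression(baseline, candidate, tolerance=0.02):
    """Flag a regression only when the whole interval sits below -tolerance."""
    _, _, high = bootstrap_delta_ci(baseline, candidate)
    return high < -tolerance
```

Comparing intervals rather than raw means makes the gate less likely to block a merge because of judge noise on a handful of cases, at the price of needing enough golden cases for the interval to be narrow.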

The Eval Harness

The eval harness is the Python engine that powers your CI gate. It runs every prompt version against a golden dataset and uses an LLM judge to produce a scored report that either passes or blocks the merge.

# ci/eval_harness.py
"""
LLM eval harness for CI gates.
Provides:
- Golden dataset loading
- LLM-judge scoring with reasoning capture
- Subset analysis by tag (edge cases, adversarial, etc.)
- Structured report with pass/fail determination
- Cost tracking per eval run
"""
import anthropic
import json
import time
import statistics
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Callable


# ── Data Structures ───────────────────────────────────────────────────────

@dataclass
class EvalCase:
    """A single case in the golden eval dataset."""
    id: str
    input: str
    context: str = ""
    expected_behavior: str = ""   # What a good response should do (for LLM judge)
    required_elements: list = field(default_factory=list)  # Contains check
    output_pattern: str = ""      # Regex check
    tags: list = field(default_factory=list)
    weight: float = 1.0


@dataclass
class CaseResult:
    """Scored result for one eval case."""
    case_id: str
    input: str
    output: str
    score: float
    strategy: str               # exact_match | contains | regex | llm_judge | none | error
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    tags: list = field(default_factory=list)
    judge_reasoning: str = ""
    error: Optional[str] = None


@dataclass
class EvalReport:
    """Complete eval run report."""
    prompt_name: str
    prompt_version: str
    model: str
    n_cases: int
    mean_score: float
    std_score: float
    p10_score: float
    p50_score: float
    p90_score: float
    pass_rate: float          # Fraction of individual cases above failure threshold
    mean_latency_ms: float
    total_cost_usd: float
    subset_scores: dict       # tag → mean score
    failures: list            # Cases below failure threshold
    passed: bool
    threshold: float
    delta_from_baseline: Optional[float] = None


# Token cost table (USD per million tokens)
COST_TABLE = {
    "claude-opus-4-6": {"input": 15.0, "output": 75.0},
    "claude-3-5-sonnet-20241022": {"input": 3.0, "output": 15.0},
    "claude-haiku-4-5-20251001": {"input": 0.80, "output": 4.0},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
}


def compute_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    p = COST_TABLE.get(model, {"input": 3.0, "output": 15.0})
    return (in_tokens * p["input"] + out_tokens * p["output"]) / 1_000_000


# ── LLM Judge ─────────────────────────────────────────────────────────────

class LLMJudge:
    """
    LLM-as-judge: evaluates output quality against expected behavior.
    Uses claude-haiku-4-5-20251001 for cost efficiency:
    ~$0.005 per judgment vs ~$0.03 for Sonnet.
    """

    def __init__(self, judge_model: str = "claude-haiku-4-5-20251001"):
        self.client = anthropic.Anthropic()
        self.model = judge_model

    def score(
        self,
        user_input: str,
        output: str,
        expected_behavior: str,
        reference: str = "",
    ) -> tuple[float, str]:
        """
        Score an output against expected behavior.
        Returns (score in [0,1], one-sentence reasoning).
        """
        ref_section = (
            f"\nReference answer for comparison:\n{reference}"
            if reference else ""
        )

        prompt = f"""You are an expert evaluator for an AI assistant.

User input: {user_input[:400]}

Expected behavior (what the response must do):
{expected_behavior}
{ref_section}

Actual AI response:
{output[:800]}

Evaluate on a 0.0-1.0 scale:
1.0 - Fully satisfies expected behavior, accurate, complete
0.75 - Mostly satisfies, minor gaps
0.5 - Partially satisfies, missing key elements
0.25 - Significant problems or missing critical content
0.0 - Fails entirely, is harmful, or produces wrong output

Write exactly one sentence of reasoning.
Then on a new line write: SCORE: [decimal]

Example:
The response correctly identifies all required fields but omits the date range.
SCORE: 0.75"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text.strip()

        score = 0.5
        reasoning = text
        for line in text.split("\n"):
            if line.startswith("SCORE:"):
                try:
                    score = float(line.replace("SCORE:", "").strip())
                    reasoning = text.replace(line, "").strip()
                    break
                except ValueError:
                    pass

        return max(0.0, min(1.0, score)), reasoning


# ── Eval Harness ──────────────────────────────────────────────────────────

class EvalHarness:
    """
    Runs a prompt version against a golden dataset and produces a scored EvalReport.
    This is the core of the CI gate - every prompt PR runs through this.

    Design goals:
    - Fast enough for CI (target: 100 cases in under 10 minutes)
    - Comprehensive enough to catch real regressions
    - Cheap enough to run on every PR (use Haiku as judge)
    """

    def __init__(
        self,
        pass_threshold: float = 0.85,
        failure_threshold: float = 0.60,  # Individual case below this = flagged failure
        judge: Optional[LLMJudge] = None,
    ):
        self.client = anthropic.Anthropic()
        self.pass_threshold = pass_threshold
        self.failure_threshold = failure_threshold
        self.judge = judge or LLMJudge()

    def run(
        self,
        prompt_name: str,
        prompt_version: str,
        system_prompt: str,
        model: str,
        max_tokens: int,
        eval_cases: list[EvalCase],
        temperature: float = 0.0,  # Use 0 for deterministic CI runs
        baseline_score: Optional[float] = None,
    ) -> EvalReport:
        """
        Run the full eval harness.

        Note: temperature=0.0 is recommended for CI runs. It does not eliminate
        stochasticity at the model level but reduces variance enough to make
        the threshold comparison meaningful.
        """
        if not eval_cases:
            raise ValueError("No eval cases provided.")

        print(f"Eval: {prompt_name}@{prompt_version} | {model} | {len(eval_cases)} cases")
        results = []

        for i, case in enumerate(eval_cases):
            result = self._run_case(case, system_prompt, model, max_tokens, temperature)
            results.append(result)
            status = "PASS" if result.score >= self.failure_threshold else "FAIL"
            print(
                f"  [{i+1:3d}/{len(eval_cases)}] {case.id:45s} "
                f"score={result.score:.2f} [{status}] ({result.strategy})"
            )

        return self._build_report(
            prompt_name, prompt_version, model, results, baseline_score
        )

    def _run_case(
        self,
        case: EvalCase,
        system: str,
        model: str,
        max_tokens: int,
        temperature: float,
    ) -> CaseResult:
        """Run one eval case: generate output, then score it."""
        user_content = case.input
        if case.context:
            user_content = f"{case.input}\n\nContext:\n{case.context}"

        # Generate model output
        start = time.monotonic()
        try:
            response = self.client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                system=system,
                messages=[{"role": "user", "content": user_content}],
            )
            latency_ms = (time.monotonic() - start) * 1000
            output = response.content[0].text
            in_tok = response.usage.input_tokens
            out_tok = response.usage.output_tokens
            cost = compute_cost(model, in_tok, out_tok)
        except Exception as e:
            return CaseResult(
                case_id=case.id, input=case.input, output="",
                score=0.0, strategy="error", latency_ms=0,
                input_tokens=0, output_tokens=0, cost_usd=0,
                tags=case.tags, error=str(e),
            )

        # Choose and run eval strategy
        score, strategy, reasoning = self._score(case, output)

        return CaseResult(
            case_id=case.id, input=case.input, output=output,
            score=score, strategy=strategy,
            latency_ms=latency_ms, input_tokens=in_tok,
            output_tokens=out_tok, cost_usd=cost,
            tags=case.tags, judge_reasoning=reasoning,
        )

    def _score(
        self, case: EvalCase, output: str
    ) -> tuple[float, str, str]:
        """Select and run the best eval strategy for this case."""
        import re

        # 1. Exact match (highest precision, use when output is deterministic)
        if case.required_elements == ["__EXACT__"]:
            # Special sentinel for exact match mode
            score = 1.0 if output.strip() == case.expected_behavior.strip() else 0.0
            return score, "exact_match", ""

        # 2. Contains check (all required elements must appear)
        if case.required_elements:
            out_lower = output.lower()
            found = sum(1 for el in case.required_elements if el.lower() in out_lower)
            return found / len(case.required_elements), "contains", ""

        # 3. Regex pattern check
        if case.output_pattern:
            score = 1.0 if re.search(case.output_pattern, output, re.DOTALL) else 0.0
            return score, "regex", ""

        # 4. LLM judge (most flexible, highest quality signal)
        if case.expected_behavior:
            score, reasoning = self.judge.score(
                case.input, output, case.expected_behavior
            )
            return score, "llm_judge", reasoning

        # 5. No eval strategy - always pass (format-only tests)
        return 1.0, "none", ""

    def _build_report(
        self,
        prompt_name: str,
        prompt_version: str,
        model: str,
        results: list[CaseResult],
        baseline_score: Optional[float],
    ) -> EvalReport:
        valid = [r for r in results if r.error is None]
        scores = [r.score for r in valid]

        if not scores:
            raise RuntimeError("All eval cases returned errors. Check API key and model name.")

        sorted_scores = sorted(scores)
        n = len(sorted_scores)
        pct = lambda p: sorted_scores[min(int(n * p), n - 1)]

        # Subset scores by tag
        tag_scores: dict[str, list[float]] = {}
        for r in valid:
            for tag in r.tags:
                tag_scores.setdefault(tag, []).append(r.score)

        mean_score = sum(scores) / len(scores)

        return EvalReport(
            prompt_name=prompt_name,
            prompt_version=prompt_version,
            model=model,
            n_cases=len(results),
            mean_score=mean_score,
            std_score=statistics.stdev(scores) if len(scores) > 1 else 0.0,
            p10_score=pct(0.10),
            p50_score=pct(0.50),
            p90_score=pct(0.90),
            pass_rate=sum(1 for s in scores if s >= self.failure_threshold) / len(scores),
            mean_latency_ms=sum(r.latency_ms for r in valid) / len(valid),
            total_cost_usd=sum(r.cost_usd for r in results),
            subset_scores={tag: sum(s)/len(s) for tag, s in tag_scores.items()},
            failures=[r for r in valid if r.score < self.failure_threshold],
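            # Errored cases are excluded from the mean, so also require that none
            # occurred; otherwise a run where most calls failed could still pass.
            passed=mean_score >= self.pass_threshold and len(valid) == len(results),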
            threshold=self.pass_threshold,
            delta_from_baseline=(
                mean_score - baseline_score if baseline_score is not None else None
            ),
        )

    def print_report(self, report: EvalReport) -> None:
        """Print a formatted report to stdout."""
        status = "PASSED" if report.passed else "FAILED"
        delta_str = ""
        if report.delta_from_baseline is not None:
            sign = "+" if report.delta_from_baseline >= 0 else ""
            delta_str = f" ({sign}{report.delta_from_baseline:.3f} vs baseline)"

        print(f"\n{'='*65}")
        print(f"EVAL: {report.prompt_name}@{report.prompt_version} [{status}]{delta_str}")
        print(f"Model: {report.model}")
        print(f"{'='*65}")
        print(f"Score:    mean={report.mean_score:.3f}  std={report.std_score:.3f}")
        print(f"          p10={report.p10_score:.3f}  p50={report.p50_score:.3f}  "
              f"p90={report.p90_score:.3f}")
        print(f"Pass rate: {report.pass_rate:.1%} of cases above {self.failure_threshold}")
        print(f"Latency:   {report.mean_latency_ms:.0f}ms avg")
        print(f"Cost:      ${report.total_cost_usd:.4f} total")

        if report.subset_scores:
            print(f"\nSubset scores:")
            for tag, score in sorted(report.subset_scores.items()):
                ok = "OK  " if score >= report.threshold else "FAIL"
                print(f"  [{ok}] {tag:35s} {score:.3f}")

        if report.failures:
            print(f"\nFailed cases ({len(report.failures)}):")
            for r in report.failures[:5]:
                print(f"  [{r.case_id}] score={r.score:.2f} ({r.strategy})")
                print(f"    Input:  {r.input[:80]}...")
                if r.judge_reasoning:
                    print(f"    Judge:  {r.judge_reasoning[:100]}...")
        print(f"{'='*65}\n")

    def save_report(self, report: EvalReport, output_path: str) -> None:
        """Save report as JSON for CI artifact upload and historical tracking."""
        data = {
            **report.__dict__,
            "failures": [r.__dict__ for r in report.failures],
        }
        with open(output_path, "w") as f:
            json.dump(data, f, indent=2)
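The module docstring lists golden dataset loading, but the listing above does not include a loader. Below is a minimal sketch of one, assuming the dataset is stored as JSONL with one case per line; that file format and the `load_golden_dataset` name are assumptions. The `EvalCase` class is repeated here only so the snippet runs on its own.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvalCase:  # local copy of the EvalCase fields, so this snippet is standalone
    id: str
    input: str
    context: str = ""
    expected_behavior: str = ""
    required_elements: list = field(default_factory=list)
    output_pattern: str = ""
    tags: list = field(default_factory=list)
    weight: float = 1.0


def load_golden_dataset(path: str) -> list[EvalCase]:
    """Load one EvalCase per non-blank JSONL line; fail loudly on bad rows."""
    cases, seen = [], set()
    text = Path(path).read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        row = json.loads(line)
        if "id" not in row or "input" not in row:
            raise ValueError(f"{path}:{lineno}: 'id' and 'input' are required")
        if row["id"] in seen:
            raise ValueError(f"{path}:{lineno}: duplicate id {row['id']!r}")
        seen.add(row["id"])
        cases.append(EvalCase(**row))  # unknown keys raise TypeError
    return cases
```

A row such as `{"id": "handoff-001", "input": "...", "required_elements": ["deadline"], "tags": ["action_items"]}` would then be scored by the contains strategy and grouped under the `action_items` subset in the report.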

GitHub Actions CI Pipeline

The CI pipeline runs automatically on every pull request that touches prompt files or LLM-related application code. Speed matters: if the pipeline takes 30 minutes, engineers work around it.

# .github/workflows/llm-ci.yml
name: LLM CI Pipeline

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'evals/**'

env:
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  PASS_THRESHOLD: "0.85"

jobs:
  # ── Stage 1: Schema and format validation (fast, free) ────────────────
  schema-check:
    name: Schema Validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install pyyaml jsonschema
      - name: Validate prompt YAML schemas
        run: python ci/validate_schemas.py
      - name: Validate eval dataset JSON
        run: python ci/validate_eval_datasets.py

  # ── Stage 2: Eval gate (LLM judge, blocks merge on failure) ───────────
  eval-gate:
    name: Prompt Eval Gate
    runs-on: ubuntu-latest
    needs: schema-check
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: pip-${{ hashFiles('requirements-ci.txt') }}

      - run: pip install anthropic pyyaml

      - name: Detect changed prompt files
        id: detect
        run: |
          git fetch origin ${{ github.base_ref }}
          git diff --name-only origin/${{ github.base_ref }}...HEAD \
            | grep '^prompts/' > changed_prompts.txt || true
          echo "count=$(wc -l < changed_prompts.txt | tr -d ' ')" >> $GITHUB_OUTPUT
          echo "Changed prompts:"
          cat changed_prompts.txt

      - name: Run eval gate
        if: steps.detect.outputs.count != '0'
        id: eval
        run: |
          python ci/run_eval_gate.py \
            --changed-files changed_prompts.txt \
            --threshold ${{ env.PASS_THRESHOLD }} \
            --output eval_report.json
        continue-on-error: true  # Don't fail here - fail in the next step after posting comment

      - name: Post eval results as PR comment
        if: steps.detect.outputs.count != '0'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            let body = '## LLM Eval Gate Results\n\n';
            try {
              const reports = JSON.parse(fs.readFileSync('eval_report.json'));
              for (const r of reports) {
                const icon = r.passed ? '✅' : '❌';
                body += `### ${icon} ${r.prompt_name}@${r.prompt_version}\n\n`;
                body += `| Metric | Value |\n|--------|-------|\n`;
                body += `| Mean score | \`${r.mean_score.toFixed(3)}\` |\n`;
                body += `| Threshold | \`${r.threshold}\` |\n`;
                body += `| Pass rate | \`${(r.pass_rate * 100).toFixed(1)}%\` |\n`;
                body += `| Cost | \`$${r.total_cost_usd.toFixed(4)}\` |\n`;
                body += `| Cases | ${r.n_cases} |\n\n`;

                if (r.subset_scores && Object.keys(r.subset_scores).length > 0) {
                  body += `**Subset scores:**\n`;
                  for (const [tag, score] of Object.entries(r.subset_scores)) {
                    const icon2 = score >= r.threshold ? '✅' : '⚠️';
                    body += `- ${icon2} \`${tag}\`: ${parseFloat(score).toFixed(3)}\n`;
                  }
                  body += '\n';
                }

                if (r.failures && r.failures.length > 0) {
                  body += `**Failed cases (${r.failures.length}):**\n`;
                  for (const f of r.failures.slice(0, 3)) {
                    body += `- score=\`${f.score.toFixed(2)}\`: ${f.input.slice(0, 80)}...\n`;
                    if (f.judge_reasoning) {
                      body += `  _${f.judge_reasoning.slice(0, 100)}_\n`;
                    }
                  }
                }
              }
            } catch (e) {
              body += '_Could not read eval results._\n';
            }
            // Await the call so an API error fails this step instead of being dropped.
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });

      - name: Fail build if eval gate failed
        if: steps.detect.outputs.count != '0' && steps.eval.outcome == 'failure'
        run: |
          echo "Eval gate FAILED. Review the PR comment for details."
          exit 1

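      - name: Upload eval report artifact
        # always() keeps this step running after the "Fail build" step exits non-zero,
        # which is exactly when the report is most needed.
        if: always() && steps.detect.outputs.count != '0'
        uses: actions/upload-artifact@v4
        with:
          name: eval-report-${{ github.sha }}
          path: eval_report.json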
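The workflow calls `ci/run_eval_gate.py`, which is not shown in this lesson. The sketch below covers only the contract the workflow appears to rely on: the three CLI flags, a JSON list of reports written to `--output`, and a non-zero exit code when any report fails. The `evaluate` callable and `placeholder_evaluate` are hypothetical stand-ins for the code that would load a prompt and call `EvalHarness.run()`.

```python
import argparse
import json
from pathlib import Path
from typing import Callable


def run_gate(
    changed_files: list[str],
    threshold: float,
    evaluate: Callable[[str, float], dict],
) -> tuple[list[dict], int]:
    """Evaluate each changed prompt file; return (reports, exit_code)."""
    reports = [evaluate(path, threshold) for path in changed_files]
    exit_code = 0 if all(r["passed"] for r in reports) else 1
    return reports, exit_code


def placeholder_evaluate(path: str, threshold: float) -> dict:
    """Hypothetical hook: a real gate would load the prompt and its golden
    dataset, run the harness, and return the report as a dict."""
    raise NotImplementedError("wire this to EvalHarness.run()")


def main(argv=None, evaluate=placeholder_evaluate) -> int:
    parser = argparse.ArgumentParser(description="LLM eval gate (sketch)")
    parser.add_argument("--changed-files", required=True)
    parser.add_argument("--threshold", type=float, default=0.85)
    parser.add_argument("--output", default="eval_report.json")
    args = parser.parse_args(argv)

    lines = Path(args.changed_files).read_text().splitlines()
    changed = [ln.strip() for ln in lines if ln.strip()]
    reports, exit_code = run_gate(changed, args.threshold, evaluate)
    # Write the report before returning so the PR comment step can read it
    # even when the gate fails.
    Path(args.output).write_text(json.dumps(reports, indent=2))
    return exit_code

# As a script, the entry point would be: import sys; sys.exit(main())
```

Writing the report before returning the exit code is what lets the `continue-on-error` pattern in the workflow work: the comment step always has a file to read, and the later step turns the recorded failure into a failed build.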

Canary Deployment with Quality Monitoring

After the CI gate passes, the canary deployment system monitors real production traffic for quality regression before full rollout.

# deployment/canary.py
"""
Canary deployment system for LLM features.
Routes a configurable percentage of traffic to the new version,
scores quality for both versions, and triggers rollback if signals degrade.
"""
import anthropic
import json
import hashlib
import time
import statistics
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from pathlib import Path
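from typing import Optional

# Needed by handle_request_with_canary below; assumes the repo root is on sys.path.
from ci.eval_harness import LLMJudge, compute_cost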


@dataclass
class CanaryConfig:
    feature_name: str
    incumbent_version: str
    canary_version: str
    canary_pct: float = 0.05           # 5% to canary by default
    monitor_hours: int = 48            # How long to monitor before decision
    min_canary_samples: int = 100      # Min requests before deciding
    quality_drop_threshold: float = 0.05   # Absolute score drop triggers rollback
    error_rate_threshold: float = 0.02    # 2% error rate triggers rollback
    cost_spike_threshold: float = 0.50   # 50% cost increase triggers alert


class CanaryMonitor:
    """
    Records per-version quality, error, and cost metrics.
    Evaluates canary health and recommends action.
    """

    def __init__(self, config: CanaryConfig, store_dir: str = "canary_data"):
        self.config = config
        self.store = Path(store_dir)
        self.store.mkdir(exist_ok=True)
        self._metrics: dict[str, list[dict]] = {
            "incumbent": [],
            "canary": [],
        }
        self._load()

    def route(self, user_id: str) -> str:
        """
        Route a request to incumbent or canary.
        Uses consistent hashing so the same user always gets the same version.
        Avoids confusing users who would see different behavior on different requests.
        """
        bucket = int(
            hashlib.md5(f"{user_id}:{self.config.feature_name}".encode()).hexdigest(),
            16
        ) % 100
        # round() rather than int(): int(0.29 * 100) truncates to 28.
        if bucket < round(self.config.canary_pct * 100):
            return self.config.canary_version
        return self.config.incumbent_version

    def record(
        self,
        version: str,
        quality_score: float,
        latency_ms: float,
        cost_usd: float,
        error: bool = False,
    ) -> None:
        """Record metrics for one request."""
        group = "canary" if version == self.config.canary_version else "incumbent"
        self._metrics[group].append({
            "ts": datetime.utcnow().isoformat(),
            "score": quality_score,
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "error": error,
        })
        self._save()

    def evaluate(self) -> dict:
        """
        Evaluate canary health. Returns:
        {
          "action": "continue" | "promote" | "rollback" | "alert",
          "reason": str,
          "incumbent_score": float,
          "canary_score": float,
          ...
        }
        """
        can = self._metrics["canary"]
        inc = self._metrics["incumbent"]

        if len(can) < self.config.min_canary_samples:
            return {
                "action": "continue",
                "reason": f"Insufficient canary samples: {len(can)}/{self.config.min_canary_samples}",
                "canary_samples": len(can),
            }

        # Core metrics
        can_errors = [m for m in can if m["error"]]
        can_ok = [m for m in can if not m["error"]]
        inc_ok = [m for m in inc if not m["error"]]

        can_error_rate = len(can_errors) / len(can) if can else 0
        can_scores = [m["score"] for m in can_ok]
        inc_scores = [m["score"] for m in inc_ok]

        can_mean = statistics.mean(can_scores) if can_scores else 0
        inc_mean = statistics.mean(inc_scores) if inc_scores else 0
        score_delta = can_mean - inc_mean

        can_cost = statistics.mean(m["cost_usd"] for m in can_ok) if can_ok else 0
        inc_cost = statistics.mean(m["cost_usd"] for m in inc_ok) if inc_ok else 0
        cost_delta_pct = (can_cost - inc_cost) / inc_cost if inc_cost > 0 else 0

        base_result = {
            "canary_samples": len(can),
            "incumbent_score": inc_mean,
            "canary_score": can_mean,
            "score_delta": score_delta,
            "canary_error_rate": can_error_rate,
            "cost_delta_pct": cost_delta_pct,
        }

        # 1. Error rate spike → rollback immediately
        if can_error_rate > self.config.error_rate_threshold:
            return {
                **base_result,
                "action": "rollback",
                "reason": (
                    f"Canary error rate {can_error_rate:.1%} exceeds "
                    f"threshold {self.config.error_rate_threshold:.1%}"
                ),
            }

        # 2. Quality drop → rollback
        if score_delta < -self.config.quality_drop_threshold:
            return {
                **base_result,
                "action": "rollback",
                "reason": (
                    f"Canary quality drop {score_delta:.3f} exceeds "
                    f"threshold -{self.config.quality_drop_threshold}"
                ),
            }

        # 3. Cost spike → alert (not rollback - may be intentional)
        if cost_delta_pct > self.config.cost_spike_threshold:
            return {
                **base_result,
                "action": "alert",
                "reason": (
                    f"Canary cost {cost_delta_pct:.1%} higher than incumbent. "
                    f"Investigate before promoting."
                ),
            }

        # 4. Monitoring window complete → promote
        if can:
            oldest = datetime.fromisoformat(can[0]["ts"])
            elapsed = datetime.utcnow() - oldest
            if elapsed >= timedelta(hours=self.config.monitor_hours):
                if score_delta >= -0.01:  # Canary is equivalent or better
                    return {
                        **base_result,
                        "action": "promote",
                        "reason": (
                            f"Monitoring window complete. "
                            f"Canary score {can_mean:.3f} vs incumbent {inc_mean:.3f} "
                            f"(delta {score_delta:+.3f})."
                        ),
                    }

        return {
            **base_result,
            "action": "continue",
            "reason": "Canary within bounds, monitoring continues.",
        }

    def _save(self) -> None:
        path = self.store / f"{self.config.feature_name}_metrics.json"
        path.write_text(json.dumps(self._metrics, indent=2))

    def _load(self) -> None:
        path = self.store / f"{self.config.feature_name}_metrics.json"
        if path.exists():
            self._metrics = json.loads(path.read_text())


# ── Integration: Request with canary routing ──────────────────────────────

def handle_request_with_canary(
    user_id: str,
    user_message: str,
    monitor: CanaryMonitor,
    system_prompts: dict[str, str],  # version → system prompt text
    judge: "LLMJudge",  # from eval_harness.py
) -> str:
    """
    Handle a production request through the canary system.
    Records quality metrics for monitoring and decision-making.
    """
    client = anthropic.Anthropic()
    version = monitor.route(user_id)
    system = system_prompts[version]

    start = time.monotonic()
    error = False
    output = ""
    cost = 0.0

    try:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            system=system,
            messages=[{"role": "user", "content": user_message}],
        )
        output = response.content[0].text
        cost = compute_cost(
            "claude-3-5-sonnet-20241022",
            response.usage.input_tokens,
            response.usage.output_tokens,
        )
    except Exception:
        error = True

    latency_ms = (time.monotonic() - start) * 1000

    # Score quality asynchronously in production; synchronously here for simplicity
    quality_score = 0.5
    if not error and output:
        quality_score, _ = judge.score(
            user_message, output,
            "Provide an accurate, helpful, concise response"
        )

    monitor.record(version, quality_score, latency_ms, cost, error)

    # Check canary health (in production, run this on a background scheduler)
    decision = monitor.evaluate()
    if decision["action"] == "rollback":
        # In production: trigger PagerDuty, update feature flag config, alert team
        print(f"CANARY ROLLBACK TRIGGERED: {decision['reason']}")
    elif decision["action"] == "promote":
        print(f"CANARY READY TO PROMOTE: {decision['reason']}")

    return output

Rollback Architecture

The most important property of a rollback is that it must be fast and not require a redeploy. A redeploy takes 5–15 minutes. An incident that is visible to users for 15 minutes is measurably worse than one visible for 30 seconds. Build your rollback mechanism so it operates at the configuration level, not the deployment level.

# deployment/rollback.py
"""
Rollback manager - changes the active prompt/model version via a config file.
Application code reads this config on each request (or with a short-lived cache).
In production, replace with a feature flag service (LaunchDarkly, Unleash, Split).
"""
import json
import time
from pathlib import Path
from datetime import datetime


class RollbackManager:
    """
    File-based feature configuration with rollback support.
    The application polls this config (or receives a push update) and switches
    behavior without requiring a code deploy.
    """

    def __init__(self, config_path: str = "feature_config.json"):
        self.path = Path(config_path)
        self._config: dict = {}
        self._load()

    def _load(self) -> None:
        if self.path.exists():
            self._config = json.loads(self.path.read_text())

    def _save(self) -> None:
        self.path.write_text(json.dumps(self._config, indent=2))

    def set_version(
        self,
        feature: str,
        version: str,
        reason: str = "",
        operator: str = "",
    ) -> None:
        """Set the active version for a feature. Appends to change log."""
        entry = {
            "version": version,
            "set_at": datetime.utcnow().isoformat(),
            "reason": reason,
            "operator": operator,
        }
        self._config[feature] = entry
        self._save()

        # Append to audit log
        log_path = self.path.parent / f"{feature}_audit.jsonl"
        with open(log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

        print(f"[{feature}] active version → {version}"
              + (f" ({reason})" if reason else ""))

    def get_version(self, feature: str, default: str = "latest") -> str:
        """Get the currently active version for a feature."""
        return self._config.get(feature, {}).get("version", default)

    def rollback(
        self,
        feature: str,
        to_version: str,
        reason: str,
        operator: str = "automated",
    ) -> None:
        """
        Immediately roll back a feature to a known-good version.
        This is the emergency brake - call it when canary signals trigger.
        """
        self.set_version(
            feature, to_version,
            reason=f"ROLLBACK: {reason}",
            operator=operator,
        )
        print(f"\nROLLBACK COMPLETE")
        print(f"  Feature:  {feature}")
        print(f"  Version:  {to_version}")
        print(f"  Reason:   {reason}")
        print(f"  Operator: {operator}")
        print(f"  Time:     {datetime.utcnow().isoformat()}")

    def get_audit_log(self, feature: str, limit: int = 20) -> list[dict]:
        """Retrieve recent version changes for a feature."""
        log_path = self.path.parent / f"{feature}_audit.jsonl"
        if not log_path.exists():
            return []
        with open(log_path) as f:
            lines = [json.loads(l) for l in f if l.strip()]
        # Newest entries first, capped at `limit` (the log file is append-only, oldest first)
        return list(reversed(lines))[:limit]
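
The module docstring above says the application reads this config on each request or with a short-lived cache. A minimal sketch of the cached variant follows. `CachedVersionReader` and its `ttl_seconds` parameter are hypothetical names introduced here for illustration; the only thing assumed from the lesson is the JSON layout that `RollbackManager` writes (`{feature: {"version": ...}}`).

```python
import json
import time
from pathlib import Path


class CachedVersionReader:
    """Serves the active version from memory, re-reading the config file at most once per TTL."""

    def __init__(self, config_path: str = "feature_config.json", ttl_seconds: float = 5.0):
        self.path = Path(config_path)
        self.ttl = ttl_seconds
        self._cached: dict = {}
        self._loaded_at = None  # monotonic timestamp of the last read attempt

    def get_version(self, feature: str, default: str = "latest") -> str:
        now = time.monotonic()
        if self._loaded_at is None or now - self._loaded_at >= self.ttl:
            try:
                self._cached = json.loads(self.path.read_text())
            except (FileNotFoundError, json.JSONDecodeError):
                # Keep serving the last good config if the file is missing or mid-write
                pass
            self._loaded_at = now
        return self._cached.get(feature, {}).get("version", default)
```

With this shape, the TTL is the upper bound on how long a rollback takes to reach a given process: a 5-second TTL keeps recovery well inside the 30-second target while avoiding a file read on every request.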

Nightly Regression Scheduling

Beyond PR gates, run your eval harness on a nightly schedule to catch provider-side model updates that happen without notice:

# .github/workflows/nightly-regression.yml
name: Nightly Regression Check

on:
  schedule:
    - cron: '0 2 * * *'   # 2 AM UTC daily
  workflow_dispatch:         # Allow manual trigger

jobs:
  full-regression:
    name: Full Regression Suite
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install anthropic pyyaml

      - name: Run full regression suite (all production prompts)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python ci/nightly_regression.py \
            --all-production-prompts \
            --threshold 0.82 \
            --output nightly_report.json

      - name: Upload report artifact
        if: always()   # upload even when the regression step fails, so the alert has a report to point to
        uses: actions/upload-artifact@v4
        with:
          name: nightly-regression-${{ github.run_id }}
          path: nightly_report.json

      - name: Alert on regression via Slack
        if: failure()
        uses: 8398a7/action-slack@v3
        with:
          status: failure
          text: |
            Nightly regression detected quality regression.
            Review artifact: nightly-regression-${{ github.run_id }}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

Production Engineering Notes

Eval Speed Is Not Optional

If your CI eval gate takes 30 minutes, engineers will find ways to work around it - merging prompt changes without waiting for CI, flagging tests as flaky, or simply not tagging prompt changes as prompt changes. Target under 10 minutes for your CI eval run. Strategies: keep the fast regression set small (30–50 cases is enough for meaningful signals), use claude-haiku-4-5-20251001 as the judge (not Opus), run eval cases concurrently where possible, and cache results for unchanged prompt versions using the prompt hash.
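
Two of these strategies, concurrent execution and hash-based caching, can be sketched together. This is an illustration under stated assumptions, not the lesson's eval harness: `run_eval_cached`, `eval_cache_key`, and the `score_case` callable are hypothetical names, and the cache key here also covers the eval cases (an added assumption) so that editing the dataset invalidates old results.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def eval_cache_key(prompt_text: str, model: str, cases: list[dict]) -> str:
    """Stable key that changes whenever the prompt, the model ID, or the eval cases change."""
    payload = json.dumps(
        {"model": model, "prompt": prompt_text, "cases": cases}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_eval_cached(
    prompt_text: str,
    model: str,
    cases: list[dict],
    score_case: Callable[[str, dict], float],  # hypothetical: runs one case, returns a 0-1 score
    cache_dir: str = ".eval_cache",
    max_workers: int = 8,
) -> dict:
    if not cases:
        raise ValueError("cases must not be empty")
    cache_path = Path(cache_dir) / f"{eval_cache_key(prompt_text, model, cases)}.json"
    if cache_path.exists():
        # Unchanged prompt + model + dataset: reuse the stored result
        return json.loads(cache_path.read_text()) | {"cached": True}

    # Eval cases are I/O-bound API calls, so threads give near-linear speedup
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda case: score_case(prompt_text, case), cases))

    result = {"mean_score": sum(scores) / len(scores), "n_cases": len(scores)}
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result))
    return result | {"cached": False}
```

A cache like this suits PR gates only. A nightly run that exists to catch provider-side model changes should bypass it, because the prompt hash does not change when the provider changes the model behind the same ID.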

Separate Fast and Comprehensive Eval Sets

Maintain two eval datasets: a fast regression set (20–30 cases, runs in CI on every PR, 5–8 minutes) and a comprehensive eval set (200+ cases, runs nightly or before major releases). The fast set catches regressions quickly without blocking development. The comprehensive set gives you confident accuracy estimates. The two sets should have overlapping but not identical cases - overlap for consistency comparison, different cases for coverage breadth.

Track Eval Costs as Infrastructure Costs

Running an LLM judge eval on 50 cases twice a day costs real money - approximately $2–5 per day depending on the judge model and case complexity. Over a year, this is $700–1,800. Track this explicitly in your infrastructure cost dashboard. A reasonable ceiling: eval costs should not exceed 5% of production inference costs. If you are spending more than that on eval relative to production, you are over-evaluating; if you are spending less than 1%, you are under-evaluating and your quality monitoring is insufficient.
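
The arithmetic behind that budget can be captured in a small helper. This is a sketch only: `eval_cost_report` is a hypothetical function, and the 1% and 5% bounds are the rule-of-thumb values from the paragraph above, not measured limits.

```python
def eval_cost_report(
    daily_eval_cost_usd: float,
    daily_inference_cost_usd: float,
    low_ratio: float = 0.01,   # below this: likely under-evaluating
    high_ratio: float = 0.05,  # above this: likely over-evaluating
) -> dict:
    """Annualize eval spend and compare it against production inference spend."""
    if daily_inference_cost_usd <= 0:
        raise ValueError("daily_inference_cost_usd must be positive")
    ratio = daily_eval_cost_usd / daily_inference_cost_usd
    if ratio > high_ratio:
        status = "over-evaluating"
    elif ratio < low_ratio:
        status = "under-evaluating"
    else:
        status = "ok"
    return {
        "annual_eval_cost_usd": round(daily_eval_cost_usd * 365, 2),
        "eval_to_inference_ratio": ratio,
        "status": status,
    }


# $2/day of eval against $100/day of inference: 2% of production spend, $730/year
print(eval_cost_report(2.0, 100.0))
```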

:::warning Non-Determinism Requires Conservative Thresholds Do not set your CI pass threshold so close to your expected score that sampling noise decides the outcome. If your prompt consistently scores a mean of 0.87 and your threshold is 0.87, you will have intermittent CI failures from sampling noise. Set your threshold at least 0.03–0.05 (3–5 points on a 100-point scale) below your expected score. For a prompt you expect to score 0.88, set the threshold at 0.83. If variance is high, run the eval multiple times and average the scores. :::
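
One way to pick the margin is to measure run-to-run noise directly. The sketch below takes the mean scores from several repeated eval runs of the same prompt and places the threshold below the expected score by whichever is larger: a fixed minimum margin or a multiple of the observed standard deviation. `suggest_threshold` is a hypothetical helper, and the `k_sigma` rule is an added assumption beyond the fixed 0.03–0.05 margin described above.

```python
import statistics


def suggest_threshold(
    run_means: list[float],
    min_margin: float = 0.03,  # never sit closer than this to the expected score
    k_sigma: float = 2.0,      # assumption: widen the margin when runs are noisy
) -> float:
    """Suggest a CI pass threshold from the mean scores of repeated eval runs."""
    if len(run_means) < 2:
        raise ValueError("need at least two runs to estimate variance")
    expected = statistics.mean(run_means)
    noise = statistics.stdev(run_means)  # sample standard deviation across runs
    margin = max(min_margin, k_sigma * noise)
    return round(expected - margin, 3)


# Five repeated runs of the same prompt against the same golden dataset
stable_runs = [0.88, 0.87, 0.89, 0.88, 0.88]
print(suggest_threshold(stable_runs))
```

For low-variance runs like these, the fixed minimum margin dominates. When the runs disagree more, the standard-deviation term takes over and pushes the threshold lower, which is also a signal that the eval set or the judge rubric needs tightening.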

:::danger Never Skip the Eval Gate Under Deadline Pressure The most common cause of production LLM incidents is "we had to ship fast and the eval takes too long." If the deadline is real, make the eval faster - smaller golden dataset, faster judge model, parallel execution. If the eval cannot be made faster, negotiate the deadline. A production LLM incident under deadline pressure costs more in remediation, customer trust repair, and engineering time than the deadline was worth. :::

:::tip Use temperature=0.0 for CI Runs Setting temperature to 0.0 for eval runs reduces output variance significantly, making CI pass/fail decisions more stable. It does not eliminate all non-determinism at the model level (hosted models have additional sources of variance), but it reduces the LLM-side variance enough that threshold comparisons become reliable. Run production at your actual temperature; run CI evals at 0.0. :::

Interview Q&A

Q1: Why does traditional CI fail for LLM applications, and what replaces it?

Traditional CI relies on deterministic assertions: assert output == expected. This fails for LLMs because outputs are stochastic - the same prompt can return different text on each call - and because quality is semantic, not structural. You cannot write a regex that detects "the model is now omitting action items from summaries." The replacement is a three-part system: schema validation for structural requirements (output is valid JSON, required fields present, response is non-empty), LLM-as-judge scoring for semantic quality (uses a capable model to evaluate output against a rubric, returning a float score), and statistical thresholding (mean score over a golden dataset must exceed a threshold, with subset analysis to catch partial regressions). None of these is perfect individually, but together they catch the failure modes that traditional CI cannot see.

Q2: How do you design a golden eval dataset for CI? What makes a good eval case?

A good golden eval dataset has four properties. Representative: covers the actual distribution of production queries, including common cases, edge cases, and failure-prone inputs - not just the easy happy path. Diverse: cases are labeled with tags (core, edge-case, adversarial, format) so subset scores can be computed - a regression that only affects edge cases is invisible without tag-level reporting. Annotated with expected behavior: each case has a description of what a correct response must do, written precisely enough for an LLM judge to evaluate against. Weighted: edge cases and adversarial cases should have higher weight than routine core cases. 50–100 well-designed cases with these properties provides more CI value than 500 poorly designed cases. Add cases aggressively after every production incident - each incident is a gap in your eval coverage.
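
To make the tag and weight properties concrete, here is a minimal scoring sketch. The field names (`score`, `weight`, `tags`) and the `summarize_scores` helper are assumptions made for illustration and may not match the eval harness used elsewhere in this lesson.

```python
from collections import defaultdict


def summarize_scores(results: list[dict]) -> dict:
    """
    results: one dict per eval case, e.g.
      {"score": 0.9, "weight": 1.0, "tags": ["core"]}
    Returns the weighted overall mean plus an unweighted mean per tag.
    """
    total_weight = sum(r.get("weight", 1.0) for r in results)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_mean = (
        sum(r["score"] * r.get("weight", 1.0) for r in results) / total_weight
    )

    # Group scores by tag so a regression confined to one subset stays visible
    by_tag = defaultdict(list)
    for r in results:
        for tag in r.get("tags", []):
            by_tag[tag].append(r["score"])

    return {
        "weighted_mean": weighted_mean,
        "by_tag": {tag: sum(s) / len(s) for tag, s in by_tag.items()},
    }


results = [
    {"score": 0.9, "weight": 1.0, "tags": ["core"]},
    {"score": 0.9, "weight": 1.0, "tags": ["core"]},
    {"score": 0.3, "weight": 2.0, "tags": ["edge-case"]},
]
print(summarize_scores(results))
```

In this example the unweighted mean is 0.70, while the weighted mean drops to 0.60 and the `edge-case` subset reports 0.30 on its own. A single aggregate number would hide that.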

Q3: Walk me through a canary deployment for an LLM application end-to-end.

After the CI eval gate passes and the PR merges, the canary begins. Deploy the new prompt/model version to your serving infrastructure but configure routing to send only 5% of production traffic to it. Use consistent hashing on user ID so the same user always receives the same version within their session - inconsistency within a session confuses users and muddies the quality signal. For every request in both the incumbent and canary cohorts, score quality using an LLM judge (asynchronously or on a sampled basis) and record score, latency, error flag, and cost to a metrics store. Monitor two primary signals: quality score differential (is canary's rolling mean within 0.05 of incumbent's?) and error rate (is canary producing more API errors or format failures?). If either signal degrades beyond threshold, trigger automatic rollback - update the routing config, no redeploy required, sub-30-second recovery. If both signals hold for 48 hours with at least 100 canary samples, promote the canary to 100% and deprecate the incumbent.

Q4: How do you handle LLM provider model updates that happen without notice?

Three defenses in depth. First, pin model versions explicitly in your code and configuration - use claude-3-5-sonnet-20241022 with the full date suffix, not just claude-3-5-sonnet. Most providers support explicit version IDs. Second, run your full regression eval suite on a nightly schedule, not just on code changes. If Anthropic updates the underlying model on a Monday, your nightly eval on Tuesday catches it. Third, maintain production quality monitoring: if you score 5–10% of production requests and maintain a rolling quality score alert, an unexpected model provider change shows up as an anomaly in your production metrics within hours. When you want to intentionally upgrade model versions, treat it like a prompt change: run a head-to-head eval on your golden dataset before routing any production traffic to the new version.

Q5: What is the minimum viable CI/CD setup for a two-person team shipping an LLM feature?

Three things, achievable in one day. First, a 20-example golden dataset in a JSON file - cover the core task, two edge cases, and one adversarial case. Second, a Python script that runs each example through the current prompt and scores the output with claude-haiku-4-5-20251001 as judge, printing a mean score and failing with exit code 1 if the score drops below 0.80. Third, a GitHub Actions workflow that runs this script on every PR that touches files under prompts/. That is the complete minimum. It will catch the failure mode that caused the example at the start of this lesson - the model upgrade that silently dropped action items - because the action items are required elements in the eval cases and the judge would have flagged their absence. Total setup time: one engineering day. Total cost per CI run: approximately $0.05. The ROI is immediate and measurable.
