LLM CI/CD

The Deploy That Broke Support

It is a Thursday afternoon in Q2 2024 at a customer success platform. An engineer merges a PR that upgrades the AI email summarization feature from one model version to a newer, cheaper alternative - supposedly faster, smarter, better on benchmarks. The unit tests pass. There are three of them: one checks that the output is a non-empty string, one checks the output is under 500 words, one checks that it contains the substring "summary." All three pass. The engineer deploys to production. Deployment takes eight minutes. The Grafana dashboard stays green. The engineer closes the laptop and goes to lunch.

Twenty-four hours later, the customer success Slack channel is on fire. The AI summaries are wrong. Not wrong in a way that triggers an alert - not empty, not too long, they definitely contain the word "summary." But wrong in the way that matters operationally: they are omitting action items. The new model, without any prompt changes, has different default behavior for structured extraction. It summarizes narrative text beautifully but silently drops the structured sections where action items live. Customer success managers have been sending incomplete handoff notes to clients for twenty-four hours. Clients are escalating. The support team has been re-reading emails manually to reconstruct what the AI missed.

The rollback is trivial - change one string in a config file, redeploy. Nine minutes total. But the damage to customer trust is done, and fixing it means reaching out to every affected client. The root cause of the incident was not the model upgrade - it was that the CI pipeline had three tests that were functionally useless for an LLM application. They tested the shape of the output, not the semantics. A pipeline that cannot catch "the model is now omitting action items" is not a CI pipeline for an LLM application. It is a pipeline-shaped object that provides false confidence.
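
To make the gap concrete, here is a minimal sketch of the three shape tests from the incident next to a crude semantic check. The email, the summary text, and the keyword-overlap heuristic are all invented for illustration; the heuristic only stands in for the LLM judge introduced later and is not a recommended production check.

```python
import re

# Hypothetical source email and model output (not from a real incident log).
email = (
    "Hi team, the renewal call went well.\n"
    "Action items:\n"
    "- Send revised quote by Friday\n"
    "- Schedule security review with IT\n"
)
summary = "Summary: The renewal call with the client went well."


def shape_checks(text: str) -> bool:
    """The three format-only tests: non-empty, under 500 words, contains 'summary'."""
    return (
        isinstance(text, str)
        and len(text) > 0
        and len(text.split()) < 500
        and "summary" in text.lower()
    )


def action_items_covered(source: str, text: str) -> float:
    """Fraction of '- ' bullet items in the source whose longer words all appear in the text."""
    items = re.findall(r"^- (.+)$", source, flags=re.MULTILINE)
    if not items:
        return 1.0
    lowered = text.lower()
    hits = 0
    for item in items:
        keywords = [w.lower() for w in re.findall(r"[A-Za-z]{5,}", item)]
        if keywords and all(k in lowered for k in keywords):
            hits += 1
    return hits / len(items)


print(shape_checks(summary))                 # shape tests pass
print(action_items_covered(email, summary))  # semantic coverage is zero
```

The shape tests pass while coverage of the action items is zero, which is exactly the failure the original pipeline could not see.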

Why This Exists

Traditional CI/CD is built on a simple contract: given the same input, a deterministic system produces the same output. assert output == expected_string either passes or fails. You can automate this check in milliseconds. LLMs violate this contract in every dimension that matters for testing:

Stochastic outputs. The same prompt at temperature 0.7 can return different text on every call. There is no single expected string to assert against.

Semantic quality. "Is this a good summary?" has no single computable answer. The test that matters - "does this response capture the action items?" - requires semantic understanding, not string matching.

Model version sensitivity. Upgrading from claude-3-5-sonnet-20241022 to a different version may change default verbosity, formatting behavior, or handling of specific constructs. These changes are invisible to traditional tests.

Prompt change sensitivity. Adding one sentence to a system prompt can break behavior in unrelated edge cases. The failure is rarely at the sentence that changed - it usually surfaces in the interaction between the new sentence and existing instructions.

The industry response to this challenge has produced a new CI/CD paradigm for LLM applications, built on three components that replace traditional assertion-based testing:

  1. LLM-as-judge gates - instead of assert output == expected, use a language model to evaluate output quality against a rubric
  2. Golden dataset regression - run every change against a curated set of representative inputs, compare scores statistically to a baseline
  3. Canary deployments with quality monitoring - route a fraction of production traffic to the new version, monitor quality signals from real users before full rollout
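
One way to read "compare scores statistically to a baseline" is a paired bootstrap over per-case score deltas. The sketch below is an assumption about how that comparison could be done, not the method this article's harness uses; the function names, the 95% interval, and the resample count are illustrative choices.

```python
import random
import statistics


def paired_bootstrap_delta(baseline, candidate, n_resamples=2000, seed=0):
    """Return (mean_delta, lower, upper): a ~95% bootstrap interval for candidate - baseline."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("Score lists must be non-empty and the same length.")
    # Pair scores case by case so per-case difficulty cancels out.
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed keeps the CI decision reproducible
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(deltas), lower, upper


def is_regression(baseline, candidate, tolerance=0.0):
    """Flag a regression only when the whole interval sits below -tolerance."""
    _, _, upper = paired_bootstrap_delta(baseline, candidate)
    return upper < -tolerance
```

Pairing matters here: both versions are scored on the same golden cases, so comparing per-case deltas is typically more sensitive than comparing two independent means.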

The Eval Harness

The eval harness is the Python engine that powers your CI gate. It runs every prompt version against a golden dataset and uses an LLM judge to produce a scored report that either passes or blocks the merge.

# ci/eval_harness.py
"""
LLM eval harness for CI gates.
Provides:
- Golden dataset loading
- LLM-judge scoring with reasoning capture
- Subset analysis by tag (edge cases, adversarial, etc.)
- Structured report with pass/fail determination
- Cost tracking per eval run
"""
import anthropic
import json
import time
import statistics
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Callable


# ── Data Structures ───────────────────────────────────────────────────────

@dataclass
class EvalCase:
"""A single case in the golden eval dataset."""
id: str
input: str
context: str = ""
expected_behavior: str = "" # What a good response should do (for LLM judge)
required_elements: list = field(default_factory=list) # Contains check
output_pattern: str = "" # Regex check
tags: list = field(default_factory=list)
weight: float = 1.0


@dataclass
class CaseResult:
"""Scored result for one eval case."""
case_id: str
input: str
output: str
score: float
strategy: str # exact_match | contains | regex | llm_judge
latency_ms: float
input_tokens: int
output_tokens: int
cost_usd: float
tags: list = field(default_factory=list)
judge_reasoning: str = ""
error: Optional[str] = None


@dataclass
class EvalReport:
"""Complete eval run report."""
prompt_name: str
prompt_version: str
model: str
n_cases: int
mean_score: float
std_score: float
p10_score: float
p50_score: float
p90_score: float
pass_rate: float # Fraction of individual cases above failure threshold
mean_latency_ms: float
total_cost_usd: float
subset_scores: dict # tag → mean score
failures: list # Cases below failure threshold
passed: bool
threshold: float
delta_from_baseline: Optional[float] = None


# Token cost table (USD per million tokens). Illustrative values - check the provider's current pricing.
COST_TABLE = {
"claude-opus-4-6": {"input": 15.0, "output": 75.0},
"claude-3-5-sonnet-20241022": {"input": 3.0, "output": 15.0},
"claude-haiku-4-5-20251001": {"input": 0.80, "output": 4.0},
"claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
}


def compute_cost(model: str, in_tokens: int, out_tokens: int) -> float:
p = COST_TABLE.get(model, {"input": 3.0, "output": 15.0})
return (in_tokens * p["input"] + out_tokens * p["output"]) / 1_000_000


# ── LLM Judge ─────────────────────────────────────────────────────────────

class LLMJudge:
"""
LLM-as-judge: evaluates output quality against expected behavior.
Uses claude-haiku-4-5-20251001 for cost efficiency:
~$0.005 per judgment vs ~$0.03 for Sonnet.
"""

def __init__(self, judge_model: str = "claude-haiku-4-5-20251001"):
self.client = anthropic.Anthropic()
self.model = judge_model

def score(
self,
user_input: str,
output: str,
expected_behavior: str,
reference: str = "",
) -> tuple[float, str]:
"""
Score an output against expected behavior.
Returns (score in [0,1], one-sentence reasoning).
"""
ref_section = (
f"\nReference answer for comparison:\n{reference}"
if reference else ""
)

prompt = f"""You are an expert evaluator for an AI assistant.

User input: {user_input[:400]}

Expected behavior (what the response must do):
{expected_behavior}
{ref_section}

Actual AI response:
{output[:800]}

Evaluate on a 0.0-1.0 scale:
1.0 - Fully satisfies expected behavior, accurate, complete
0.75 - Mostly satisfies, minor gaps
0.5 - Partially satisfies, missing key elements
0.25 - Significant problems or missing critical content
0.0 - Fails entirely, is harmful, or produces wrong output

Write exactly one sentence of reasoning.
Then on a new line write: SCORE: [decimal]

Example:
The response correctly identifies all required fields but omits the date range.
SCORE: 0.75"""

response = self.client.messages.create(
model=self.model,
max_tokens=150,
messages=[{"role": "user", "content": prompt}],
)
text = response.content[0].text.strip()

score = 0.5
reasoning = text
for line in text.split("\n"):
if line.startswith("SCORE:"):
try:
score = float(line.replace("SCORE:", "").strip())
reasoning = text.replace(line, "").strip()
break
except ValueError:
pass

return max(0.0, min(1.0, score)), reasoning


# ── Eval Harness ──────────────────────────────────────────────────────────

class EvalHarness:
"""
Runs a prompt version against a golden dataset and produces a scored EvalReport.
This is the core of the CI gate - every prompt PR runs through this.

Design goals:
- Fast enough for CI (target: 100 cases in under 10 minutes)
- Comprehensive enough to catch real regressions
- Cheap enough to run on every PR (use Haiku as judge)
"""

def __init__(
self,
pass_threshold: float = 0.85,
failure_threshold: float = 0.60, # Individual case below this = flagged failure
judge: Optional[LLMJudge] = None,
):
self.client = anthropic.Anthropic()
self.pass_threshold = pass_threshold
self.failure_threshold = failure_threshold
self.judge = judge or LLMJudge()

def run(
self,
prompt_name: str,
prompt_version: str,
system_prompt: str,
model: str,
max_tokens: int,
eval_cases: list[EvalCase],
temperature: float = 0.0, # Use 0 for deterministic CI runs
baseline_score: Optional[float] = None,
) -> EvalReport:
"""
Run the full eval harness.

Note: temperature=0.0 is recommended for CI runs. It does not eliminate
stochasticity at the model level but reduces variance enough to make
the threshold comparison meaningful.
"""
if not eval_cases:
raise ValueError("No eval cases provided.")

print(f"Eval: {prompt_name}@{prompt_version} | {model} | {len(eval_cases)} cases")
results = []

for i, case in enumerate(eval_cases):
result = self._run_case(case, system_prompt, model, max_tokens, temperature)
results.append(result)
status = "PASS" if result.score >= self.failure_threshold else "FAIL"
print(
f" [{i+1:3d}/{len(eval_cases)}] {case.id:45s} "
f"score={result.score:.2f} [{status}] ({result.strategy})"
)

return self._build_report(
prompt_name, prompt_version, model, results, baseline_score
)

def _run_case(
self,
case: EvalCase,
system: str,
model: str,
max_tokens: int,
temperature: float,
) -> CaseResult:
"""Run one eval case: generate output, then score it."""
user_content = case.input
if case.context:
user_content = f"{case.input}\n\nContext:\n{case.context}"

# Generate model output
start = time.monotonic()
try:
response = self.client.messages.create(
model=model,
max_tokens=max_tokens,
temperature=temperature,
system=system,
messages=[{"role": "user", "content": user_content}],
)
latency_ms = (time.monotonic() - start) * 1000
output = response.content[0].text
in_tok = response.usage.input_tokens
out_tok = response.usage.output_tokens
cost = compute_cost(model, in_tok, out_tok)
except Exception as e:
return CaseResult(
case_id=case.id, input=case.input, output="",
score=0.0, strategy="error", latency_ms=0,
input_tokens=0, output_tokens=0, cost_usd=0,
tags=case.tags, error=str(e),
)

# Choose and run eval strategy
score, strategy, reasoning = self._score(case, output)

return CaseResult(
case_id=case.id, input=case.input, output=output,
score=score, strategy=strategy,
latency_ms=latency_ms, input_tokens=in_tok,
output_tokens=out_tok, cost_usd=cost,
tags=case.tags, judge_reasoning=reasoning,
)

def _score(
self, case: EvalCase, output: str
) -> tuple[float, str, str]:
"""Select and run the best eval strategy for this case."""
import re

# 1. Exact match (highest precision, use when output is deterministic)
if case.required_elements == ["__EXACT__"]:
# Special sentinel for exact match mode
score = 1.0 if output.strip() == case.expected_behavior.strip() else 0.0
return score, "exact_match", ""

# 2. Contains check (all required elements must appear)
if case.required_elements:
out_lower = output.lower()
found = sum(1 for el in case.required_elements if el.lower() in out_lower)
return found / len(case.required_elements), "contains", ""

# 3. Regex pattern check
if case.output_pattern:
score = 1.0 if re.search(case.output_pattern, output, re.DOTALL) else 0.0
return score, "regex", ""

# 4. LLM judge (most flexible, highest quality signal)
if case.expected_behavior:
score, reasoning = self.judge.score(
case.input, output, case.expected_behavior
)
return score, "llm_judge", reasoning

# 5. No eval strategy - always pass (format-only tests)
return 1.0, "none", ""

def _build_report(
self,
prompt_name: str,
prompt_version: str,
model: str,
results: list[CaseResult],
baseline_score: Optional[float],
) -> EvalReport:
valid = [r for r in results if r.error is None]
scores = [r.score for r in valid]

if not scores:
raise RuntimeError("All eval cases returned errors. Check API key and model name.")

sorted_scores = sorted(scores)
n = len(sorted_scores)
pct = lambda p: sorted_scores[min(int(n * p), n - 1)]

# Subset scores by tag
tag_scores: dict[str, list[float]] = {}
for r in valid:
for tag in r.tags:
tag_scores.setdefault(tag, []).append(r.score)

mean_score = sum(scores) / len(scores)

return EvalReport(
prompt_name=prompt_name,
prompt_version=prompt_version,
model=model,
n_cases=len(results),
mean_score=mean_score,
std_score=statistics.stdev(scores) if len(scores) > 1 else 0.0,
p10_score=pct(0.10),
p50_score=pct(0.50),
p90_score=pct(0.90),
pass_rate=sum(1 for s in scores if s >= self.failure_threshold) / len(scores),
mean_latency_ms=sum(r.latency_ms for r in valid) / len(valid),
total_cost_usd=sum(r.cost_usd for r in results),
subset_scores={tag: sum(s)/len(s) for tag, s in tag_scores.items()},
failures=[r for r in valid if r.score < self.failure_threshold],
passed=mean_score >= self.pass_threshold,
threshold=self.pass_threshold,
delta_from_baseline=(
mean_score - baseline_score if baseline_score is not None else None
),
)

def print_report(self, report: EvalReport) -> None:
"""Print a formatted report to stdout."""
status = "PASSED" if report.passed else "FAILED"
delta_str = ""
if report.delta_from_baseline is not None:
sign = "+" if report.delta_from_baseline >= 0 else ""
delta_str = f" ({sign}{report.delta_from_baseline:.3f} vs baseline)"

print(f"\n{'='*65}")
print(f"EVAL: {report.prompt_name}@{report.prompt_version} [{status}]{delta_str}")
print(f"Model: {report.model}")
print(f"{'='*65}")
print(f"Score: mean={report.mean_score:.3f} std={report.std_score:.3f}")
print(f" p10={report.p10_score:.3f} p50={report.p50_score:.3f} "
f"p90={report.p90_score:.3f}")
print(f"Pass rate: {report.pass_rate:.1%} of cases above {self.failure_threshold}")
print(f"Latency: {report.mean_latency_ms:.0f}ms avg")
print(f"Cost: ${report.total_cost_usd:.4f} total")

if report.subset_scores:
print(f"\nSubset scores:")
for tag, score in sorted(report.subset_scores.items()):
ok = "OK " if score >= report.threshold else "FAIL"
print(f" [{ok}] {tag:35s} {score:.3f}")

if report.failures:
print(f"\nFailed cases ({len(report.failures)}):")
for r in report.failures[:5]:
print(f" [{r.case_id}] score={r.score:.2f} ({r.strategy})")
print(f" Input: {r.input[:80]}...")
if r.judge_reasoning:
print(f" Judge: {r.judge_reasoning[:100]}...")
print(f"{'='*65}\n")

def save_report(self, report: EvalReport, output_path: str) -> None:
"""Save report as JSON for CI artifact upload and historical tracking."""
data = {
**report.__dict__,
"failures": [r.__dict__ for r in report.failures],
}
with open(output_path, "w") as f:
json.dump(data, f, indent=2)
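
The module docstring lists golden dataset loading, but the listing above does not include a loader. The following is a minimal sketch of one, assuming the dataset is stored as JSONL with one case per line; `load_eval_cases` and the file layout are hypothetical, and the `EvalCase` dataclass is repeated only so the snippet runs on its own.

```python
import json
from dataclasses import dataclass, field, fields
from pathlib import Path


@dataclass
class EvalCase:
    """Mirror of the EvalCase fields in ci/eval_harness.py, repeated for self-containment."""
    id: str
    input: str
    context: str = ""
    expected_behavior: str = ""
    required_elements: list = field(default_factory=list)
    output_pattern: str = ""
    tags: list = field(default_factory=list)
    weight: float = 1.0


def load_eval_cases(path: str) -> list[EvalCase]:
    """Load one EvalCase per non-empty JSONL line; reject unknown keys and duplicate ids."""
    allowed = {f.name for f in fields(EvalCase)}
    cases: list[EvalCase] = []
    seen: set[str] = set()
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        unknown = set(record) - allowed
        if unknown:
            raise ValueError(f"{path}:{lineno}: unknown fields {sorted(unknown)}")
        case = EvalCase(**record)  # missing 'id' or 'input' raises TypeError here
        if case.id in seen:
            raise ValueError(f"{path}:{lineno}: duplicate case id {case.id!r}")
        seen.add(case.id)
        cases.append(case)
    return cases
```

Failing loudly on unknown keys and duplicate ids is a deliberate choice in this sketch: a silently ignored typo such as `expected_behaviour` would otherwise drop the case into the "no eval strategy" branch, which always scores 1.0.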

GitHub Actions CI Pipeline

The CI pipeline runs automatically on every pull request that touches prompt files or LLM-related application code. Speed matters: if the pipeline takes 30 minutes, engineers work around it.

# .github/workflows/llm-ci.yml
name: LLM CI Pipeline

on:
pull_request:
paths:
- 'prompts/**'
- 'src/llm/**'
- 'evals/**'

env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
PASS_THRESHOLD: "0.85"

jobs:
# ── Stage 1: Schema and format validation (fast, free) ────────────────
schema-check:
name: Schema Validation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
python-version: '3.11'
- run: pip install pyyaml jsonschema
- name: Validate prompt YAML schemas
run: python ci/validate_schemas.py
- name: Validate eval dataset JSON
run: python ci/validate_eval_datasets.py

# ── Stage 2: Eval gate (LLM judge, blocks merge on failure) ───────────
eval-gate:
name: Prompt Eval Gate
runs-on: ubuntu-latest
needs: schema-check
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0

- uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Cache pip dependencies
uses: actions/cache@v3
with:
path: ~/.cache/pip
key: pip-${{ hashFiles('requirements-ci.txt') }}

- run: pip install anthropic pyyaml

- name: Detect changed prompt files
id: detect
run: |
git fetch origin ${{ github.base_ref }}
git diff --name-only origin/${{ github.base_ref }}...HEAD \
| grep '^prompts/' > changed_prompts.txt || true
echo "count=$(wc -l < changed_prompts.txt | tr -d ' ')" >> $GITHUB_OUTPUT
echo "Changed prompts:"
cat changed_prompts.txt

- name: Run eval gate
if: steps.detect.outputs.count != '0'
id: eval
run: |
python ci/run_eval_gate.py \
--changed-files changed_prompts.txt \
--threshold ${{ env.PASS_THRESHOLD }} \
--output eval_report.json
continue-on-error: true # Don't fail here - fail in the next step after posting comment

- name: Post eval results as PR comment
if: steps.detect.outputs.count != '0'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
let body = '## LLM Eval Gate Results\n\n';
try {
const reports = JSON.parse(fs.readFileSync('eval_report.json'));
for (const r of reports) {
const icon = r.passed ? '✅' : '❌';
body += `### ${icon} ${r.prompt_name}@${r.prompt_version}\n\n`;
body += `| Metric | Value |\n|--------|-------|\n`;
body += `| Mean score | \`${r.mean_score.toFixed(3)}\` |\n`;
body += `| Threshold | \`${r.threshold}\` |\n`;
body += `| Pass rate | \`${(r.pass_rate * 100).toFixed(1)}%\` |\n`;
body += `| Cost | \`$${r.total_cost_usd.toFixed(4)}\` |\n`;
body += `| Cases | ${r.n_cases} |\n\n`;

if (r.subset_scores && Object.keys(r.subset_scores).length > 0) {
body += `**Subset scores:**\n`;
for (const [tag, score] of Object.entries(r.subset_scores)) {
const icon2 = score >= r.threshold ? '✅' : '⚠️';
body += `- ${icon2} \`${tag}\`: ${parseFloat(score).toFixed(3)}\n`;
}
body += '\n';
}

if (r.failures && r.failures.length > 0) {
body += `**Failed cases (${r.failures.length}):**\n`;
for (const f of r.failures.slice(0, 3)) {
body += `- score=\`${f.score.toFixed(2)}\`: ${f.input.slice(0, 80)}...\n`;
if (f.judge_reasoning) {
body += ` _${f.judge_reasoning.slice(0, 100)}_\n`;
}
}
}
}
} catch (e) {
body += '_Could not read eval results._\n';
}
await github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body,
});

- name: Fail build if eval gate failed
if: steps.detect.outputs.count != '0' && steps.eval.outcome == 'failure'
run: |
echo "Eval gate FAILED. Review the PR comment for details."
exit 1

- name: Upload eval report artifact
# always(): upload the report even when the previous step failed the build
if: always() && steps.detect.outputs.count != '0'
uses: actions/upload-artifact@v4
with:
name: eval-report-${{ github.sha }}
path: eval_report.json
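
The workflow calls `ci/run_eval_gate.py`, which this section does not show. The sketch below covers only the part the workflow depends on: writing a JSON list of reports and turning them into a process exit code. The function names and the choice to treat an empty report list as a failure are assumptions, not the original script.

```python
import json


def gate_exit_code(reports: list[dict]) -> int:
    """Return 0 only if every report passed; an empty list counts as a failure."""
    if not reports:
        # The gate step only runs when prompts changed, so no reports means something broke.
        return 1
    return 0 if all(r.get("passed") is True for r in reports) else 1


def write_reports(reports: list[dict], output_path: str) -> None:
    """Write the JSON list that the PR-comment step reads with JSON.parse."""
    with open(output_path, "w") as f:
        json.dump(reports, f, indent=2)


# Hypothetical reports with the fields the PR comment script reads.
demo_reports = [
    {"prompt_name": "email_summary", "prompt_version": "v7", "passed": True,
     "mean_score": 0.91, "threshold": 0.85, "pass_rate": 0.96,
     "total_cost_usd": 0.42, "n_cases": 50, "subset_scores": {}, "failures": []},
    {"prompt_name": "ticket_triage", "prompt_version": "v3", "passed": False,
     "mean_score": 0.78, "threshold": 0.85, "pass_rate": 0.80,
     "total_cost_usd": 0.37, "n_cases": 50, "subset_scores": {}, "failures": []},
]
print(gate_exit_code(demo_reports))  # non-zero because one prompt failed
```

A real entry point would call `sys.exit(gate_exit_code(reports))` after `write_reports(...)`, so the report exists on disk before the step fails and the comment step can still read it.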

Canary Deployment with Quality Monitoring

After the CI gate passes, the canary deployment system monitors real production traffic for quality regression before full rollout.

# deployment/canary.py
"""
Canary deployment system for LLM features.
Routes a configurable percentage of traffic to the new version,
scores quality for both versions, and triggers rollback if signals degrade.
"""
import anthropic
import json
import hashlib
import time
import statistics
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from pathlib import Path
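from typing import Optional

# Assumes ci/eval_harness.py (above) is importable as a module; adjust the path to your layout.
from ci.eval_harness import LLMJudge, compute_cost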


@dataclass
class CanaryConfig:
feature_name: str
incumbent_version: str
canary_version: str
canary_pct: float = 0.05 # 5% to canary by default
monitor_hours: int = 48 # How long to monitor before decision
min_canary_samples: int = 100 # Min requests before deciding
quality_drop_threshold: float = 0.05 # Absolute score drop triggers rollback
error_rate_threshold: float = 0.02 # 2% error rate triggers rollback
cost_spike_threshold: float = 0.50 # 50% cost increase triggers alert


class CanaryMonitor:
"""
Records per-version quality, error, and cost metrics.
Evaluates canary health and recommends action.
"""

def __init__(self, config: CanaryConfig, store_dir: str = "canary_data"):
self.config = config
self.store = Path(store_dir)
self.store.mkdir(exist_ok=True)
self._metrics: dict[str, list[dict]] = {
"incumbent": [],
"canary": [],
}
self._load()

def route(self, user_id: str) -> str:
"""
Route a request to incumbent or canary.
Uses consistent hashing so the same user always gets the same version.
Avoids confusing users who would see different behavior on different requests.
"""
bucket = int(
hashlib.md5(f"{user_id}:{self.config.feature_name}".encode()).hexdigest(),
16
) % 100
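# round(), not int(): 0.29 * 100 is 28.999... in floating point and int() would truncate it to 28
if bucket < round(self.config.canary_pct * 100):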
return self.config.canary_version
return self.config.incumbent_version

def record(
self,
version: str,
quality_score: float,
latency_ms: float,
cost_usd: float,
error: bool = False,
) -> None:
"""Record metrics for one request."""
group = "canary" if version == self.config.canary_version else "incumbent"
self._metrics[group].append({
"ts": datetime.utcnow().isoformat(),
"score": quality_score,
"latency_ms": latency_ms,
"cost_usd": cost_usd,
"error": error,
})
self._save()

def evaluate(self) -> dict:
"""
Evaluate canary health. Returns:
{
"action": "continue" | "promote" | "rollback" | "alert",
"reason": str,
"incumbent_score": float,
"canary_score": float,
...
}
"""
can = self._metrics["canary"]
inc = self._metrics["incumbent"]

if len(can) < self.config.min_canary_samples:
return {
"action": "continue",
"reason": f"Insufficient canary samples: {len(can)}/{self.config.min_canary_samples}",
"canary_samples": len(can),
}

# Core metrics
can_errors = [m for m in can if m["error"]]
can_ok = [m for m in can if not m["error"]]
inc_ok = [m for m in inc if not m["error"]]

can_error_rate = len(can_errors) / len(can) if can else 0
can_scores = [m["score"] for m in can_ok]
inc_scores = [m["score"] for m in inc_ok]

can_mean = statistics.mean(can_scores) if can_scores else 0
inc_mean = statistics.mean(inc_scores) if inc_scores else 0
score_delta = can_mean - inc_mean

can_cost = statistics.mean(m["cost_usd"] for m in can_ok) if can_ok else 0
inc_cost = statistics.mean(m["cost_usd"] for m in inc_ok) if inc_ok else 0
cost_delta_pct = (can_cost - inc_cost) / inc_cost if inc_cost > 0 else 0

base_result = {
"canary_samples": len(can),
"incumbent_score": inc_mean,
"canary_score": can_mean,
"score_delta": score_delta,
"canary_error_rate": can_error_rate,
"cost_delta_pct": cost_delta_pct,
}

# 1. Error rate spike → rollback immediately
if can_error_rate > self.config.error_rate_threshold:
return {
**base_result,
"action": "rollback",
"reason": (
f"Canary error rate {can_error_rate:.1%} exceeds "
f"threshold {self.config.error_rate_threshold:.1%}"
),
}

# 2. Quality drop → rollback
if score_delta < -self.config.quality_drop_threshold:
return {
**base_result,
"action": "rollback",
"reason": (
f"Canary quality drop {score_delta:.3f} exceeds "
f"threshold -{self.config.quality_drop_threshold}"
),
}

# 3. Cost spike → alert (not rollback - may be intentional)
if cost_delta_pct > self.config.cost_spike_threshold:
return {
**base_result,
"action": "alert",
"reason": (
f"Canary cost {cost_delta_pct:.1%} higher than incumbent. "
f"Investigate before promoting."
),
}

# 4. Monitoring window complete → promote
if can:
oldest = datetime.fromisoformat(can[0]["ts"])
elapsed = datetime.utcnow() - oldest
if elapsed >= timedelta(hours=self.config.monitor_hours):
if score_delta >= -0.01: # Canary is equivalent or better
return {
**base_result,
"action": "promote",
"reason": (
f"Monitoring window complete. "
f"Canary score {can_mean:.3f} vs incumbent {inc_mean:.3f} "
f"(delta {score_delta:+.3f})."
),
}

return {
**base_result,
"action": "continue",
"reason": "Canary within bounds, monitoring continues.",
}

def _save(self) -> None:
path = self.store / f"{self.config.feature_name}_metrics.json"
path.write_text(json.dumps(self._metrics, indent=2))

def _load(self) -> None:
path = self.store / f"{self.config.feature_name}_metrics.json"
if path.exists():
self._metrics = json.loads(path.read_text())


# ── Integration: Request with canary routing ──────────────────────────────

def handle_request_with_canary(
user_id: str,
user_message: str,
monitor: CanaryMonitor,
system_prompts: dict[str, str], # version → system prompt text
judge: "LLMJudge", # from eval_harness.py
) -> str:
"""
Handle a production request through the canary system.
Records quality metrics for monitoring and decision-making.
"""
client = anthropic.Anthropic()
version = monitor.route(user_id)
system = system_prompts[version]

start = time.monotonic()
error = False
output = ""
cost = 0.0

try:
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=512,
system=system,
messages=[{"role": "user", "content": user_message}],
)
output = response.content[0].text
cost = compute_cost(
"claude-3-5-sonnet-20241022",
response.usage.input_tokens,
response.usage.output_tokens,
)
except Exception:
error = True

latency_ms = (time.monotonic() - start) * 1000

# Score quality asynchronously in production; synchronously here for simplicity
quality_score = 0.5
if not error and output:
quality_score, _ = judge.score(
user_message, output,
"Provide an accurate, helpful, concise response"
)

monitor.record(version, quality_score, latency_ms, cost, error)

# Check canary health (in production, run this on a background scheduler)
decision = monitor.evaluate()
if decision["action"] == "rollback":
# In production: trigger PagerDuty, update feature flag config, alert team
print(f"CANARY ROLLBACK TRIGGERED: {decision['reason']}")
elif decision["action"] == "promote":
print(f"CANARY READY TO PROMOTE: {decision['reason']}")

return output
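
Before trusting the router with production traffic, it can help to confirm that the hash bucketing is stable per user and that the canary share lands near the configured percentage. This sketch reimplements the same MD5 bucketing as a standalone function; the feature name and user IDs are made up for the check.

```python
import hashlib


def canary_bucket(user_id: str, feature_name: str) -> int:
    """Same bucketing idea as CanaryMonitor.route: a stable integer in [0, 100)."""
    digest = hashlib.md5(f"{user_id}:{feature_name}".encode()).hexdigest()
    return int(digest, 16) % 100


def canary_share(user_ids, feature_name: str, canary_pct: float) -> float:
    """Fraction of the given users that would be routed to the canary."""
    cutoff = round(canary_pct * 100)
    hits = sum(1 for u in user_ids if canary_bucket(u, feature_name) < cutoff)
    return hits / len(user_ids)


# Synthetic user IDs, purely for the sanity check.
users = [f"user-{i}" for i in range(20_000)]
share = canary_share(users, "email_summary", 0.05)
print(f"canary share: {share:.3f}")
```

Including the feature name in the hash input means a user who lands in the canary for one feature is not automatically in the canary for every other feature.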

Rollback Architecture

The most important property of a rollback is that it must be fast and not require a redeploy. A redeploy takes 5–15 minutes. An incident that is visible to users for 15 minutes is measurably worse than one visible for 30 seconds. Build your rollback mechanism so it operates at the configuration level, not the deployment level.

# deployment/rollback.py
"""
Rollback manager - changes the active prompt/model version via a config file.
Application code reads this config on each request (or with a short-lived cache).
In production, replace with a feature flag service (LaunchDarkly, Unleash, Split).
"""
import json
import time
from pathlib import Path
from datetime import datetime


class RollbackManager:
"""
File-based feature configuration with rollback support.
The application polls this config (or receives a push update) and switches
behavior without requiring a code deploy.
"""

def __init__(self, config_path: str = "feature_config.json"):
self.path = Path(config_path)
self._config: dict = {}
self._load()

def _load(self) -> None:
if self.path.exists():
self._config = json.loads(self.path.read_text())

def _save(self) -> None:
self.path.write_text(json.dumps(self._config, indent=2))

def set_version(
self,
feature: str,
version: str,
reason: str = "",
operator: str = "",
) -> None:
"""Set the active version for a feature. Appends to change log."""
entry = {
"version": version,
"set_at": datetime.utcnow().isoformat(),
"reason": reason,
"operator": operator,
}
self._config[feature] = entry
self._save()

# Append to audit log
log_path = self.path.parent / f"{feature}_audit.jsonl"
with open(log_path, "a") as f:
f.write(json.dumps(entry) + "\n")

print(f"[{feature}] active version → {version}"
+ (f" ({reason})" if reason else ""))

def get_version(self, feature: str, default: str = "latest") -> str:
"""Get the currently active version for a feature."""
return self._config.get(feature, {}).get("version", default)

def rollback(
self,
feature: str,
to_version: str,
reason: str,
operator: str = "automated",
) -> None:
"""
Immediately roll back a feature to a known-good version.
This is the emergency brake - call it when canary signals trigger.
"""
self.set_version(
feature, to_version,
reason=f"ROLLBACK: {reason}",
operator=operator,
)
print(f"\nROLLBACK COMPLETE")
print(f" Feature: {feature}")
print(f" Version: {to_version}")
print(f" Reason: {reason}")
print(f" Operator: {operator}")
print(f" Time: {datetime.utcnow().isoformat()}")

def get_audit_log(self, feature: str, limit: int = 20) -> list[dict]:
"""Retrieve recent version changes for a feature."""
log_path = self.path.parent / f"{feature}_audit.jsonl"
if not log_path.exists():
return []
with open(log_path) as f:
lines = [json.loads(l) for l in f if l.strip()]
# Newest first: reverse the append-only log, then keep the first `limit` entries
return list(reversed(lines))[:limit]
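
The docstring above says the application reads the config on each request or with a short-lived cache. A minimal sketch of the cached variant follows; `CachedVersionReader`, its 5-second default TTL, and the injectable clock are assumptions for illustration rather than part of the original module.

```python
import json
import time
from pathlib import Path


class CachedVersionReader:
    """Read-only view of feature_config.json that re-reads the file at most once per TTL."""

    def __init__(self, config_path="feature_config.json", ttl_seconds=5.0,
                 clock=time.monotonic):
        self.path = Path(config_path)
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._config: dict = {}
        self._loaded_at = None

    def get_version(self, feature: str, default: str = "latest") -> str:
        now = self.clock()
        if self._loaded_at is None or now - self._loaded_at >= self.ttl:
            try:
                self._config = json.loads(self.path.read_text())
            except (FileNotFoundError, json.JSONDecodeError):
                # Missing or half-written file: keep serving the last known-good config.
                pass
            self._loaded_at = now
        return self._config.get(feature, {}).get("version", default)
```

With this shape, the TTL bounds how long a rollback takes to reach every process: a 5-second TTL means a version change written by `RollbackManager` is picked up within roughly 5 seconds, with no redeploy.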

Nightly Regression Scheduling

Beyond PR gates, run your eval harness on a nightly schedule to catch provider-side model updates that happen without notice:

# .github/workflows/nightly-regression.yml
name: Nightly Regression Check

on:
schedule:
- cron: '0 2 * * *' # 2 AM UTC daily
workflow_dispatch: # Allow manual trigger

jobs:
full-regression:
name: Full Regression Suite
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with:
python-version: '3.11'
- run: pip install anthropic pyyaml

- name: Run full regression suite (all production prompts)
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
python ci/nightly_regression.py \
--all-production-prompts \
--threshold 0.82 \
--output nightly_report.json

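- name: Upload report artifact
# always(): keep the report when the regression step fails, which is when it is needed
if: always()
uses: actions/upload-artifact@v4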
with:
name: nightly-regression-${{ github.run_id }}
path: nightly_report.json

- name: Alert on regression via Slack
if: failure()
uses: 8398a7/action-slack@v3
with:
status: failure
text: |
Nightly regression detected quality regression.
Review artifact: nightly-regression-${{ github.run_id }}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

Production Engineering Notes

Eval Speed Is Not Optional

If your CI eval gate takes 30 minutes, engineers will find ways to work around it - merging prompt changes without waiting for CI, flagging tests as flaky, or simply not tagging prompt changes as prompt changes. Target under 10 minutes for your CI eval run. Strategies: keep the fast regression set small (30–50 cases is enough for meaningful signals), use claude-haiku-4-5-20251001 as the judge (not Opus), run eval cases concurrently where possible, and cache results for unchanged prompt versions using the prompt hash.
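
Two of the strategies above, concurrent execution and caching by prompt hash, can be sketched in a few lines. The helper names and the exact fields in the cache key are assumptions; the key here covers the prompt text, both model names, and the full case content so that editing any of them invalidates the cached result.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def eval_cache_key(system_prompt: str, model: str, judge_model: str,
                   cases: list[dict]) -> str:
    """Stable SHA-256 key; changes when the prompt, either model, or any case changes."""
    payload = json.dumps(
        {
            "prompt": system_prompt,
            "model": model,
            "judge": judge_model,
            "cases": sorted(cases, key=lambda c: c["id"]),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_cases_concurrently(run_case, cases, max_workers: int = 8) -> list:
    """Apply run_case to every case on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, cases))
```

Threads suit this workload because each case spends nearly all of its time waiting on network calls. Keep `max_workers` below your provider rate limit, otherwise rate-limit errors will show up as failed eval cases.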

Separate Fast and Comprehensive Eval Sets

Maintain two eval datasets: a fast regression set (20–50 cases, runs in CI on every PR, 5–8 minutes) and a comprehensive eval set (200+ cases, runs nightly or before major releases). The fast set catches regressions quickly without blocking development. The comprehensive set gives you confident accuracy estimates. The two sets should have overlapping but not identical cases - overlap for consistency comparison, different cases for coverage breadth.

Track Eval Costs as Infrastructure Costs

Running an LLM judge eval on 50 cases twice a day costs real money - approximately $2–5 per day depending on the judge model and case complexity. Over a year, this is $700–1,800. Track this explicitly in your infrastructure cost dashboard. A reasonable ceiling: eval costs should not exceed 5% of production inference costs. If you are spending more than that on eval relative to production, you are over-evaluating; if you are spending less than 1%, you are under-evaluating and your quality monitoring is insufficient.
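
The budget rule above reduces to simple arithmetic. A small sketch, with hypothetical function names and the 1% and 5% bounds taken from the paragraph:

```python
def annual_cost(daily_usd: float, days: int = 365) -> float:
    """Project a daily eval spend over a year."""
    return daily_usd * days


def eval_budget_status(eval_usd: float, inference_usd: float,
                       low: float = 0.01, high: float = 0.05) -> str:
    """Classify eval spend as a share of production inference spend."""
    if inference_usd <= 0:
        raise ValueError("inference_usd must be positive.")
    ratio = eval_usd / inference_usd
    if ratio > high:
        return "over-evaluating"
    if ratio < low:
        return "under-evaluating"
    return "within range"


print(annual_cost(2), annual_cost(5))      # yearly spend at the low and high daily estimates
print(eval_budget_status(1_000, 50_000))   # eval is 2% of inference spend
```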

:::warning Non-Determinism Requires Conservative Thresholds
Do not set your CI pass threshold so close to your expected score that run-to-run variance decides the outcome. If your prompt consistently scores 0.87 mean and your threshold is 0.87, you will have intermittent CI failures from sampling noise. Set your threshold at least 0.03–0.05 below your actual expected score. For a prompt you expect to score 0.88, set the threshold at 0.83. Use multiple eval runs and average the scores if variance is high.
:::
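
A threshold can be derived from a handful of repeated eval runs rather than picked by eye. The sketch below averages the run means and subtracts a margin; widening the margin to three standard deviations when runs are noisy is an assumption added here, not a rule from this article.

```python
import statistics


def suggest_threshold(run_means: list[float], margin: float = 0.05) -> tuple[float, float]:
    """Return (threshold, spread) from the mean scores of repeated eval runs."""
    if len(run_means) < 2:
        raise ValueError("Need at least two runs to estimate variance.")
    mean = statistics.fmean(run_means)
    spread = statistics.stdev(run_means)
    # Use the fixed margin unless run-to-run noise is large enough to need more room.
    threshold = mean - max(margin, 3 * spread)
    return round(threshold, 3), round(spread, 4)


# Hypothetical mean scores from five runs of the same prompt version.
threshold, spread = suggest_threshold([0.88, 0.87, 0.89, 0.88, 0.88])
print(threshold, spread)
```

For runs that average 0.88 with little spread, this reproduces the 0.83 threshold from the example above.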

:::danger Never Skip the Eval Gate Under Deadline Pressure The most common cause of production LLM incidents is "we had to ship fast and the eval takes too long." If the deadline is real, make the eval faster - smaller golden dataset, faster judge model, parallel execution. If the eval cannot be made faster, negotiate the deadline. A production LLM incident under deadline pressure costs more in remediation, customer trust repair, and engineering time than the deadline was worth. :::

:::tip Use temperature=0.0 for CI Runs Setting temperature to 0.0 for eval runs reduces output variance significantly, making CI pass/fail decisions more stable. It does not eliminate all non-determinism at the model level (hosted models have additional sources of variance), but it reduces the LLM-side variance enough that threshold comparisons become reliable. Run production at your actual temperature; run CI evals at 0.0. :::

Interview Q&A

Q1: Why does traditional CI fail for LLM applications, and what replaces it?

Traditional CI relies on deterministic assertions: assert output == expected. This fails for LLMs because outputs are stochastic - the same prompt can return different text from call to call - and because quality is semantic, not structural. You cannot write a regex that detects "the model is now omitting action items from summaries." The replacement is a three-part system: schema validation for structural requirements (output is valid JSON, required fields present, response is non-empty), LLM-as-judge scoring for semantic quality (uses a capable model to evaluate output against a rubric, returning a float score), and statistical thresholding (mean score over a golden dataset must exceed a threshold, with subset analysis to catch partial regressions). None of these is perfect on its own, but together they catch the failure modes that traditional CI cannot see.
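The three parts can be wired together in a few lines. The sketch below is illustrative only: the required fields (`summary`, `action_items`) are an assumed schema, `evaluate` is a hypothetical name, and `judge` is any callable that returns a float score. In a real pipeline it would call an LLM judge with a rubric.

```python
import json
from statistics import mean
from typing import Callable, Dict, List

# Assumed output schema for a summarization feature (hypothetical).
REQUIRED_FIELDS = ("summary", "action_items")


def validate_schema(raw_output: str) -> bool:
    """Part 1: structural check. Valid JSON object with the required fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(field in data for field in REQUIRED_FIELDS)


def evaluate(
    outputs: List[str],
    judge: Callable[[str], float],
    threshold: float = 0.80,
) -> Dict[str, object]:
    """Parts 2 and 3: judge each valid output, then threshold the mean score."""
    scores = []
    for raw in outputs:
        # Outputs that fail schema validation score 0.0 without calling the judge.
        scores.append(judge(raw) if validate_schema(raw) else 0.0)
    mean_score = mean(scores)
    return {"mean": mean_score, "passed": mean_score >= threshold}
```

Scoring schema failures as 0.0 is a design choice made for this sketch; failing the run outright on any schema violation is an equally reasonable alternative.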

Q2: How do you design a golden eval dataset for CI? What makes a good eval case?

A good golden eval dataset has four properties. Representative: covers the actual distribution of production queries, including common cases, edge cases, and failure-prone inputs - not just the easy happy path. Diverse: cases are labeled with tags (core, edge-case, adversarial, format) so subset scores can be computed - a regression that only affects edge cases is invisible without tag-level reporting. Annotated with expected behavior: each case has a description of what a correct response must do, written precisely enough for an LLM judge to evaluate against. Weighted: edge cases and adversarial cases should have higher weight than routine core cases. 50–100 well-designed cases with these properties provide more CI value than 500 poorly designed cases. Add cases aggressively after every production incident - each incident is a gap in your eval coverage.
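Tags and weights only pay off if the scoring code uses them. A minimal sketch of a case record with weighted and tag-level aggregation follows; the `EvalCase` fields and both function names are hypothetical, and the specific weights are chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalCase:
    case_id: str
    tag: str                # e.g. "core", "edge-case", "adversarial", "format"
    weight: float           # higher for edge and adversarial cases
    expected_behavior: str  # what a correct response must do, for the judge


def weighted_mean(cases: List[EvalCase], scores: Dict[str, float]) -> float:
    """Overall score where heavier cases count for more."""
    total_weight = sum(case.weight for case in cases)
    return sum(case.weight * scores[case.case_id] for case in cases) / total_weight


def scores_by_tag(cases: List[EvalCase], scores: Dict[str, float]) -> Dict[str, float]:
    """Unweighted mean score per tag, to expose partial regressions."""
    buckets = defaultdict(list)
    for case in cases:
        buckets[case.tag].append(scores[case.case_id])
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}
```

If two core cases score 1.0 and a double-weighted edge case scores 0.5, the weighted mean is 0.75, and the per-tag report shows the edge-case subset at 0.5 even though the core subset looks perfect.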

Q3: Walk me through a canary deployment for an LLM application end-to-end.

After the CI eval gate passes and the PR merges, the canary begins. Deploy the new prompt/model version to your serving infrastructure but configure routing to send only 5% of production traffic to it. Use consistent hashing on user ID so the same user always receives the same version within their session - inconsistency within a session confuses users and muddies the quality signal. For every request in both the incumbent and canary cohorts, score quality using an LLM judge (asynchronously or on a sampled basis) and record score, latency, error flag, and cost to a metrics store. Monitor two primary signals: quality score differential (is canary's rolling mean within 0.05 of incumbent's?) and error rate (is canary producing more API errors or format failures?). If either signal degrades beyond threshold, trigger automatic rollback - update the routing config, no redeploy required, sub-30-second recovery. If both signals hold for 48 hours with at least 100 canary samples, promote the canary to 100% and deprecate the incumbent.
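The routing and promotion logic described above can be sketched as below. Several things here are assumptions rather than part of the lesson: the function names, the use of simple deterministic hash bucketing on user ID (a simplification, not a full consistent-hash ring), and the `max_error_increase` tolerance, for which the text gives no number.

```python
import hashlib
from statistics import mean
from typing import List


def route_version(user_id: str, canary_percent: int = 5) -> str:
    """Map a user to a stable bucket in [0, 100) so routing never flips mid-session."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "incumbent"


def canary_decision(
    incumbent_scores: List[float],
    canary_scores: List[float],
    incumbent_error_rate: float,
    canary_error_rate: float,
    hours_elapsed: float,
    max_quality_drop: float = 0.05,    # from the text: within 0.05 of incumbent
    max_error_increase: float = 0.01,  # assumed tolerance, not from the text
    min_samples: int = 100,
    min_hours: float = 48,
) -> str:
    """Return 'rollback', 'promote', or 'continue' for the current canary state."""
    if incumbent_scores and canary_scores:
        if mean(incumbent_scores) - mean(canary_scores) > max_quality_drop:
            return "rollback"
    if canary_error_rate - incumbent_error_rate > max_error_increase:
        return "rollback"
    if hours_elapsed >= min_hours and len(canary_scores) >= min_samples:
        return "promote"
    return "continue"
```

Because the decision is a pure function of recorded metrics, a scheduler can re-evaluate it every few minutes and flip the routing config on `rollback` without a redeploy.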

Q4: How do you handle LLM provider model updates that happen without notice?

Three defenses in depth. First, pin model versions explicitly in your code and configuration - use claude-3-5-sonnet-20241022 with the full date suffix, not just claude-3-5-sonnet. Most providers support explicit version IDs. Second, run your full regression eval suite on a nightly schedule, not just on code changes. If Anthropic updates the underlying model on a Monday, your nightly eval on Tuesday catches it. Third, maintain production quality monitoring: if you score 5–10% of production requests and maintain a rolling quality score alert, an unexpected model provider change shows up as an anomaly in your production metrics within hours. When you want to intentionally upgrade model versions, treat it like a prompt change: run a head-to-head eval on your golden dataset before routing any production traffic to the new version.
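The first defense is easy to enforce mechanically. The check below is a sketch that assumes pinned IDs end in an eight-digit date suffix, as in the examples in this lesson; other providers use different versioning conventions, so the pattern would need adjusting. `is_pinned`, `unpinned_models`, and the config shape are hypothetical.

```python
import re
from typing import Dict, List

# Assumption: a pinned model ID ends in a date suffix such as -20241022.
DATE_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id: str) -> bool:
    """True when the model ID carries an explicit dated version."""
    return bool(DATE_SUFFIX.search(model_id))


def unpinned_models(config: Dict[str, str]) -> List[str]:
    """Names of config entries whose model ID is a floating alias."""
    return sorted(name for name, model_id in config.items() if not is_pinned(model_id))
```

Running `unpinned_models` over the model config in CI and failing the build on a non-empty result keeps floating aliases out of production.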

Q5: What is the minimum viable CI/CD setup for a two-person team shipping an LLM feature?

Three things, achievable in one day. First, a 20-example golden dataset in a JSON file - cover the core task, two edge cases, and one adversarial case. Second, a Python script that runs each example through the current prompt and scores the output with claude-haiku-4-5-20251001 as judge, printing a mean score and failing with exit code 1 if the score drops below 0.80. Third, a GitHub Actions workflow that runs this script on every PR that touches files under prompts/. That is the complete minimum. It will catch the failure mode that caused the example at the start of this lesson - the model upgrade that silently dropped action items - because the action items are required elements in the eval cases and the judge would have flagged their absence. Total setup time: one engineering day. Total cost per CI run: approximately $0.05. The ROI is immediate and measurable.
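The second item, the gate script, might look roughly like this. It is a sketch under stated assumptions: `run_gate` is a hypothetical name, the dataset is assumed to be a JSON list of objects with an `input` field, and `generate` and `judge` are injected callables so the gating logic can be exercised without network access. In real use they would call the production prompt and the judge model.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Callable


def run_gate(
    dataset_path: str,
    generate: Callable[[str], str],
    judge: Callable[[dict, str], float],
    threshold: float = 0.80,
) -> int:
    """Score every golden case and return a process exit code (0 pass, 1 fail)."""
    cases = json.loads(Path(dataset_path).read_text(encoding="utf-8"))
    scores = [judge(case, generate(case["input"])) for case in cases]
    score = mean(scores)
    print(f"mean score: {score:.3f} over {len(scores)} cases (threshold {threshold:.2f})")
    return 0 if score >= threshold else 1
```

A CI entry point would finish with `sys.exit(run_gate(...))`, so that a score below the threshold fails the workflow step.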

© 2026 EngineersOfAI. All rights reserved.