

Offline vs. Online Evaluation

The 90% Pass Rate That Meant Nothing

They had been meticulous. Before every release, the team ran 500 carefully curated test cases through their AI assistant. The threshold was 85% - any release scoring below that got blocked and sent back to the prompt engineering team. Over six months, they had raised the pass rate from 78% to 91%. The trend was unambiguously upward. Every week, the dashboard showed green.

But something was wrong. User retention was flat. The thumbs-up rate in the product hadn't moved in five months. The customer success team reported the same complaints in roughly the same proportions: answers that were technically correct but weirdly confident about uncertain topics, responses that ignored the follow-up context that users had clearly provided, occasional bizarre tangents that no one in QA had ever triggered. The system was getting better on the test set, but users couldn't tell.

The post-mortem took three days. They audited 200 recent support tickets against the test set and found the problem immediately: the test set had been built six months earlier, at launch, from a carefully curated library of expected questions. It was a beautiful dataset - diverse topics, edge cases, tricky phrasings, adversarial inputs. But it was a snapshot of what the team had imagined users would ask.

Real users asked different things. They asked in multi-turn conversations that evolved in directions the team hadn't anticipated. They switched topics mid-conversation. They asked highly domain-specific questions that required deep knowledge of their industry. They made typos and used company-internal jargon. The tail of the production distribution - the long tail that no curated dataset fully covers - was where all the failures lived.

The offline test set was measuring the team's ability to anticipate user needs. It was not measuring how well the system actually served users. The gap between offline and online evaluation was the gap between imagination and reality.

Why Both Are Necessary

Neither offline nor online evaluation alone is sufficient. Understanding why requires understanding what each can and cannot do.

Offline evaluation runs on a static, curated dataset before deployment. It is fast (you can run it in minutes), cheap, reproducible, and fully controlled. You can compare two model versions against identical inputs and get a clean signal. You can run it in CI as a deployment gate. You can share the dataset with other teams and external researchers.

What it cannot do: represent the full distribution of production inputs. No dataset, however carefully curated, captures the true diversity of what real users will do. The tail of the distribution is always longer than you think.

Online evaluation runs on real production traffic. It represents the actual distribution of inputs your system receives, including the tail. User behavior signals - whether they continue the conversation, whether they copy the response, whether they immediately rephrase and retry - are the closest thing to ground truth you can get. They measure actual utility, not proxy quality.

What it cannot do: run before deployment. You cannot A/B test a new model version before it exists in production. You cannot get user feedback on a system that isn't running. Online evaluation is inherently reactive - you find out about failures after they've happened to real users.

The evaluation flywheel connects the two. Offline evaluation catches regressions before deployment. Online evaluation catches the failures that slip through. Those production failures are then added to the offline dataset, which improves offline coverage for the next release.

The Offline Evaluation Lifecycle

What Makes a Good Offline Dataset

The quality of your offline evaluation is bounded by the quality of your dataset. Most teams build mediocre datasets because they optimize for what's easy to curate (expected questions, typical scenarios) rather than what's most likely to reveal failures.

A high-quality offline dataset has five properties:

Diversity. It covers the full range of topics, intents, and phrasings that production users exhibit. If 20% of your production traffic is troubleshooting questions, 20% of your test cases should be troubleshooting questions.

Edge coverage. It deliberately includes inputs that are at the boundary of what the system should handle: questions that are ambiguous, questions that are almost but not quite in scope, questions that combine multiple topics, questions with typos or non-standard phrasing.

Adversarial cases. It includes inputs designed to trigger failures: jailbreak attempts, prompt injection attempts, inputs that closely resemble training data but with key facts changed, questions that sound straightforward but have counterintuitive correct answers.

Temporal coverage. It includes examples from different time periods. A dataset built entirely from month 1 of production will underrepresent the topics and phrasings that became common in month 6.

Diverse difficulty. It includes easy cases (where any reasonable system should pass), hard cases (where even expert humans disagree), and everything in between. A dataset of all easy cases produces artificially high pass rates that don't predict production performance.
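The diversity property can be checked mechanically once production requests and test cases share a label scheme. The sketch below is a minimal illustration, not a prescribed method: it assumes each item carries one intent tag, and the function names and the 5-percentage-point tolerance are hypothetical choices.

```python
from collections import Counter


def tag_shares(tags: list[str]) -> dict[str, float]:
    """Return each tag's share of the total, e.g. {"billing": 0.3, ...}."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def coverage_gaps(
    production_tags: list[str],
    dataset_tags: list[str],
    tolerance: float = 0.05,  # assumed: flag shortfalls above 5 percentage points
) -> dict[str, float]:
    """Map each under-represented tag to its shortfall (production share - dataset share)."""
    prod = tag_shares(production_tags)
    data = tag_shares(dataset_tags)
    gaps = {}
    for tag, prod_share in prod.items():
        shortfall = prod_share - data.get(tag, 0.0)
        if shortfall > tolerance:
            gaps[tag] = round(shortfall, 4)
    return gaps


# Hypothetical labels: one intent tag per sampled production request / test case.
production = ["troubleshooting"] * 20 + ["how_to"] * 50 + ["billing"] * 30
dataset = ["troubleshooting"] * 5 + ["how_to"] * 65 + ["billing"] * 30
gaps = coverage_gaps(production, dataset)
print(gaps)
```

Here troubleshooting makes up 20% of production traffic but only 5% of the test set, so it is the only tag reported. The same comparison can be run on difficulty labels to spot a dataset that is skewed toward easy cases.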

Regression Testing vs. Model Comparison

Offline evaluation serves two distinct purposes that require different dataset design:

Regression testing: does the new version maintain quality on cases the current version handles well? This requires a dataset that covers known-good scenarios - the cases where you are confident the current system is correct. Regressions show up as failures on cases that previously passed.

Model comparison: is the new version better than the current version on the full distribution? This requires a dataset that is representative of production, not just current successes. A new model might fix known failures while introducing new failure modes - you only catch this with a diverse evaluation set.

In practice, teams maintain two datasets: a regression test set (fixed, never modified, tests for backward compatibility) and a quality benchmark set (periodically refreshed from production examples, tests for overall quality).
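One way to combine the two datasets in a release decision is to apply a strict rule to the regression set and a comparative rule to the benchmark set. The sketch below assumes per-case pass/fail results keyed by case ID; the function name and both rules (zero regressions, benchmark pass rate no lower than baseline) are illustrative assumptions rather than a standard.

```python
def release_gate(
    baseline_passed: dict[str, bool],
    candidate_passed: dict[str, bool],
    baseline_benchmark_rate: float,
    candidate_benchmark_rate: float,
) -> dict:
    """Strict check on the regression set, comparative check on the benchmark set."""
    # Regression set: any case that passed before and fails (or is missing) now blocks the release.
    regressions = sorted(
        case_id
        for case_id, passed in baseline_passed.items()
        if passed and not candidate_passed.get(case_id, False)
    )
    # Benchmark set: the candidate must at least match the baseline pass rate.
    benchmark_ok = candidate_benchmark_rate >= baseline_benchmark_rate
    return {
        "regressions": regressions,
        "benchmark_ok": benchmark_ok,
        "ship": not regressions and benchmark_ok,
    }


# Hypothetical results: the candidate scores higher overall but breaks case "c2".
decision = release_gate(
    baseline_passed={"c1": True, "c2": True, "c3": False},
    candidate_passed={"c1": True, "c2": False, "c3": True},
    baseline_benchmark_rate=0.81,
    candidate_benchmark_rate=0.86,
)
print(decision)
```

The example shows why the two checks are kept separate: the candidate wins on the benchmark yet is still blocked, because a previously passing case now fails.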

Online Evaluation Strategies

A/B Testing

A/B testing is the gold standard for measuring the impact of a change on real user behavior. You split your traffic: a control group sees the current system, a treatment group sees the new system. After sufficient traffic, you compare the metrics between groups.

The key design decisions:

Randomization unit. Randomize at the user level (same user always sees the same variant), not the request level (same user might see both variants). User-level randomization prevents contamination: a user who sees both variants carries expectations from one into the other, which blurs the measured difference between groups.

Metric selection. Choose your primary metric before running the experiment, and commit to it. Post-hoc metric selection (picking the metric that shows the biggest effect after seeing the data) is p-hacking and produces false positives.

Statistical significance. Calculate the required sample size before starting. The required sample size depends on your baseline metric rate, the minimum effect size you care about detecting, your false positive rate (alpha, typically 0.05), and your statistical power (typically 0.80). Under-powered experiments produce inconclusive results.

Experiment duration. Run experiments for at least one full week to capture day-of-week effects. Users behave differently on weekdays vs. weekends. Running for only a few days can produce biased results.
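The sample-size and duration decisions interact: the required sample size sets a minimum run time, and the weekly-cycle rule rounds it up. A minimal sketch follows, assuming `eligible_users_per_day` counts new unique users entering the experiment each day (returning users add no new samples under user-level randomization). The function name and parameters are hypothetical.

```python
import math


def experiment_duration_days(
    required_per_variant: int,
    eligible_users_per_day: int,
    treatment_fraction: float = 0.5,
) -> int:
    """Days until the smaller arm reaches its sample size, rounded up to whole weeks."""
    # The arm with the smaller traffic share fills up last, so it sets the duration.
    smaller_arm_fraction = min(treatment_fraction, 1.0 - treatment_fraction)
    users_per_day = eligible_users_per_day * smaller_arm_fraction
    if users_per_day <= 0:
        raise ValueError("Both arms need a positive share of traffic")
    days = math.ceil(required_per_variant / users_per_day)
    # Round up to full weeks (minimum one) to cover day-of-week effects.
    return max(7, math.ceil(days / 7) * 7)


even_split = experiment_duration_days(5443, 1000)         # 50/50 split
small_arm = experiment_duration_days(5443, 1000, 0.10)    # 10% treatment arm
print(even_split, small_arm)
```

With 1,000 new users per day, a 50/50 split needs 11 days of traffic and is rounded up to 14, while a 10% treatment arm stretches the same experiment to 56 days.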

Shadow Evaluation

Shadow evaluation runs a new system silently alongside the production system. Users see the production system's response; the new system's response is captured but never shown. The two responses are then compared by an LLM judge to determine which would have been preferred.

Shadow evaluation is valuable when:

  • You want to evaluate a new system without exposing users to potential quality regressions
  • You need to evaluate a system that isn't safe to deploy yet (e.g., a model change with unknown safety properties)
  • You want to build up preference data before committing to a rollout

The limitation: shadow evaluation measures what the judge prefers, not what users prefer. If your judge has biases (verbosity bias, position bias), those biases show up in shadow evaluation results. Shadow evaluation should be calibrated against actual user preference signals before being trusted.
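Position bias in particular can be reduced before any calibration by asking the judge twice with the response order swapped and keeping only verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that returns `"first"`, `"second"`, or `"tie"`; the two toy judges stand in for an LLM call so the example runs offline.

```python
from typing import Callable

# Hypothetical judge signature: (question, first, second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, production: str, shadow: str) -> str:
    """Ask the judge twice with the order swapped; keep only order-independent verdicts."""
    forward = judge(question, production, shadow)  # production shown first
    reverse = judge(question, shadow, production)  # shadow shown first

    forward_winner = {"first": "production", "second": "shadow"}.get(forward, "tie")
    reverse_winner = {"first": "shadow", "second": "production"}.get(reverse, "tie")

    # A verdict that flips with the order reflects position, not preference.
    return forward_winner if forward_winner == reverse_winner else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_shorter(question: str, first: str, second: str) -> str:
    """Toy judge that picks the shorter response regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) < len(second) else "second"


print(debiased_preference(always_first, "q", "long answer", "short"))
print(debiased_preference(prefers_shorter, "q", "long answer", "short"))
```

The swap doubles judge cost and does nothing for verbosity bias (the second toy judge still wins on length alone), so it complements calibration against user signals rather than replacing it.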

Interleaving

Interleaving is an alternative to A/B testing for pairwise comparison. Instead of showing user A the control system and user B the treatment system, you interleave results from both systems for the same user in the same session.

A user searches for something and gets a mixed result list: some items from system A, some from system B. Which items the user clicks on reveals their preference. Because both systems are judged by the same user in the same session, session-level confounders (time of day, user intent, query difficulty) are held constant, whereas A/B testing can only average them out across users.

Interleaving is widely used in search and recommendation systems. It is more statistically efficient than A/B testing - it can detect smaller effects with less traffic - but it requires careful implementation to avoid showing users obviously inconsistent results.
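The text does not say which interleaving variant is meant. The sketch below uses team-draft interleaving, one commonly described variant: the two systems take turns contributing their best item not yet shown, a coin flip decides who goes first when they are level, and clicks are credited to the system that contributed the clicked item. Function names are hypothetical.

```python
import random


def team_draft_interleave(
    ranking_a: list[str],
    ranking_b: list[str],
    k: int,
    rng: random.Random,
) -> tuple[list[str], dict[str, str]]:
    """Build an interleaved list of up to k items and record which system contributed each."""
    interleaved: list[str] = []
    team_of: dict[str, str] = {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}

    while len(interleaved) < k:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        # That team contributes its highest-ranked item not already shown.
        candidate = next((x for x in rankings[team] if x not in team_of), None)
        if candidate is None:
            # This team is exhausted; let the other team fill in, or stop.
            team = "B" if team == "A" else "A"
            candidate = next((x for x in rankings[team] if x not in team_of), None)
            if candidate is None:
                break
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked: list[str], team_of: dict[str, str]) -> dict[str, int]:
    """Count clicks per contributing system; more clicks wins the session."""
    credit = {"A": 0, "B": 0}
    for item in clicked:
        if item in team_of:
            credit[team_of[item]] += 1
    return credit


shown, team_of = team_draft_interleave(["a", "b", "c"], ["b", "d", "a"], 4, random.Random(0))
credit = credit_clicks(["d"], team_of)
print(shown, credit)
```

Each system contributes two of the four slots whatever the coin flips are, and item `d` can only come from system B, so a click on it is credited to B. Aggregating per-session wins across many sessions gives the preference estimate.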

Canary Deployment

Canary deployment exposes a small fraction of traffic (typically 1-10%) to the new system before full rollout. It is not an evaluation strategy in isolation, but it is a critical risk mitigation that should accompany every significant system change.

The canary period (typically 24-72 hours) allows you to monitor online metrics for the treatment group before committing to full rollout. If metrics degrade during the canary period, you roll back immediately.
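A rollback decision of this kind is usually driven by guardrails on a few online metrics. The sketch below is one simple way to express that, assuming metrics arrive as plain rates for the control and canary groups; the function name and the 10% relative-degradation threshold are hypothetical. It also ignores sampling noise, which matters at canary traffic volumes.

```python
def canary_guardrail_breaches(
    control: dict[str, float],
    canary: dict[str, float],
    higher_is_better: dict[str, bool],
    max_relative_degradation: float = 0.10,  # assumed guardrail: 10% worse than control
) -> list[str]:
    """Return the metrics on which the canary breaches the guardrail (empty list = healthy)."""
    breaches = []
    for metric, better_when_higher in higher_is_better.items():
        base, observed = control[metric], canary[metric]
        if base == 0:
            continue  # no baseline to compare against
        relative_change = (observed - base) / base
        # A drop is bad for "higher is better" metrics; a rise is bad for the rest.
        degradation = -relative_change if better_when_higher else relative_change
        if degradation > max_relative_degradation:
            breaches.append(metric)
    return breaches


# Hypothetical 24-hour canary readings.
breaches = canary_guardrail_breaches(
    control={"copy_rate": 0.20, "rephrase_retry_rate": 0.08},
    canary={"copy_rate": 0.19, "rephrase_retry_rate": 0.11},
    higher_is_better={"copy_rate": True, "rephrase_retry_rate": False},
)
print(breaches)
```

Copy rate fell by 5%, which is within the guardrail, but the rephrase-retry rate rose by more than a third, so the canary would be rolled back on that metric alone.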

Signal Hierarchy: From Implicit to Explicit

Implicit Signals in Depth

Copy rate. When a user copies text from an AI response, they are signaling that the response contained something worth keeping. Copy events are logged by the front-end and correlated with response IDs. Copy rate is one of the highest-value implicit signals because copying takes effort - users only do it when they found something genuinely useful.

Session continuation rate. After an AI responds, does the user ask a follow-up question? Continuation signals engagement. A sudden drop in continuation rate after a model change often indicates that users are getting responses that feel like dead ends - responses that answer the literal question but don't invite the conversation to continue.

Rephrase-retry rate. If a user rephrases their question within 30 seconds of getting a response, the response likely didn't answer what they wanted. Rephrase-retry events are one of the clearest signals of failure - the user is telling you, through behavior rather than words, that the response missed.

Time-to-task-completion. For task-oriented systems (booking a flight, filing a support ticket, completing a code review), time-to-completion is a direct measure of efficiency. A model change that produces longer responses and needs three more conversational turns to complete a task is probably worse, not better, even if the individual responses score higher on quality metrics.
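Of these signals, rephrase-retry is the one that needs detection logic rather than a simple event hook, because the client has to decide whether a follow-up message is a rephrase at all. A minimal sketch, assuming plain lexical similarity is good enough: the 30-second window comes from the text, while the function name and the 0.6 similarity threshold are assumptions that would need tuning on labeled sessions.

```python
from difflib import SequenceMatcher


def is_rephrase_retry(
    previous_message: str,
    next_message: str,
    seconds_after_response: float,
    max_delay_s: float = 30.0,    # window described in the text
    min_similarity: float = 0.6,  # assumed threshold; tune on labeled sessions
) -> bool:
    """Flag a follow-up that arrives quickly and closely resembles the previous question."""
    if seconds_after_response > max_delay_s:
        return False
    a = previous_message.lower().strip()
    b = next_message.lower().strip()
    if a == b:
        return True
    # ratio() is 2 * matches / total characters, in [0, 1].
    return SequenceMatcher(None, a, b).ratio() >= min_similarity


first = "how do i reset my password"
quick_rephrase = is_rephrase_retry(first, "how can i reset my password?", 12.0)
slow_rephrase = is_rephrase_retry(first, "how can i reset my password?", 45.0)
acknowledgement = is_rephrase_retry(first, "thanks", 5.0)
print(quick_rephrase, slow_rephrase, acknowledgement)
```

Lexical similarity misses rephrases that change most of the wording; an embedding-based comparison is a common alternative when that matters.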

Production Code: Offline and Online Evaluation Infrastructure

import anthropic
import asyncio
import hashlib
import json
import math
import random
import re
import statistics
import time
from dataclasses import dataclass, field
from typing import Optional, Callable

client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()


# ---------------------------------------------------------------------------
# Data Structures
# ---------------------------------------------------------------------------

@dataclass
class TestCase:
    question: str
    ground_truth: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    difficulty: str = "medium"  # easy | medium | hard | adversarial
    case_id: Optional[str] = None

    def __post_init__(self):
        # Derive a stable short ID from the question text when none is given
        if self.case_id is None:
            self.case_id = hashlib.md5(self.question.encode()).hexdigest()[:8]


@dataclass
class OfflineEvalResult:
    case_id: str
    question: str
    response: str
    score: float
    passed: bool
    failure_modes: list[str]
    latency_ms: float
    metadata: dict = field(default_factory=dict)


@dataclass
class OnlineEvent:
    event_type: str  # response_shown | copy | continue | rephrase_retry | thumbs_up | thumbs_down
    session_id: str
    response_id: str
    timestamp: float
    metadata: dict = field(default_factory=dict)


@dataclass
class ComparisonReport:
    baseline_name: str
    candidate_name: str
    baseline_pass_rate: float
    candidate_pass_rate: float
    delta: float
    regressions: list[str]  # case IDs where baseline passed, candidate failed
    improvements: list[str]  # case IDs where baseline failed, candidate passed
    recommended: str


# ---------------------------------------------------------------------------
# Simple LLM Judge for offline use
# ---------------------------------------------------------------------------

def _llm_judge_score(question: str, response: str, ground_truth: Optional[str] = None) -> tuple[float, str]:
    """Quick LLM-based quality score using claude-haiku-4-5-20251001."""
    if ground_truth:
        prompt = f"""Evaluate this response for correctness and helpfulness.
Question: {question}
Expected answer (reference): {ground_truth}
Actual response: {response}

Score 0.0-1.0 where:
1.0 = correct, complete, helpful
0.7 = mostly correct with minor issues
0.4 = partially correct or relevant
0.0 = wrong, harmful, or irrelevant

Reply with JSON: {{"score": <float>, "reasoning": "<one sentence>"}}"""
    else:
        prompt = f"""Evaluate this response for relevance and helpfulness.
Question: {question}
Response: {response}

Score 0.0-1.0 where 1.0 = highly relevant and helpful.
Reply with JSON: {{"score": <float>, "reasoning": "<one sentence>"}}"""

    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    content = msg.content[0].text
    # Extract the first flat JSON object from the reply
    match = re.search(r'\{[^}]+\}', content, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            return float(parsed.get("score", 0.5)), parsed.get("reasoning", "")
        except (json.JSONDecodeError, ValueError):
            pass
    # Neutral fallback when the judge reply cannot be parsed
    return 0.5, "Could not parse judge response"


# ---------------------------------------------------------------------------
# Offline Evaluation Suite
# ---------------------------------------------------------------------------

class OfflineEvaluationSuite:
    """Run structured evaluation against a static dataset."""

    def __init__(
        self,
        system_under_test: Callable[[str], str],
        pass_threshold: float = 0.70,
    ):
        self.sut = system_under_test
        self.pass_threshold = pass_threshold

    def _evaluate_single(self, test_case: TestCase) -> OfflineEvalResult:
        start = time.time()
        try:
            response = self.sut(test_case.question)
        except Exception as e:
            # A crash in the system under test counts as a failed case
            return OfflineEvalResult(
                case_id=test_case.case_id,
                question=test_case.question,
                response="",
                score=0.0,
                passed=False,
                failure_modes=["system_error"],
                latency_ms=(time.time() - start) * 1000,
                metadata={"error": str(e)},
            )
        latency_ms = (time.time() - start) * 1000

        score, reasoning = _llm_judge_score(
            test_case.question, response, test_case.ground_truth
        )

        # Basic rule checks
        failure_modes = []
        if len(response.split()) < 5:
            failure_modes.append("too_short")
            score = min(score, 0.3)
        if len(response.split()) > 600:
            failure_modes.append("too_long")
        if score < self.pass_threshold:
            failure_modes.append("quality_below_threshold")

        return OfflineEvalResult(
            case_id=test_case.case_id,
            question=test_case.question,
            response=response,
            score=score,
            passed=score >= self.pass_threshold and "too_short" not in failure_modes,
            failure_modes=failure_modes,
            latency_ms=latency_ms,
            metadata={"reasoning": reasoning, "tags": test_case.tags},
        )

    def run(self, test_cases: list[TestCase]) -> dict:
        """Run evaluation on a set of test cases and return aggregate report."""
        results = [self._evaluate_single(tc) for tc in test_cases]
        passed = [r for r in results if r.passed]
        failed = [r for r in results if not r.passed]

        # Score by difficulty
        difficulty_scores = {}
        for tc, result in zip(test_cases, results):
            d = tc.difficulty
            if d not in difficulty_scores:
                difficulty_scores[d] = []
            difficulty_scores[d].append(result.score)

        return {
            "total": len(results),
            "passed": len(passed),
            "failed": len(failed),
            "pass_rate": len(passed) / len(results) if results else 0.0,
            "mean_score": statistics.mean(r.score for r in results) if results else 0.0,
            "mean_latency_ms": statistics.mean(r.latency_ms for r in results) if results else 0.0,
            "difficulty_breakdown": {
                d: {
                    "mean_score": statistics.mean(scores),
                    "n": len(scores),
                }
                for d, scores in difficulty_scores.items()
            },
            "failure_mode_counts": _count_failure_modes(results),
            "results": results,
        }

    def compare_versions(
        self,
        baseline_sut: Callable[[str], str],
        candidate_sut: Callable[[str], str],
        test_cases: list[TestCase],
        baseline_name: str = "baseline",
        candidate_name: str = "candidate",
    ) -> ComparisonReport:
        """Compare two system versions on the same test set."""
        baseline_suite = OfflineEvaluationSuite(baseline_sut, self.pass_threshold)
        candidate_suite = OfflineEvaluationSuite(candidate_sut, self.pass_threshold)

        print(f"Running baseline ({baseline_name})...")
        baseline_report = baseline_suite.run(test_cases)
        print(f"Running candidate ({candidate_name})...")
        candidate_report = candidate_suite.run(test_cases)

        baseline_results = {r.case_id: r for r in baseline_report["results"]}
        candidate_results = {r.case_id: r for r in candidate_report["results"]}

        regressions = []
        improvements = []
        for case_id in baseline_results:
            b = baseline_results[case_id]
            c = candidate_results.get(case_id)
            if c is None:
                continue
            if b.passed and not c.passed:
                regressions.append(case_id)
            elif not b.passed and c.passed:
                improvements.append(case_id)

        baseline_pass_rate = baseline_report["pass_rate"]
        candidate_pass_rate = candidate_report["pass_rate"]
        delta = candidate_pass_rate - baseline_pass_rate

        # Recommend candidate if: improves pass rate by >2% AND no major regressions
        regression_rate = len(regressions) / len(test_cases) if test_cases else 0
        recommended = candidate_name if (delta > 0.02 and regression_rate < 0.05) else baseline_name

        return ComparisonReport(
            baseline_name=baseline_name,
            candidate_name=candidate_name,
            baseline_pass_rate=baseline_pass_rate,
            candidate_pass_rate=candidate_pass_rate,
            delta=delta,
            regressions=regressions,
            improvements=improvements,
            recommended=recommended,
        )

    def regression_check(
        self,
        current_results: dict,
        historical_pass_rates: list[float],
        threshold_std: float = 2.0,
    ) -> list[str]:
        """Flag a regression if current pass rate is more than N std devs below historical mean."""
        if not historical_pass_rates:
            return []
        mean = statistics.mean(historical_pass_rates)
        # With a single historical run there is no spread to estimate; fall back to 0.05
        std = statistics.stdev(historical_pass_rates) if len(historical_pass_rates) > 1 else 0.05
        current = current_results["pass_rate"]
        regressions = []
        if current < mean - threshold_std * std:
            regressions.append(
                f"Pass rate regression: {current:.2%} vs. historical mean {mean:.2%} "
                f"(>{threshold_std:.1f} std devs below)"
            )
        return regressions


def _count_failure_modes(results: list[OfflineEvalResult]) -> dict[str, int]:
    counts = {}
    for r in results:
        for mode in r.failure_modes:
            counts[mode] = counts.get(mode, 0) + 1
    # Most frequent failure modes first
    return dict(sorted(counts.items(), key=lambda x: -x[1]))


# ---------------------------------------------------------------------------
# Online Metric Collector
# ---------------------------------------------------------------------------

class OnlineMetricCollector:
    """
    Collect and compute online signals from user behavior.
    In production this would write to/read from a database.
    This implementation uses an in-memory store for illustration.
    """

    def __init__(self):
        self._events: list[OnlineEvent] = []

    def record_event(self, event: OnlineEvent):
        self._events.append(event)

    def record_response_shown(self, session_id: str, response_id: str):
        self.record_event(OnlineEvent("response_shown", session_id, response_id, time.time()))

    def record_copy(self, session_id: str, response_id: str):
        self.record_event(OnlineEvent("copy", session_id, response_id, time.time()))

    def record_continuation(self, session_id: str, response_id: str):
        self.record_event(OnlineEvent("continue", session_id, response_id, time.time()))

    def record_rephrase_retry(self, session_id: str, response_id: str, delay_seconds: float):
        self.record_event(OnlineEvent(
            "rephrase_retry", session_id, response_id, time.time(),
            metadata={"delay_seconds": delay_seconds},
        ))

    def record_thumbs_up(self, session_id: str, response_id: str):
        self.record_event(OnlineEvent("thumbs_up", session_id, response_id, time.time()))

    def record_thumbs_down(self, session_id: str, response_id: str):
        self.record_event(OnlineEvent("thumbs_down", session_id, response_id, time.time()))

    def _events_in_window(self, window_hours: float) -> list[OnlineEvent]:
        cutoff = time.time() - window_hours * 3600
        return [e for e in self._events if e.timestamp >= cutoff]

    def compute_copy_rate(self, window_hours: float = 24.0) -> float:
        events = self._events_in_window(window_hours)
        responses_shown = sum(1 for e in events if e.event_type == "response_shown")
        copies = sum(1 for e in events if e.event_type == "copy")
        return copies / responses_shown if responses_shown > 0 else 0.0

    def compute_session_continuation_rate(self, window_hours: float = 24.0) -> float:
        events = self._events_in_window(window_hours)
        # Group by session
        sessions: dict[str, set] = {}
        for e in events:
            if e.session_id not in sessions:
                sessions[e.session_id] = set()
            sessions[e.session_id].add(e.event_type)
        if not sessions:
            return 0.0
        sessions_with_continuation = sum(
            1 for event_types in sessions.values() if "continue" in event_types
        )
        return sessions_with_continuation / len(sessions)

    def compute_rephrase_retry_rate(self, window_hours: float = 24.0, max_delay_s: float = 30.0) -> float:
        events = self._events_in_window(window_hours)
        responses_shown = sum(1 for e in events if e.event_type == "response_shown")
        # Only retries that arrive within max_delay_s count as failure signals
        quick_retries = sum(
            1 for e in events
            if e.event_type == "rephrase_retry"
            and e.metadata.get("delay_seconds", 999) <= max_delay_s
        )
        return quick_retries / responses_shown if responses_shown > 0 else 0.0

    def compute_explicit_feedback_rate(self, window_hours: float = 24.0) -> dict:
        events = self._events_in_window(window_hours)
        responses_shown = sum(1 for e in events if e.event_type == "response_shown")
        thumbs_up = sum(1 for e in events if e.event_type == "thumbs_up")
        thumbs_down = sum(1 for e in events if e.event_type == "thumbs_down")
        total_explicit = thumbs_up + thumbs_down
        return {
            "thumbs_up": thumbs_up,
            "thumbs_down": thumbs_down,
            "thumbs_up_rate": thumbs_up / responses_shown if responses_shown > 0 else 0.0,
            "thumbs_down_rate": thumbs_down / responses_shown if responses_shown > 0 else 0.0,
            "positive_rate": thumbs_up / total_explicit if total_explicit > 0 else 0.5,
            "feedback_response_rate": total_explicit / responses_shown if responses_shown > 0 else 0.0,
        }

    def get_dashboard(self, window_hours: float = 24.0) -> dict:
        return {
            "window_hours": window_hours,
            "copy_rate": self.compute_copy_rate(window_hours),
            "session_continuation_rate": self.compute_session_continuation_rate(window_hours),
            "rephrase_retry_rate": self.compute_rephrase_retry_rate(window_hours),
            "explicit_feedback": self.compute_explicit_feedback_rate(window_hours),
        }


# ---------------------------------------------------------------------------
# A/B Test Designer
# ---------------------------------------------------------------------------

class ABTestDesigner:
    """Design and analyze A/B tests for AI system changes."""

    @staticmethod
    def compute_required_sample_size(
        baseline_rate: float,
        minimum_detectable_effect: float,
        alpha: float = 0.05,
        power: float = 0.80,
    ) -> int:
        """
        Compute the required number of samples per variant for a two-proportion z-test.

        baseline_rate: Expected metric rate in control (e.g., 0.12 for 12% thumbs-up rate)
        minimum_detectable_effect: Smallest relative change you want to detect (e.g., 0.10 for 10%)
        alpha: False positive rate (typically 0.05)
        power: Statistical power (typically 0.80)
        """
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)

        # Z-scores derived from alpha (two-tailed) and power, so both arguments
        # take effect: alpha=0.05 gives ~1.96 and power=0.80 gives ~0.842
        normal = statistics.NormalDist()
        z_alpha = normal.inv_cdf(1 - alpha / 2)
        z_power = normal.inv_cdf(power)

        p_bar = (p1 + p2) / 2
        numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
                     z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        denominator = (p2 - p1) ** 2
        return math.ceil(numerator / denominator)

    @staticmethod
    def assign_variant(user_id: str, experiment_id: str, traffic_fraction: float = 0.5) -> str:
        """
        Deterministically assign a user to control or treatment.
        Same user always gets the same variant for the same experiment.
        """
        hash_input = f"{user_id}:{experiment_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 1000) / 1000.0  # 0.000 to 0.999
        if bucket < traffic_fraction:
            return "treatment"
        return "control"

    @staticmethod
    def two_proportion_z_test(n1: int, p1: float, n2: int, p2: float) -> dict:
        """
        Test whether two observed proportions are significantly different.
        (n1, p1) is the control group and (n2, p2) the treatment group.
        Returns z-score, two-sided p-value, and significance conclusion.
        """
        # Returned when the test cannot be computed; same keys as the normal result
        no_result = {
            "z_score": 0, "p_value": 1.0, "significant": False,
            "direction": "none", "relative_lift": 0,
        }
        if n1 == 0 or n2 == 0:
            return no_result

        p_pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
        if se == 0:
            return no_result

        z = (p1 - p2) / se
        # Exact two-sided p-value from the standard normal CDF
        p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))

        return {
            "z_score": z,
            "p_value": p_value,
            "significant": p_value <= 0.05,
            "direction": "treatment_better" if p2 > p1 else "control_better",
            "relative_lift": (p2 - p1) / p1 if p1 > 0 else 0,
        }

    def analyze_experiment(
        self,
        control_metrics: dict,
        treatment_metrics: dict,
        metric_name: str = "thumbs_up_rate",
    ) -> dict:
        """
        Analyze the results of a completed A/B test.

        control_metrics / treatment_metrics: dicts with keys 'n' (sample size)
        and the metric value (e.g., 'thumbs_up_rate': 0.12)
        """
        n_control = control_metrics["n"]
        n_treatment = treatment_metrics["n"]
        p_control = control_metrics.get(metric_name, 0)
        p_treatment = treatment_metrics.get(metric_name, 0)

        test_result = self.two_proportion_z_test(n_control, p_control, n_treatment, p_treatment)

        return {
            "metric": metric_name,
            "control": {"n": n_control, "rate": p_control},
            "treatment": {"n": n_treatment, "rate": p_treatment},
            "statistical_test": test_result,
            "recommendation": (
                "ship_treatment" if (test_result["significant"] and p_treatment > p_control)
                else "no_ship" if (test_result["significant"] and p_treatment < p_control)
                else "inconclusive"
            ),
        }


# ---------------------------------------------------------------------------
# Shadow Evaluator
# ---------------------------------------------------------------------------

class ShadowEvaluator:
    """
    Run a shadow system alongside production and compare responses.
    The shadow system's responses are never shown to users.
    """

    def __init__(self, production_sut: Callable, shadow_sut: Callable):
        self.production = production_sut
        self.shadow = shadow_sut
        self._comparisons: list[dict] = []

    def run_shadow(self, question: str) -> dict:
        """Run both systems and compare responses without showing shadow to user."""
        prod_response = self.production(question)
        shadow_response = self.shadow(question)

        preference = self._compare_responses(question, prod_response, shadow_response)
        result = {
            "question": question[:80],
            "production_response": prod_response[:200],
            "shadow_response": shadow_response[:200],
            "preference": preference,
        }
        self._comparisons.append(result)
        return result

    def _compare_responses(self, question: str, prod: str, shadow: str) -> str:
        """Use LLM judge to compare responses. Returns 'production', 'shadow', or 'tie'."""
        prompt = f"""You are evaluating two AI responses. Which is better?

Question: {question}

Response A: {prod[:400]}

Response B: {shadow[:400]}

Which response is better overall? Consider accuracy, relevance, helpfulness, and conciseness.
Reply with exactly one word: A, B, or tie."""
        msg = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=5,
            messages=[{"role": "user", "content": prompt}],
        )
        verdict = msg.content[0].text.strip().upper()
        if verdict == "A":
            return "production"
        elif verdict == "B":
            return "shadow"
        # Anything other than a clean A or B is treated as a tie
        return "tie"

    def aggregate_results(self) -> dict:
        if not self._comparisons:
            return {"n": 0, "shadow_preference_rate": 0.0}
        total = len(self._comparisons)
        prod_wins = sum(1 for c in self._comparisons if c["preference"] == "production")
        shadow_wins = sum(1 for c in self._comparisons if c["preference"] == "shadow")
        ties = total - prod_wins - shadow_wins
        return {
            "n": total,
            "production_wins": prod_wins,
            "shadow_wins": shadow_wins,
            "ties": ties,
            "shadow_preference_rate": shadow_wins / total,
            "production_preference_rate": prod_wins / total,
            "recommendation": "promote_shadow" if shadow_wins / total > 0.55 else "keep_production",
        }


# ---------------------------------------------------------------------------
# The Evaluation Flywheel
# ---------------------------------------------------------------------------

class EvaluationFlywheel:
    """
    Closes the loop between online failures and offline test coverage.
    Samples failures from production, adds them to the offline dataset,
    and triggers re-evaluation.
    """

    def __init__(self, offline_suite: OfflineEvaluationSuite, online_collector: OnlineMetricCollector):
        self.offline_suite = offline_suite
        self.online = online_collector
        self.offline_dataset: list[TestCase] = []
        self.failure_log: list[dict] = []

    def log_production_failure(
        self,
        question: str,
        response: str,
        failure_reason: str,
        source: str = "user_report",
    ):
        """Log a production failure for potential addition to the offline dataset."""
        self.failure_log.append({
            "question": question,
            "response": response,
            "failure_reason": failure_reason,
            "source": source,
            "timestamp": time.time(),
        })

    def sample_and_add_failures(self, n: int = 50) -> int:
        """
        Sample n failures from the production failure log and add to offline dataset.
        Deduplicates by exact question text before adding.
        Returns the number of new cases added.
        """
        if not self.failure_log:
            return 0

        sample_size = min(n, len(self.failure_log))
        sampled = random.sample(self.failure_log, sample_size)

        existing_questions = {tc.question for tc in self.offline_dataset}
        added = 0

        for failure in sampled:
            question = failure["question"]
            if question not in existing_questions:
                self.offline_dataset.append(TestCase(
                    question=question,
                    tags=["production_failure", failure["source"]],
                    difficulty="hard",
                ))
                existing_questions.add(question)
                added += 1

        return added

    def run_flywheel_cycle(self, n_failures_to_add: int = 50) -> dict:
        """
        Complete one flywheel cycle:
        1. Sample failures from production and add them to the offline dataset
        2. Run offline evaluation on the full dataset
        3. Fetch online metrics
        4. Return combined report
        """
        print("Step 1: Sampling production failures...")
        added = self.sample_and_add_failures(n_failures_to_add)
        print(f"  Added {added} new test cases from production failures")

        if not self.offline_dataset:
            return {"error": "No offline test cases available"}

        print("Step 2: Running offline evaluation on full dataset...")
        report = self.offline_suite.run(self.offline_dataset)

        print("Step 3: Fetching online metrics...")
        online_dashboard = self.online.get_dashboard(window_hours=24)

        return {
            "offline_report": {
                "total_cases": report["total"],
                "pass_rate": report["pass_rate"],
                "mean_score": report["mean_score"],
                "failure_modes": report["failure_mode_counts"],
            },
            "online_metrics": online_dashboard,
            "dataset_growth": {
                "total_cases": len(self.offline_dataset),
                "newly_added": added,
            },
        }


# ---------------------------------------------------------------------------
# Demo Usage
# ---------------------------------------------------------------------------

def make_demo_sut(model: str = "claude-haiku-4-5-20251001"):
    """Create a simple system-under-test that calls Claude."""
    def sut(question: str) -> str:
        msg = client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{"role": "user", "content": question}],
        )
        return msg.content[0].text
    return sut


if __name__ == "__main__":
    designer = ABTestDesigner()

    # How large does the test need to be?
    sample_size = designer.compute_required_sample_size(
        baseline_rate=0.12,              # 12% thumbs-up rate
        minimum_detectable_effect=0.15,  # want to detect 15% relative improvement
    )
    print(f"Required sample size per variant: {sample_size:,}")

    # Simulate experiment results
    result = designer.analyze_experiment(
        control_metrics={"n": sample_size, "thumbs_up_rate": 0.120},
        treatment_metrics={"n": sample_size, "thumbs_up_rate": 0.135},
        metric_name="thumbs_up_rate",
    )
    print(f"Experiment recommendation: {result['recommendation']}")
    print(f"Relative lift: {result['statistical_test']['relative_lift']:.1%}")
    print(f"Statistically significant: {result['statistical_test']['significant']}")

    # Demonstrate variant assignment
    for uid in ["user_001", "user_002", "user_003"]:
        variant = designer.assign_variant(uid, "experiment_v2_prompts")
        print(f"  {uid} -> {variant}")

The Evaluation Flywheel in Full

Production Notes

:::tip Refresh Your Dataset Quarterly
Production input distributions shift over time. A dataset built 12 months ago may significantly underrepresent the topics and phrasings that are common today. Schedule a quarterly dataset refresh: sample 100-200 recent production examples, evaluate them manually, and add the interesting ones to your test set.
:::

:::warning Statistical Significance Is Not Business Significance
A 2% lift in thumbs-up rate might be statistically significant with 50,000 samples, but it might not be worth the engineering cost, prompt complexity, or latency regression it introduced. Always pair statistical significance with business significance: is this improvement large enough to matter? Set minimum detectable effect thresholds based on what would actually change user behavior, not what your experiment can detect.
:::
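One way to keep the two questions separate is to compute them separately and require both before shipping. The sketch below is illustrative only: `evaluate_lift` and the `min_relative_lift` threshold are hypothetical names, not part of the lesson's `ABTestDesigner`, and the counts are made-up numbers chosen to show a result that clears the statistical bar but not the business bar.

```python
from statistics import NormalDist


def evaluate_lift(
    control_successes: int,
    control_n: int,
    treatment_successes: int,
    treatment_n: int,
    min_relative_lift: float,  # hypothetical business threshold, e.g. 0.10 = 10%
    alpha: float = 0.05,
) -> dict:
    """Two-sided two-proportion z-test plus a separate business-threshold check."""
    p_c = control_successes / control_n
    p_t = treatment_successes / treatment_n

    # Pooled standard error under the null hypothesis that both rates are equal.
    pooled = (control_successes + treatment_successes) / (control_n + treatment_n)
    se = (pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n)) ** 0.5
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    relative_lift = (p_t - p_c) / p_c
    stat_sig = p_value < alpha
    biz_sig = relative_lift >= min_relative_lift
    return {
        "relative_lift": relative_lift,
        "p_value": p_value,
        "statistically_significant": stat_sig,
        "business_significant": biz_sig,
        "ship": stat_sig and biz_sig,
    }


if __name__ == "__main__":
    # Illustrative counts: 12.0% vs 12.5% thumbs-up rate, 50,000 users per arm.
    result = evaluate_lift(6000, 50_000, 6250, 50_000, min_relative_lift=0.10)
    for key, value in result.items():
        print(f"{key}: {value}")
```

With these inputs the test is significant at the 0.05 level, yet the relative lift is only about 4%, below the assumed 10% bar, so `ship` is false.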

:::danger Do Not Use Your Offline Dataset as a Training Signal
If your offline evaluation dataset gets used as fine-tuning data or RLHF examples, it is no longer a valid evaluation set. The model will overfit to the test cases. Keep evaluation datasets strictly separate from training data. When in doubt, create a new held-out evaluation set.
:::

:::tip Instrument Copy Events From Day One
Copy-to-clipboard events are one of the highest-value implicit signals and are trivially easy to instrument in the front end. If you haven't instrumented them yet, do it today. The signal is immediately useful even before you have enough data for A/B tests.
:::

Interview Q&A

Q1: What is the fundamental difference between offline and online evaluation, and why do you need both?

Offline evaluation runs on a curated static dataset before deployment. It is fast, reproducible, and comparable across versions. It lets you catch regressions before users see them. But it is bounded by the quality of your dataset - it measures performance on inputs you thought to test, not inputs users actually send. Online evaluation runs on real production traffic and captures the true input distribution, including the long tail. But it is inherently reactive - you find out about failures after they've happened. You need both because offline evaluation is your safety net before deployment and online evaluation is your ground truth about what actually matters in production. The evaluation flywheel connects them: online failures become offline test cases, improving future coverage.

Q2: Walk me through the design of an A/B test for an LLM system change. What are the key decisions?

Five key decisions. First, randomization unit: randomize at the user level, not the request level, so the same user always sees the same variant and you avoid cross-contamination. Second, primary metric: choose one primary metric before running the experiment and commit to it. Post-hoc metric selection is p-hacking. Third, sample size: compute the required sample size per variant before starting, using a two-proportion z-test formula with your baseline rate, minimum detectable effect, alpha (0.05), and power (0.80). Underpowered experiments produce inconclusive results. Fourth, duration: run for at least seven days to capture day-of-week effects. Fifth, guardrail metrics: define metrics that, if they degrade, trigger an automatic halt regardless of the primary metric - typically safety-related metrics or hard error rates.
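The first and third decisions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lesson's `ABTestDesigner` implementation: `required_sample_size_per_variant` and `hash_assign_variant` are hypothetical helper names, the sample-size calculation uses the standard normal-approximation formula for comparing two proportions, and the assignment assumes a 50/50 split keyed on a hash of experiment name and user ID.

```python
import hashlib
import math
from statistics import NormalDist


def required_sample_size_per_variant(
    baseline_rate: float,
    relative_mde: float,   # minimum detectable effect, relative (0.15 = 15%)
    alpha: float = 0.05,   # two-sided significance level
    power: float = 0.80,
) -> int:
    """Approximate per-variant n for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def hash_assign_variant(user_id: str, experiment_name: str) -> str:
    """Deterministic user-level assignment: same user, same variant, every request."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


if __name__ == "__main__":
    n = required_sample_size_per_variant(baseline_rate=0.12, relative_mde=0.15)
    print(f"Users needed per variant: {n:,}")
    for uid in ["user_001", "user_002", "user_003"]:
        print(uid, hash_assign_variant(uid, "experiment_v2_prompts"))
```

Including the experiment name in the hash input means a user's bucket in one experiment does not determine their bucket in the next. Halving the detectable effect roughly quadruples the required sample, which is why the effect size should be fixed before the experiment starts.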

Q3: What are the most valuable implicit online signals for an AI assistant, and why?

Copy rate and rephrase-retry rate are the two most informative. Copy rate measures whether users found something worth keeping - it requires deliberate action and is a strong signal of utility. Rephrase-retry rate (user rephrases their question within 30 seconds of a response) is a strong failure signal - the user is telling you through behavior that the response didn't answer what they wanted. Session continuation rate (does the user ask a follow-up?) is useful as a secondary signal. Thumbs-up/down are explicit but suffer from selection bias: users who feel strongly about a response (in either direction) are more likely to leave feedback than users who had a neutral experience. Aggregate thumbs-up rates are noisy at low volumes and require large samples to be statistically useful.
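As a rough sketch of how these two signals could be derived from a front-end event log: the `Event` schema and the event kinds below are assumptions made for illustration, and the retry heuristic only checks timing (a new query within the window after a response). It does not verify that the new query is actually a rephrase, which would need a text-similarity check on top.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    session_id: str
    kind: str         # assumed kinds: "query", "response", "copy"
    timestamp: float  # seconds since some epoch


def implicit_signal_rates(events: list[Event], retry_window_s: float = 30.0) -> dict:
    """Per-response copy rate and quick-retry rate across all sessions."""
    by_session: dict[str, list[Event]] = defaultdict(list)
    for event in events:
        by_session[event.session_id].append(event)

    responses = copied = retried = 0
    for session_events in by_session.values():
        session_events.sort(key=lambda e: e.timestamp)
        last_response = None  # most recent response not yet followed by a query
        last_copied = False   # whether that response was already counted as copied
        for event in session_events:
            if event.kind == "response":
                responses += 1
                last_response, last_copied = event, False
            elif last_response is None:
                continue  # nothing to attribute this event to
            elif event.kind == "copy" and not last_copied:
                copied += 1
                last_copied = True
            elif event.kind == "query":
                if event.timestamp - last_response.timestamp <= retry_window_s:
                    retried += 1
                last_response = None  # the response has now been followed up

    if responses == 0:
        return {"copy_rate": 0.0, "retry_rate": 0.0}
    return {"copy_rate": copied / responses, "retry_rate": retried / responses}


if __name__ == "__main__":
    log = [
        Event("s1", "query", 0), Event("s1", "response", 2), Event("s1", "copy", 10),
        Event("s1", "query", 100), Event("s1", "response", 102),
        Event("s1", "query", 110), Event("s1", "response", 112),
        Event("s2", "query", 0), Event("s2", "response", 1),
    ]
    print(implicit_signal_rates(log))
```

Both rates use responses as the denominator, so they stay comparable across variants with different traffic volumes.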

Q4: What is shadow evaluation and when should you use it?

Shadow evaluation runs a new system silently alongside the production system. The new system processes every request but its responses are never shown to users - only the production system's responses are shown. The shadow responses are compared to production responses by an LLM judge to build up a preference dataset before committing to a rollout. Use shadow evaluation when: (1) you want to evaluate a new system without exposing users to potential regressions, (2) the new system has unknown safety properties and you need to screen its outputs before showing them, or (3) you want a large preference dataset to guide the rollout decision. The key limitation: shadow evaluation measures what an LLM judge prefers, not what users prefer. Calibrate your shadow judge against real user preference signals before trusting it.
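The request path described above might look like the following sketch. All names here (`handle_with_shadow`, `shadow_win_rate`, the log record fields) are hypothetical, and the judge is passed in as a plain callable so the example stays independent of any particular LLM client. In a real deployment the shadow call would normally run asynchronously so it adds no user-facing latency; it is inline here only to keep the example short.

```python
from typing import Callable

System = Callable[[str], str]
# Assumed judge contract: returns "production", "shadow", or "tie".
Judge = Callable[[str, str, str], str]


def handle_with_shadow(
    question: str,
    production_system: System,
    shadow_system: System,
    judge: Judge,
    preference_log: list[dict],
) -> str:
    """Serve the production response; evaluate the shadow response silently."""
    prod_response = production_system(question)

    try:
        shadow_response = shadow_system(question)
        preferred = judge(question, prod_response, shadow_response)
        preference_log.append({
            "question": question,
            "production": prod_response,
            "shadow": shadow_response,
            "preferred": preferred,
        })
    except Exception as exc:
        # A shadow failure must never affect what the user sees.
        preference_log.append({"question": question, "error": repr(exc)})

    return prod_response  # the shadow response is never returned


def shadow_win_rate(preference_log: list[dict]) -> float:
    """Fraction of judged comparisons where the judge preferred the shadow system."""
    judged = [row for row in preference_log if "preferred" in row]
    if not judged:
        return 0.0
    return sum(row["preferred"] == "shadow" for row in judged) / len(judged)


if __name__ == "__main__":
    log: list[dict] = []
    answer = handle_with_shadow(
        "How do I rotate an API key?",
        production_system=lambda q: "production answer",
        shadow_system=lambda q: "shadow answer",
        judge=lambda q, a, b: "shadow",  # stub judge for the demo
        preference_log=log,
    )
    print(answer, shadow_win_rate(log))
```

Logging shadow errors alongside preferences is a deliberate choice: crash rate on real traffic is itself evidence for the rollout decision.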

Q5: How do you close the offline-online evaluation gap?

The gap exists because your offline dataset was built from inputs you anticipated, not inputs users actually send. Three practices close it. First, systematic sampling of production failures: when users report problems or metrics degrade, capture those inputs and add them to your offline dataset after manual review. Second, diversity sampling: periodically sample random production requests (not just failure cases) and add them to the test set to improve distributional coverage. Third, temporal refresh: rebuild or augment your dataset every quarter to capture shifts in input distribution. The key discipline is that every production failure should eventually become a test case - if a failure happened once, it can happen again, and you want your offline evaluation to catch it before it reaches users.

Q6: Explain the evaluation flywheel concept. How does it connect offline and online evaluation over time?

The evaluation flywheel is a continuous improvement loop that connects the two halves of evaluation. It works like this: your offline test set gates each release; you deploy to production; your online monitoring catches failures that slip through; you sample those failures and add them to your offline dataset; the next release is evaluated against a richer dataset that covers the failures the previous release caused; better offline coverage catches more failures; fewer failures reach production; repeat. The flywheel has a compounding effect: each release cycle, your offline coverage improves, which means fewer and fewer failures reach production. The teams that implement the flywheel well find that the quality ceiling of their offline evaluation rises continuously, tracking the actual complexity of their production traffic distribution.

The next lesson addresses the biggest scalability challenge in evaluation: replacing expensive human judgment with calibrated, bias-corrected LLM judges that can evaluate thousands of responses a day at a fraction of the cost.

© 2026 EngineersOfAI. All rights reserved.