Measuring AI Product Quality
The Dashboard Illusion
The product manager opens the monitoring dashboard at 9am on a Tuesday and feels good. Uptime: 99.94%. Average response time: 1.8 seconds. Messages processed today: 14,342. Error rate: 0.2%. Every metric is green. She posts a screenshot to Slack with the caption "AI assistant is running smoothly."
That same morning, a senior engineer on her team receives an email from a customer. The customer had asked the AI assistant to help draft a compliance clause for a vendor contract. The assistant confidently produced a clause with a specific regulatory citation. The customer's legal team almost signed the contract before noticing the cited regulation had been amended eighteen months earlier - the AI had hallucinated a plausible but incorrect version. The customer is threatening to cancel. The dashboard never captured this. The response arrived in 1.4 seconds with a 200 status code. From an infrastructure perspective, it was a perfect interaction.
This gap - between operational health and actual product quality - is the defining challenge of AI product measurement. A traditional web server returns either the right data or an error. An LLM returns text that may be fluent, confident, and completely wrong. Measuring whether that text genuinely helped users requires a fundamentally different approach than measuring whether the bytes arrived. You need three separate signal streams working in concert: explicit user feedback, implicit behavioral signals, and automated quality assessment. Building and maintaining this measurement system is not optional engineering. It is the foundation on which every other AI product decision rests.
Why Measurement Is Different for AI Products
Traditional software quality is mostly binary: the feature works or it does not. An API returns the correct data or throws an exception. A form validates correctly or lets invalid data through. You write tests, they pass, quality is confirmed. Deployment is safe.
AI product quality exists on a spectrum, and that spectrum shifts continuously. The same prompt can produce a helpful response on Monday and a subtly wrong one on Wednesday after a model update. A prompt that works perfectly for 90% of queries may systematically fail for a specific user segment you have not yet tested. Quality can degrade slowly over weeks as user query distributions drift without any code change at all. Conventional unit tests, which assert on fixed inputs and exact outputs, cannot catch these regressions before they reach users.
This is why measurement must be continuous, multi-dimensional, and embedded into the product itself - not bolted on as a post-launch concern.
The Five Signal Types
| Signal Type | Coverage | Latency | Signal Strength | Cost |
|---|---|---|---|---|
| Explicit feedback | 2–5% | Minutes | Very high | Free |
| Implicit behavioral | 100% | Real-time | Medium | Free |
| LLM-as-judge | 5–20% sampled | Minutes | High | Medium |
| Human eval | 0.1–1% | Days | Very high | High |
| Business metrics | Cohort | Days–weeks | Very high | Free |
No single signal type is sufficient. Explicit feedback is sparse but high-signal. Implicit signals are dense but noisy. LLM-as-judge scales but has systematic biases. Human evaluation is highly reliable but slow and expensive, so it covers only a tiny sample. Business metrics are ground truth but lag by days to weeks. A production quality system combines all five.
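Because explicit feedback covers only a few percent of interactions, a raw thumbs-up rate can look more precise than it is. One way to make that sparsity visible is to report a confidence interval next to the rate. The sketch below uses a Wilson score interval; the function name `wilson_interval` and the example counts are illustrative assumptions, not part of the system described in this lesson.

```python
import math


def wilson_interval(positives: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        return (0.0, 1.0)  # no ratings yet: the true rate could be anything
    p = positives / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))


# Hypothetical numbers: 40 thumbs-up out of 50 ratings (an 80% raw rate).
low, high = wilson_interval(positives=40, total=50)
print(f"thumbs-up rate is plausibly between {low:.1%} and {high:.1%}")
```

With only 50 ratings the interval spans roughly twenty percentage points, which is why a small movement in a sparse metric should not trigger decisions on its own.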
Explicit Feedback Collection
# quality/feedback.py
from __future__ import annotations
import asyncio
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import asyncpg
class FeedbackType(str, Enum):
THUMBS_UP = "thumbs_up"
THUMBS_DOWN = "thumbs_down"
COPY = "copy"
REGENERATE = "regenerate"
EDIT = "edit"
SHARE = "share"
REPORT = "report"
class DownvoteCategory(str, Enum):
INCORRECT = "incorrect"
UNHELPFUL = "unhelpful"
TOO_LONG = "too_long"
TOO_SHORT = "too_short"
OFF_TOPIC = "off_topic"
UNSAFE = "unsafe"
OTHER = "other"
@dataclass
class FeedbackEvent:
"""A single feedback signal from a user interaction."""
event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
timestamp: float = field(default_factory=time.time)
user_id: str = ""
session_id: str = ""
message_id: str = ""
feedback_type: FeedbackType = FeedbackType.THUMBS_UP
rating: Optional[int] = None # 1-5 star rating if collected
downvote_category: Optional[DownvoteCategory] = None
free_text: Optional[str] = None # optional comment from user
model: str = ""
prompt_version: str = ""
feature_flag: str = ""
response_length_chars: int = 0
response_latency_ms: int = 0
query_category: Optional[str] = None # "summarization", "coding", "qa", etc.
class FeedbackCollector:
"""
Async feedback collector with in-memory buffer and batch flush.
Design goals:
- Zero latency impact on user requests (fire-and-forget)
- Batch writes to reduce DB pressure
- Never lose feedback even on flush failure
"""
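    def __init__(
        self,
        pool: asyncpg.Pool,
        flush_every_n: int = 50,
        flush_every_seconds: float = 10.0,
    ):
        self._pool = pool
        self._buffer: list[FeedbackEvent] = []
        self._flush_every_n = flush_every_n
        self._lock = asyncio.Lock()
        # Background flush task. The collector must be constructed inside a
        # running event loop. Keep a reference to the task: the loop holds only
        # weak references, so an unreferenced task can be garbage-collected
        # before it finishes.
        self._flush_task = asyncio.create_task(
            self._periodic_flush(flush_every_seconds)
        )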
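    async def record(self, event: FeedbackEvent) -> None:
        """
        Buffer a feedback event. Usually returns immediately, but when the
        buffer reaches flush_every_n this call also awaits the batch write.
        Schedule it with asyncio.create_task(...) from request handlers to
        keep it off the request path.
        """
        async with self._lock:
            self._buffer.append(event)
            if len(self._buffer) >= self._flush_every_n:
                await self._flush_locked()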
async def flush(self) -> int:
"""Force flush all buffered events. Returns count flushed."""
async with self._lock:
return await self._flush_locked()
async def _flush_locked(self) -> int:
if not self._buffer:
return 0
batch = self._buffer[:]
self._buffer.clear()
try:
await self._write_batch(batch)
return len(batch)
except Exception as exc:
# Re-queue on failure - never drop feedback
self._buffer = batch + self._buffer
print(f"[FeedbackCollector] Flush failed, re-queued {len(batch)} events: {exc}")
return 0
async def _write_batch(self, batch: list[FeedbackEvent]) -> None:
"""Bulk insert feedback events to PostgreSQL."""
async with self._pool.acquire() as conn:
await conn.executemany(
"""
INSERT INTO feedback_events (
event_id, timestamp, user_id, session_id, message_id,
feedback_type, rating, downvote_category, free_text,
model, prompt_version, feature_flag,
response_length_chars, response_latency_ms, query_category
) VALUES (
$1, $2, $3, $4, $5, $6, $7, $8, $9,
$10, $11, $12, $13, $14, $15
)
ON CONFLICT (event_id) DO NOTHING
""",
[
(
e.event_id, e.timestamp, e.user_id, e.session_id,
e.message_id, e.feedback_type.value, e.rating,
e.downvote_category.value if e.downvote_category else None,
e.free_text, e.model, e.prompt_version, e.feature_flag,
e.response_length_chars, e.response_latency_ms, e.query_category,
)
for e in batch
],
)
async def _periodic_flush(self, interval_seconds: float) -> None:
"""Background task: flush every N seconds regardless of buffer size."""
while True:
await asyncio.sleep(interval_seconds)
count = await self.flush()
if count > 0:
print(f"[FeedbackCollector] Periodic flush: {count} events written")
async def get_thumbs_metrics(
self,
start_ts: float,
end_ts: float,
group_by: str = "prompt_version",
) -> list[dict]:
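        """
        Return thumbs up/down aggregates grouped by a dimension.
        Useful for A/B test analysis and prompt version comparison.
        """
        # group_by is interpolated into the SQL text below, so restrict it to
        # known column names to rule out SQL injection.
        allowed_group_by = {"prompt_version", "model", "feature_flag", "query_category"}
        if group_by not in allowed_group_by:
            raise ValueError(f"Unsupported group_by column: {group_by!r}")
        async with self._pool.acquire() as conn: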
rows = await conn.fetch(
f"""
SELECT
{group_by},
COUNT(*) FILTER (WHERE feedback_type = 'thumbs_up') AS up,
COUNT(*) FILTER (WHERE feedback_type = 'thumbs_down') AS down,
COUNT(*) AS total,
ROUND(
COUNT(*) FILTER (WHERE feedback_type = 'thumbs_up')::numeric
/ NULLIF(
COUNT(*) FILTER (WHERE feedback_type IN ('thumbs_up','thumbs_down')),
0
),
4
) AS thumbs_up_rate
FROM feedback_events
WHERE timestamp BETWEEN $1 AND $2
AND feedback_type IN ('thumbs_up', 'thumbs_down')
GROUP BY {group_by}
ORDER BY total DESC
""",
start_ts,
end_ts,
)
return [dict(r) for r in rows]
:::tip Rate of feedback, not just thumbs up rate Track the feedback rate (what fraction of interactions receive any feedback) alongside the thumbs up rate. If your feedback rate drops from 8% to 3%, you may be collecting less reliable data, not seeing quality improve. Users stop rating when they feel feedback is ignored. :::
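To make the distinction in the tip concrete, here is a minimal sketch that reports both rates from raw counts. The helper name `feedback_summary` and the monthly counts are made up for illustration.

```python
def feedback_summary(total_interactions: int, thumbs_up: int, thumbs_down: int) -> dict[str, float]:
    """Return the feedback rate and the thumbs-up rate as fractions (0.0 to 1.0)."""
    rated = thumbs_up + thumbs_down
    return {
        # Share of all interactions that received any thumbs rating
        "feedback_rate": rated / total_interactions if total_interactions else 0.0,
        # Share of rated interactions that were positive
        "thumbs_up_rate": thumbs_up / rated if rated else 0.0,
    }


# Hypothetical counts: satisfaction appears to rise while participation collapses.
last_month = feedback_summary(total_interactions=10_000, thumbs_up=640, thumbs_down=160)
this_month = feedback_summary(total_interactions=10_000, thumbs_up=255, thumbs_down=45)
print(last_month)
print(this_month)
```

In this made-up example the thumbs-up rate improves, but it is computed from fewer than half as many ratings, so the improvement deserves less trust than the headline number suggests.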
Implicit Signal Collection
Implicit signals cover 100% of interactions without requiring users to do anything extra. They are the most scalable quality signal you have.
// frontend/quality/implicit-signals.ts
type ImplicitSignalType =
| "copy"
| "share"
| "follow_up_clarification"
| "follow_up_correction"
| "follow_up_elaboration"
| "abandoned_fast" // left < 5s after response appeared
| "read_time"
| "scroll_to_end"
| "expand_collapsed"; // user expanded a long response
interface ImplicitSignal {
type: ImplicitSignalType;
messageId: string;
sessionId: string;
timestamp: number;
value?: number; // duration ms, char count, etc.
metadata?: Record<string, unknown>;
}
class ImplicitSignalCollector {
private signals: ImplicitSignal[] = [];
private flushTimer: ReturnType<typeof setInterval>;
private messageAppearTimestamps = new Map<string, number>();
constructor(
private readonly apiUrl: string,
private readonly sessionId: string,
flushIntervalMs = 15_000,
) {
this.flushTimer = setInterval(() => this.flush(), flushIntervalMs);
// Flush on tab close / navigation
window.addEventListener("visibilitychange", () => {
if (document.visibilityState === "hidden") {
this.flushSync();
}
});
}
// ---- Public tracking methods ----
/** Called when a response finishes streaming and appears to the user. */
onResponseAppeared(messageId: string): void {
this.messageAppearTimestamps.set(messageId, Date.now());
}
/** User copied text from a response. High positive signal. */
trackCopy(messageId: string, copiedChars: number): void {
this.push({
type: "copy",
messageId,
value: copiedChars,
});
}
/**
* Track follow-up query type.
* Call this when processing the NEXT user message after an AI response.
* Classify with a simple regex or small model call.
*/
trackFollowUp(
messageId: string,
followUpType: "clarification" | "correction" | "elaboration",
): void {
const typeMap: Record<string, ImplicitSignalType> = {
clarification: "follow_up_clarification",
correction: "follow_up_correction",
elaboration: "follow_up_elaboration",
};
this.push({
type: typeMap[followUpType],
messageId,
});
}
/**
* Call when user leaves or closes the chat session.
* If response appeared recently, record as fast abandonment (negative signal).
*/
trackSessionEnd(lastMessageId: string): void {
const appearedAt = this.messageAppearTimestamps.get(lastMessageId);
if (appearedAt) {
const durationMs = Date.now() - appearedAt;
if (durationMs < 5_000) {
// Left within 5 seconds of response - likely dissatisfied
this.push({
type: "abandoned_fast",
messageId: lastMessageId,
value: durationMs,
});
}
}
}
/** Track scroll-to-end - correlates with fully engaging responses. */
trackScrollToEnd(messageId: string): void {
this.push({ type: "scroll_to_end", messageId });
}
/** Track time user spent reading the response (IntersectionObserver). */
trackReadTime(messageId: string, durationMs: number): void {
if (durationMs > 500) {
// Only meaningful reads
this.push({ type: "read_time", messageId, value: durationMs });
}
}
// ---- Private helpers ----
private push(partial: Omit<ImplicitSignal, "sessionId" | "timestamp">): void {
this.signals.push({
sessionId: this.sessionId,
timestamp: Date.now(),
...partial,
});
}
private async flush(): Promise<void> {
if (!this.signals.length) return;
const batch = this.signals.splice(0);
try {
await fetch(`${this.apiUrl}/quality/implicit`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ signals: batch }),
keepalive: true, // works even if page is unloading
});
} catch {
// Re-queue signals - never drop
this.signals.unshift(...batch);
}
}
/** Synchronous flush using Beacon API for page-unload scenarios. */
private flushSync(): void {
if (!this.signals.length) return;
const batch = this.signals.splice(0);
navigator.sendBeacon(
`${this.apiUrl}/quality/implicit`,
new Blob([JSON.stringify({ signals: batch })], {
type: "application/json",
}),
);
}
destroy(): void {
clearInterval(this.flushTimer);
this.flush();
}
}
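On the server side, the collected signals still have to be turned into rates. The sketch below is one possible aggregation, assuming the backend receives the same JSON shape the collector posts (`type` and `messageId` fields) and knows the total number of responses served from its own logs. The function name `implicit_rates` is hypothetical.

```python
from collections import defaultdict


def implicit_rates(signals: list[dict], total_responses: int) -> dict[str, float]:
    """Fraction of responses that received each signal type at least once."""
    if total_responses <= 0:
        return {}
    messages_by_type: dict[str, set[str]] = defaultdict(set)
    for signal in signals:
        # A set de-duplicates repeated events, e.g. two copies from one response.
        messages_by_type[signal["type"]].add(signal["messageId"])
    return {
        signal_type: len(message_ids) / total_responses
        for signal_type, message_ids in messages_by_type.items()
    }


# Made-up batch: m1 was copied twice, m2 once, m3 was abandoned quickly.
batch = [
    {"type": "copy", "messageId": "m1"},
    {"type": "copy", "messageId": "m1"},
    {"type": "copy", "messageId": "m2"},
    {"type": "abandoned_fast", "messageId": "m3"},
]
print(implicit_rates(batch, total_responses=4))
```

Counting distinct messages rather than raw events keeps a single enthusiastic user from inflating the copy rate.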
Classifying Follow-Up Intent
One of the most valuable implicit signals is detecting when a user's next message is a clarification request or a correction - indicating the previous response was unclear or wrong.
# quality/followup_classifier.py
import anthropic
_client = anthropic.Anthropic()
def classify_followup_intent(
previous_response: str,
next_query: str,
) -> str:
"""
Classify whether a user's follow-up indicates the AI response was unclear/wrong.
Returns one of: "clarification" | "correction" | "elaboration" | "new_topic"
Uses claude-haiku-4-5-20251001 for speed and low cost.
"""
prompt = f"""Classify the user's follow-up intent. The AI just gave a response and the user sent a new message.
Previous AI response (first 300 chars): {previous_response[:300]}
User's follow-up message: {next_query}
Respond with exactly one word from: clarification, correction, elaboration, new_topic
- clarification: user is asking the AI to explain something it said more clearly
- correction: user is telling the AI it was wrong or inaccurate
- elaboration: user wants more detail on what the AI said
- new_topic: user has moved on to something unrelated
Answer:"""
response = _client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=10,
messages=[{"role": "user", "content": prompt}],
)
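    # Normalize the answer: the model may add quotes or trailing punctuation
    # such as "correction."
    result = response.content[0].text.strip().lower().strip(".\"' ")
    valid = {"clarification", "correction", "elaboration", "new_topic"}
    return result if result in valid else "new_topic"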
:::warning Clarification rate is a lagging indicator of quality A high clarification rate (users frequently asking "what do you mean?" or "can you explain that?") indicates systemic issues with response clarity. But because users often abandon instead of clarifying, a dropping clarification rate can mean either improved quality OR increased abandonment. Always track both together. :::
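The "track both together" advice can be encoded as a small decision rule. This is a sketch only: the function name, the returned labels, and the one-percentage-point `tolerance` are assumptions to tune for your own traffic.

```python
def interpret_clarification_trend(
    clarification_before: float,
    clarification_after: float,
    abandonment_before: float,
    abandonment_after: float,
    tolerance: float = 0.01,  # absolute change treated as noise (assumed value)
) -> str:
    """Read a clarification-rate change in the context of abandonment."""
    clarification_delta = clarification_after - clarification_before
    abandonment_delta = abandonment_after - abandonment_before

    if clarification_delta > tolerance:
        return "clarity_regression"  # more users are asking what the response meant
    if clarification_delta < -tolerance:
        if abandonment_delta > tolerance:
            # Fewer clarifications AND more abandonment: users may be giving up.
            return "possible_silent_abandonment"
        return "likely_improvement"
    return "no_clear_change"


# Hypothetical week-over-week rates.
print(interpret_clarification_trend(0.12, 0.08, 0.10, 0.16))
print(interpret_clarification_trend(0.12, 0.08, 0.10, 0.10))
```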
LLM-as-Judge: Automated Quality Assessment
LLM-as-judge is the only approach that scales to thousands of evaluations per hour while providing structured quality scores. The key engineering challenges are: controlling cost through sampling, managing judge bias, and calibrating scores against human labels.
# quality/llm_judge.py
from __future__ import annotations
import asyncio
import json
import random
import time
from dataclasses import dataclass, field
from typing import Optional
import anthropic
@dataclass
class JudgementResult:
message_id: str
timestamp: float = field(default_factory=time.time)
# Dimension scores (0.0 to 1.0)
helpfulness: float = 0.0
accuracy: float = 0.0
conciseness: float = 0.0
safety: float = 1.0
# Composite
overall_score: float = 0.0
passed: bool = True
# Analysis
reasoning: str = ""
flags: list[str] = field(default_factory=list)
# Meta
judge_model: str = ""
judge_latency_ms: int = 0
judge_run_id: str = ""
JUDGE_SYSTEM_PROMPT = """You are an expert evaluator of AI assistant responses. Your job is to assess whether a response genuinely helps the user who asked the question.
You evaluate on four dimensions:
1. Helpfulness (0-10): Does the response directly address what the user needs? Does it give actionable, specific guidance?
2. Accuracy (0-10): Is the information factually correct? Are code examples syntactically valid? Are citations real?
3. Conciseness (0-10): Is the response appropriately sized for the question? Verbose responses that bury the answer score low.
4. Safety (0-10): Is the content appropriate? Does it avoid harmful, biased, or misleading content?
Flag any of these issues if present: hallucination, off_topic, incorrect_code, too_verbose, too_brief, harmful_content, refusal_without_reason, formatting_broken, stale_information
Always respond with valid JSON only. No markdown, no explanation outside the JSON object."""
JUDGE_USER_TEMPLATE = """Evaluate this AI assistant interaction:
USER QUERY:
{query}
AI RESPONSE:
{response}
{reference_section}
Respond with this exact JSON structure:
{{
"helpfulness": <0-10>,
"accuracy": <0-10>,
"conciseness": <0-10>,
"safety": <0-10>,
"reasoning": "<one sentence explaining the key quality issue or strength>",
"flags": [<list of issue flags, empty if none>]
}}"""
class LLMJudge:
"""
Async LLM-as-judge for continuous quality monitoring.
Sampling strategy:
- 10% of production traffic for monitoring
- 100% for regression testing after deploys
- 100% for A/B test evaluation windows
Judge model: claude-haiku-4-5-20251001 for cost efficiency.
Use claude-opus-4-6 only for calibration runs and critical evals.
"""
# Weights for computing overall score from dimensions
WEIGHTS = {
"helpfulness": 0.40,
"accuracy": 0.35,
"conciseness": 0.10,
"safety": 0.15,
}
PASS_THRESHOLD = 0.60 # overall score >= 0.60 is "passing"
SAFETY_HARD_FLOOR = 0.50 # any response with safety < 0.50 auto-fails
def __init__(
self,
judge_model: str = "claude-haiku-4-5-20251001",
high_stakes_model: str = "claude-opus-4-6",
):
self._client = anthropic.AsyncAnthropic()
self.judge_model = judge_model
self.high_stakes_model = high_stakes_model
async def evaluate(
self,
message_id: str,
query: str,
response: str,
reference_answer: Optional[str] = None,
high_stakes: bool = False,
) -> JudgementResult:
"""
Evaluate a single (query, response) pair.
Args:
message_id: The ID of the AI message being evaluated.
query: The user's original query (truncated to 600 chars internally).
response: The AI's response (truncated to 1500 chars internally).
reference_answer: Optional golden answer for comparison.
high_stakes: If True, use Opus for more reliable evaluation.
"""
model = self.high_stakes_model if high_stakes else self.judge_model
reference_section = ""
if reference_answer:
reference_section = f"\nREFERENCE ANSWER (ground truth):\n{reference_answer[:400]}"
user_content = JUDGE_USER_TEMPLATE.format(
query=query[:600],
response=response[:1500],
reference_section=reference_section,
)
t0 = time.monotonic()
result = JudgementResult(message_id=message_id, judge_model=model)
try:
api_response = await self._client.messages.create(
model=model,
max_tokens=300,
system=JUDGE_SYSTEM_PROMPT,
messages=[{"role": "user", "content": user_content}],
)
latency_ms = int((time.monotonic() - t0) * 1000)
result.judge_latency_ms = latency_ms
raw = api_response.content[0].text.strip()
scores = json.loads(raw)
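            # Clamp to [0, 10] before normalizing; a judge can occasionally
            # emit out-of-range or negative values.
            result.helpfulness = max(0.0, min(float(scores.get("helpfulness", 5)), 10.0)) / 10.0
            result.accuracy = max(0.0, min(float(scores.get("accuracy", 5)), 10.0)) / 10.0
            result.conciseness = max(0.0, min(float(scores.get("conciseness", 5)), 10.0)) / 10.0
            result.safety = max(0.0, min(float(scores.get("safety", 10)), 10.0)) / 10.0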
result.reasoning = scores.get("reasoning", "")
result.flags = scores.get("flags", [])
result.overall_score = round(
result.helpfulness * self.WEIGHTS["helpfulness"]
+ result.accuracy * self.WEIGHTS["accuracy"]
+ result.conciseness * self.WEIGHTS["conciseness"]
+ result.safety * self.WEIGHTS["safety"],
4,
)
result.passed = (
result.overall_score >= self.PASS_THRESHOLD
and result.safety >= self.SAFETY_HARD_FLOOR
and "harmful_content" not in result.flags
)
        except Exception as exc:
            # Covers API errors as well as unparseable judge output
            # (json.JSONDecodeError and KeyError are subclasses of Exception).
            # Do not penalize the message; flag it for review instead.
            result.flags = ["judge_parse_failure"]
            result.reasoning = f"Judge call failed or output could not be parsed: {exc}"
            result.passed = True  # assume pass on judge failure
            result.overall_score = 0.5
return result
async def evaluate_batch(
self,
samples: list[tuple[str, str, str]], # (message_id, query, response)
sample_rate: float = 0.10,
concurrency: int = 5,
reference_answers: Optional[dict[str, str]] = None,
) -> dict:
"""
Evaluate a batch of interactions.
Args:
samples: List of (message_id, query, response) tuples.
sample_rate: Fraction of samples to actually evaluate (cost control).
concurrency: Max parallel judge calls.
reference_answers: Optional dict of message_id -> reference answer.
"""
reference_answers = reference_answers or {}
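        # Deterministic sampling by message_id hash - same message always sampled/skipped.
        # Bucket into 10,000 slots and round the threshold so that rates such as
        # 0.29 are not truncated by floating-point error (int(0.29 * 100) == 28).
        import hashlib
        threshold = round(sample_rate * 10_000)
        to_evaluate = [
            s for s in samples
            if int(hashlib.md5(s[0].encode()).hexdigest(), 16) % 10_000 < threshold
        ]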
if not to_evaluate:
return {"evaluated": 0, "skipped": len(samples)}
# Rate-limited evaluation with semaphore
sem = asyncio.Semaphore(concurrency)
async def eval_one(sample: tuple[str, str, str]) -> JudgementResult:
async with sem:
mid, query, response = sample
return await self.evaluate(
message_id=mid,
query=query,
response=response,
reference_answer=reference_answers.get(mid),
)
results = await asyncio.gather(*[eval_one(s) for s in to_evaluate])
passing = [r for r in results if r.passed]
flag_counts: dict[str, int] = {}
for r in results:
for flag in r.flags:
flag_counts[flag] = flag_counts.get(flag, 0) + 1
return {
"evaluated": len(results),
"skipped": len(samples) - len(results),
"pass_rate": round(len(passing) / len(results), 4),
"avg_overall_score": round(
sum(r.overall_score for r in results) / len(results), 4
),
"avg_helpfulness": round(
sum(r.helpfulness for r in results) / len(results), 4
),
"avg_accuracy": round(
sum(r.accuracy for r in results) / len(results), 4
),
"avg_conciseness": round(
sum(r.conciseness for r in results) / len(results), 4
),
"flag_distribution": dict(
sorted(flag_counts.items(), key=lambda x: -x[1])
),
"low_quality_count": len(results) - len(passing),
}
:::danger LLM judge bias: longer is not better LLM judges have a well-documented length bias - they consistently rate longer responses higher than shorter ones, even when the short response is objectively more appropriate. Counter this by adding explicit instructions to your judge system prompt: "A concise answer that fully addresses the question is better than a verbose one. Do not reward length." Also: calibrate your judge against a human-labeled dataset and measure its precision/recall on the "too_verbose" flag. :::
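Measuring the judge's precision and recall on a single flag needs only paired labels for the same messages. The sketch below assumes you already have judge flags and human flags as parallel lists of sets. The function name `flag_precision_recall` and the sample labels are illustrative.

```python
def flag_precision_recall(
    judge_flags: list[set[str]],
    human_flags: list[set[str]],
    flag: str,
) -> tuple[float, float]:
    """Precision and recall of the judge on one flag, with human labels as ground truth."""
    if len(judge_flags) != len(human_flags):
        raise ValueError("judge_flags and human_flags must cover the same messages")
    true_pos = false_pos = false_neg = 0
    for judged, labelled in zip(judge_flags, human_flags):
        if flag in judged and flag in labelled:
            true_pos += 1
        elif flag in judged:
            false_pos += 1  # judge raised the flag, humans did not
        elif flag in labelled:
            false_neg += 1  # humans raised the flag, judge missed it
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Made-up calibration set of four messages.
judge = [{"too_verbose"}, {"too_verbose"}, set(), {"hallucination"}]
human = [{"too_verbose"}, set(), {"too_verbose"}, {"hallucination"}]
precision, recall = flag_precision_recall(judge, human, "too_verbose")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low recall on `too_verbose` is the pattern to expect if the judge suffers from length bias: humans flag verbosity that the judge lets through.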
Quality Regression Detection
A quality regression is when a metric drops significantly after a change - prompt update, model version change, feature flag, or even drift in user query distribution. Detecting regressions fast is what prevents "one bad prompt change" from becoming "we churned 20% of paid users this week."
# quality/regression_detector.py
from __future__ import annotations
import math
from dataclasses import dataclass
from typing import Optional
@dataclass
class MetricConfig:
"""Configuration for a single quality metric."""
name: str
direction: str # "higher_better" or "lower_better"
warning_threshold: float # relative change that triggers WARNING (e.g., 0.05 = 5%)
critical_threshold: float# relative change that triggers CRITICAL (e.g., 0.10 = 10%)
min_sample_size: int = 100 # minimum n before comparison is meaningful
STANDARD_METRICS: list[MetricConfig] = [
MetricConfig("thumbs_up_rate", "higher_better", 0.05, 0.10),
MetricConfig("copy_rate", "higher_better", 0.05, 0.10),
MetricConfig("avg_llm_judge_score", "higher_better", 0.04, 0.08),
MetricConfig("avg_helpfulness_score", "higher_better", 0.04, 0.08),
MetricConfig("avg_accuracy_score", "higher_better", 0.04, 0.08),
MetricConfig("hallucination_rate", "lower_better", 0.02, 0.05),
MetricConfig("follow_up_clarification_rate","lower_better", 0.03, 0.08),
MetricConfig("abandonment_rate", "lower_better", 0.03, 0.08),
MetricConfig("refusal_rate", "lower_better", 0.03, 0.07),
]
def _proportional_z_test(p1: float, n1: int, p2: float, n2: int) -> float:
    """
    Two-proportion Z-test for rate metrics.
    Returns the two-sided p-value (lower = more statistically significant difference).
    """
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs((p1 - p2) / se)
    # Two-sided p-value under the normal approximation:
    # p = 2 * (1 - Phi(z)) = erfc(z / sqrt(2))
    p_value = math.erfc(z / math.sqrt(2))
    return min(1.0, max(0.0, p_value))
@dataclass
class RegressionSignal:
metric: str
before_value: float
after_value: float
relative_change: float
severity: str # "warning" | "critical"
p_value: float
statistically_significant: bool
message: str
def detect_quality_regression(
before: dict[str, float],
after: dict[str, float],
before_n: int = 1000,
after_n: int = 1000,
metrics: Optional[list[MetricConfig]] = None,
significance_level: float = 0.05,
) -> dict:
"""
Compare quality metrics between two cohorts (e.g., before/after a deploy).
Args:
before: Dict of metric_name -> value for the baseline period.
after: Dict of metric_name -> value for the comparison period.
before_n: Number of interactions in the baseline period.
after_n: Number of interactions in the comparison period.
metrics: Metric configurations (defaults to STANDARD_METRICS).
significance_level: P-value threshold for statistical significance.
Returns:
Analysis dict with regressions, improvements, and recommendation.
"""
metrics = metrics or STANDARD_METRICS
regressions: list[RegressionSignal] = []
improvements: list[RegressionSignal] = []
for m in metrics:
b_val = before.get(m.name)
a_val = after.get(m.name)
if b_val is None or a_val is None or b_val == 0:
continue
if min(before_n, after_n) < m.min_sample_size:
continue # Not enough data to be meaningful
# Relative change, sign-normalized to "positive = better"
raw_change = (a_val - b_val) / abs(b_val)
signed_change = raw_change if m.direction == "higher_better" else -raw_change
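        # Statistical significance for rate metrics.
        # Caveat: before_n / after_n should be the denominator the metric was
        # computed over (e.g., rated interactions for thumbs_up_rate, judged
        # samples for judge metrics). Passing total interactions overstates
        # significance for sparsely sampled metrics.
        p_value = 1.0
        if 0 < b_val < 1 and 0 < a_val < 1:
            p_value = _proportional_z_test(b_val, before_n, a_val, after_n)
        significant = p_value < significance_level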
if signed_change < 0:
# Potential regression
abs_change = abs(signed_change)
if abs_change >= m.critical_threshold:
severity = "critical"
elif abs_change >= m.warning_threshold:
severity = "warning"
else:
continue # Below threshold - noise
regressions.append(RegressionSignal(
metric=m.name,
before_value=b_val,
after_value=a_val,
relative_change=round(signed_change * 100, 2),
severity=severity,
p_value=round(p_value, 4),
statistically_significant=significant,
message=(
f"{m.name}: {b_val:.3f} → {a_val:.3f} "
f"({signed_change*100:+.1f}%, "
f"{'significant' if significant else 'not yet significant'})"
),
))
elif signed_change > m.warning_threshold:
improvements.append(RegressionSignal(
metric=m.name,
before_value=b_val,
after_value=a_val,
relative_change=round(signed_change * 100, 2),
severity="improvement",
p_value=round(p_value, 4),
statistically_significant=significant,
message=(
f"{m.name}: {b_val:.3f} → {a_val:.3f} "
f"({signed_change*100:+.1f}%)"
),
))
critical_regressions = [r for r in regressions if r.severity == "critical"]
significant_regressions = [r for r in regressions if r.statistically_significant]
if critical_regressions:
recommendation = "ROLLBACK RECOMMENDED - critical regression detected in key quality metric(s)"
elif significant_regressions:
recommendation = "PAUSE ROLLOUT - statistically significant regression detected, investigate before proceeding"
elif regressions:
recommendation = "MONITOR - potential regression detected, not yet statistically significant, increase sample size"
else:
recommendation = "PROCEED - no quality regression detected"
return {
"regression_detected": len(regressions) > 0,
"critical_regression": len(critical_regressions) > 0,
"regressions": [
{
"metric": r.metric,
"before": r.before_value,
"after": r.after_value,
"change_pct": r.relative_change,
"severity": r.severity,
"significant": r.statistically_significant,
"p_value": r.p_value,
}
for r in regressions
],
"improvements": [
{"metric": r.metric, "change_pct": r.relative_change}
for r in improvements
],
"recommendation": recommendation,
"sample_sizes": {"before": before_n, "after": after_n},
}
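When the detector answers "MONITOR ... increase sample size", the natural follow-up question is how much data is enough. The sketch below estimates the per-cohort sample size for a two-sided two-proportion z-test using the standard normal-approximation formula. The function name, the 80% power default, and the example rates are assumptions for illustration, not values from the detector above.

```python
import math
from statistics import NormalDist


def required_sample_size(
    p_before: float,
    p_after: float,
    alpha: float = 0.05,  # two-sided significance level
    power: float = 0.80,  # probability of detecting a real change of this size
) -> int:
    """Per-cohort n needed to detect a shift from p_before to p_after."""
    if p_before == p_after:
        raise ValueError("Rates must differ to compute a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_before + p_after) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_before * (1 - p_before) + p_after * (1 - p_after))
    ) ** 2
    return math.ceil(numerator / (p_before - p_after) ** 2)


# Hypothetical: thumbs-up rate falls from 80% to 76% (a 5% relative drop).
n = required_sample_size(0.80, 0.76)
print(f"Rated interactions needed per cohort: {n}")
```

Note that `n` counts observations of the metric itself. For a sparse metric such as thumbs-up rate, a 3% feedback rate means roughly 33 times as many total interactions are needed to collect that many ratings.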
The Composite Quality Score
# quality/composite_score.py
from __future__ import annotations
from dataclasses import dataclass
@dataclass
class QualityMetrics:
# Volume
total_interactions: int = 0
unique_users: int = 0
# Explicit (sparse - typically 2-5% coverage)
thumbs_up_rate: float = 0.0 # % of rated interactions that are positive
thumbs_feedback_rate: float = 0.0 # % of interactions that received any rating
avg_star_rating: float = 0.0 # 0.0 if not collected
# Implicit (dense - 100% coverage)
copy_rate: float = 0.0 # % of responses where user copied text
abandonment_rate: float = 0.0 # % of sessions ending < 5s after response
follow_up_clarification_rate: float = 0.0 # % of next messages that are clarifications
follow_up_correction_rate: float = 0.0 # % of next messages that are corrections
scroll_to_end_rate: float = 0.0 # % of responses read to the end
# Automated evaluation
avg_judge_score: float = 0.0 # LLM judge overall score (0.0–1.0)
avg_helpfulness: float = 0.0
avg_accuracy: float = 0.0
hallucination_rate: float = 0.0 # % of evaluated responses flagged
refusal_rate: float = 0.0 # % of requests refused (not harmful = problem)
judge_sample_rate: float = 0.0 # what fraction was actually judged
# Performance
avg_ttft_ms: float = 0.0
p95_total_latency_ms: float = 0.0
class CompositeQualityScorer:
"""
Compute a single 0.0–1.0 quality score from multiple signal types.
    The weights (40/30/30) reflect the relative reliability and coverage of
    each signal type. Explicit feedback is highest-signal when available
    but sparse. Implicit covers everything. Automated eval is consistent and
    scalable but carries judge biases, so it requires careful calibration.
Adjust weights based on your product's feedback coverage rate.
If your feedback rate is below 2%, reduce explicit weight and increase
implicit weight.
"""
WEIGHTS_EXPLICIT = 0.40
WEIGHTS_IMPLICIT = 0.30
WEIGHTS_AUTO = 0.30
def score(self, m: QualityMetrics) -> float:
explicit = self._explicit_score(m)
implicit = self._implicit_score(m)
auto = self._auto_score(m)
# If judge sample rate is very low, shift weight to implicit
if m.judge_sample_rate < 0.03:
w_implicit = self.WEIGHTS_IMPLICIT + self.WEIGHTS_AUTO * 0.5
w_auto = self.WEIGHTS_AUTO * 0.5
else:
w_implicit = self.WEIGHTS_IMPLICIT
w_auto = self.WEIGHTS_AUTO
composite = (
explicit * self.WEIGHTS_EXPLICIT
+ implicit * w_implicit
+ auto * w_auto
)
return round(composite, 4)

    def _explicit_score(self, m: QualityMetrics) -> float:
        """
        Weight: 40%
        Primary signal: thumbs up rate.
        Secondary: average star rating if collected.
        """
        if m.thumbs_feedback_rate < 0.01:
            # Too sparse - return neutral rather than mislead
            return 0.5
        score = m.thumbs_up_rate
        if m.avg_star_rating > 0:
            # Normalize star rating to 0-1 (assuming 1-5 scale)
            star_normalized = (m.avg_star_rating - 1) / 4.0
            score = score * 0.7 + star_normalized * 0.3
        return score

    def _implicit_score(self, m: QualityMetrics) -> float:
        """
        Weight: 30%
        Positive signals: copy rate, scroll-to-end rate.
        Negative signals (inverted): abandonment, clarification, correction rates.
        """
        positive = (
            m.copy_rate * 0.40
            + m.scroll_to_end_rate * 0.20
        )
        # Each negative rate is inverted (1 - rate), so a higher value is better
        negative = (
            (1.0 - m.abandonment_rate) * 0.20
            + (1.0 - m.follow_up_clarification_rate) * 0.12
            + (1.0 - m.follow_up_correction_rate) * 0.08
        )
        return positive + negative

    def _auto_score(self, m: QualityMetrics) -> float:
        """
        Weight: 30%
        Primary: LLM judge overall score.
        Penalty: hallucination rate, refusal rate.
        """
        if m.avg_judge_score == 0:
            return 0.5  # No judge data yet
        score = (
            m.avg_judge_score * 0.70
            + (1.0 - m.hallucination_rate) * 0.20
            + (1.0 - min(m.refusal_rate, 0.20) * 5) * 0.10  # cap refusal penalty
        )
        return max(0.0, min(1.0, score))

    @staticmethod
    def grade(score: float) -> str:
        if score >= 0.88:
            return "A+"
        elif score >= 0.82:
            return "A"
        elif score >= 0.75:
            return "B"
        elif score >= 0.65:
            return "C"
        elif score >= 0.50:
            return "D"
        return "F"

    @staticmethod
    def generate_alerts(m: QualityMetrics) -> list[dict]:
        """Generate actionable quality alerts for the dashboard."""
        alerts = []
        if m.thumbs_up_rate < 0.70 and m.thumbs_feedback_rate > 0.02:
            alerts.append({
                "severity": "warning",
                "metric": "thumbs_up_rate",
                "message": f"Satisfaction rate {m.thumbs_up_rate:.1%} is below target (70%). "
                           "Review recent thumbs-down feedback for patterns.",
            })
        if m.hallucination_rate > 0.05:
            alerts.append({
                "severity": "critical",
                "metric": "hallucination_rate",
                "message": f"Hallucination rate {m.hallucination_rate:.1%} exceeds 5% threshold. "
                           "Audit recent flagged responses immediately.",
            })
        if m.follow_up_correction_rate > 0.08:
            alerts.append({
                "severity": "warning",
                "metric": "follow_up_correction_rate",
                "message": f"User correction rate {m.follow_up_correction_rate:.1%} is high. "
                           "Users are frequently telling the AI it was wrong. Review accuracy.",
            })
        if m.abandonment_rate > 0.15:
            alerts.append({
                "severity": "warning",
                "metric": "abandonment_rate",
                "message": f"Abandonment rate {m.abandonment_rate:.1%} is high. "
                           "Users are leaving quickly after responses - check response quality and latency.",
            })
        if m.avg_ttft_ms > 3000:
            alerts.append({
                "severity": "warning",
                "metric": "avg_ttft_ms",
                "message": f"Average TTFT {m.avg_ttft_ms:.0f}ms exceeds 3 seconds. "
                           "Consider prompt caching or model tier downgrade for simpler queries.",
            })
        if m.refusal_rate > 0.10:
            alerts.append({
                "severity": "warning",
                "metric": "refusal_rate",
                "message": f"Refusal rate {m.refusal_rate:.1%} is high. "
                           "Review system prompt - over-restrictive guardrails may be blocking legitimate queries.",
            })
        return alerts
Quality Segmentation
Overall quality scores hide problems in specific segments. An assistant with an overall score of 0.82 could be failing 40% of coding queries while excelling at summarization. Segmented measurement surfaces these gaps.
# quality/segmentation.py
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

import anthropic

_client = anthropic.Anthropic()


def classify_query_category(query: str) -> str:
    """
    Classify a user query into a category for segmented quality analysis.
    Uses claude-haiku-4-5-20251001 for speed and cost.
    Categories: summarization | coding | qa | creative | analysis |
    data_extraction | planning | comparison | explanation | other
    """
    response = _client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=15,
        messages=[{
            "role": "user",
            "content": (
                "Classify this user query into exactly one category. "
                "Reply with only the category name.\n\n"
                "Categories: summarization, coding, qa, creative, analysis, "
                "data_extraction, planning, comparison, explanation, other\n\n"
                f"Query: {query[:300]}"
            ),
        }],
    )
    result = response.content[0].text.strip().lower()
    valid = {
        "summarization", "coding", "qa", "creative", "analysis",
        "data_extraction", "planning", "comparison", "explanation", "other",
    }
    # Fall back to "other" if the model replies with anything unexpected
    return result if result in valid else "other"


@dataclass
class SegmentQualityReport:
    segment_key: str    # e.g., "query_category"
    segment_value: str  # e.g., "coding"
    interaction_count: int
    thumbs_up_rate: Optional[float]
    avg_judge_score: Optional[float]
    copy_rate: float
    abandonment_rate: float
    composite_score: float
    grade: str
    top_issues: list[str] = field(default_factory=list)


def build_segment_report(
    db_rows: list[dict],  # rows from analytics DB, pre-grouped by segment
    scorer,               # CompositeQualityScorer instance
) -> list[SegmentQualityReport]:
    """
    Build quality reports for each segment.
    Surfaces which query categories or user tiers have quality problems.
    """
    from quality.composite_score import QualityMetrics, CompositeQualityScorer

    reports = []
    for row in db_rows:
        metrics = QualityMetrics(
            total_interactions=row.get("count", 0),
            thumbs_up_rate=row.get("thumbs_up_rate", 0.0),
            thumbs_feedback_rate=row.get("feedback_rate", 0.0),
            copy_rate=row.get("copy_rate", 0.0),
            abandonment_rate=row.get("abandonment_rate", 0.0),
            follow_up_clarification_rate=row.get("clarification_rate", 0.0),
            follow_up_correction_rate=row.get("correction_rate", 0.0),
            avg_judge_score=row.get("avg_judge_score", 0.0),
            hallucination_rate=row.get("hallucination_rate", 0.0),
            refusal_rate=row.get("refusal_rate", 0.0),
            judge_sample_rate=row.get("judge_sample_rate", 0.0),
        )
        score = scorer.score(metrics)
        reports.append(SegmentQualityReport(
            segment_key=row["segment_key"],
            segment_value=row["segment_value"],
            interaction_count=row.get("count", 0),
            thumbs_up_rate=row.get("thumbs_up_rate"),
            avg_judge_score=row.get("avg_judge_score"),
            copy_rate=row.get("copy_rate", 0.0),
            abandonment_rate=row.get("abandonment_rate", 0.0),
            composite_score=score,
            grade=scorer.grade(score),
            top_issues=[
                a["metric"]
                for a in scorer.generate_alerts(metrics)
            ],
        ))
    # Sort by score ascending - worst segments first for actionability
    return sorted(reports, key=lambda r: r.composite_score)
The Quality Dashboard API
# api/quality_routes.py
from __future__ import annotations

import time
from typing import Optional

from fastapi import APIRouter, Depends, Query

router = APIRouter(prefix="/api/quality", tags=["quality"])


@router.get("/dashboard")
async def get_quality_dashboard(
    window_hours: int = Query(default=24, ge=1, le=168),
    # `pattern` replaces the deprecated `regex` argument (FastAPI >= 0.100);
    # on older FastAPI versions use `regex=` instead.
    segment_by: Optional[str] = Query(default=None, pattern="^(query_category|user_tier|prompt_version|model)$"),
    # In production: db = Depends(get_db), current_user = Depends(require_admin)
):
    """
    Returns the full quality dashboard for a time window.
    Query params:
    window_hours: Hours to look back (1–168, default 24)
    segment_by: Optional dimension to segment by
    """
    end_ts = time.time()
    start_ts = end_ts - window_hours * 3600
    # In production: query from analytics DB
    # These are illustrative mock values
    from quality.composite_score import CompositeQualityScorer, QualityMetrics

    metrics = QualityMetrics(
        total_interactions=18_432,
        unique_users=1_203,
        thumbs_up_rate=0.831,
        thumbs_feedback_rate=0.042,
        avg_star_rating=0.0,
        copy_rate=0.447,
        abandonment_rate=0.061,
        follow_up_clarification_rate=0.073,
        follow_up_correction_rate=0.031,
        scroll_to_end_rate=0.612,
        avg_judge_score=0.791,
        avg_helpfulness=0.814,
        avg_accuracy=0.778,
        hallucination_rate=0.019,
        refusal_rate=0.028,
        judge_sample_rate=0.10,
        avg_ttft_ms=412,
        p95_total_latency_ms=6_240,
    )
    scorer = CompositeQualityScorer()
    composite = scorer.score(metrics)
    return {
        "period": {
            "start_ts": start_ts,
            "end_ts": end_ts,
            "window_hours": window_hours,
        },
        "composite_quality_score": composite,
        "grade": scorer.grade(composite),
        "volume": {
            "total_interactions": metrics.total_interactions,
            "unique_users": metrics.unique_users,
        },
        "explicit_feedback": {
            "thumbs_up_rate": metrics.thumbs_up_rate,
            "feedback_rate": metrics.thumbs_feedback_rate,
            "avg_star_rating": metrics.avg_star_rating or None,
        },
        "behavioral_signals": {
            "copy_rate": metrics.copy_rate,
            "abandonment_rate": metrics.abandonment_rate,
            "follow_up_clarification_rate": metrics.follow_up_clarification_rate,
            "follow_up_correction_rate": metrics.follow_up_correction_rate,
            "scroll_to_end_rate": metrics.scroll_to_end_rate,
        },
        "automated_quality": {
            "avg_judge_score": metrics.avg_judge_score,
            "avg_helpfulness": metrics.avg_helpfulness,
            "avg_accuracy": metrics.avg_accuracy,
            "hallucination_rate": metrics.hallucination_rate,
            "refusal_rate": metrics.refusal_rate,
            "judge_sample_rate": metrics.judge_sample_rate,
        },
        "performance": {
            "avg_ttft_ms": metrics.avg_ttft_ms,
            "p95_total_latency_ms": metrics.p95_total_latency_ms,
        },
        "alerts": scorer.generate_alerts(metrics),
    }


@router.get("/regression")
async def check_regression(
    before_hours: int = Query(default=48),
    after_hours: int = Query(default=24),
):
    """
    Compare quality between two time windows to detect regression.
    Useful post-deploy or post-prompt-change.
    """
    # In production: query metrics for both windows from DB
    # Mock comparison for illustration
    from quality.regression_detector import detect_quality_regression

    before = {
        "thumbs_up_rate": 0.831,
        "copy_rate": 0.447,
        "avg_llm_judge_score": 0.791,
        "hallucination_rate": 0.019,
        "follow_up_clarification_rate": 0.073,
        "abandonment_rate": 0.061,
        "refusal_rate": 0.028,
    }
    after = {
        "thumbs_up_rate": 0.798,  # -4.0% change
        "copy_rate": 0.441,
        "avg_llm_judge_score": 0.751,  # -5.1% - would trigger warning
        "hallucination_rate": 0.031,  # +63% increase - would trigger critical
        "follow_up_clarification_rate": 0.089,
        "abandonment_rate": 0.065,
        "refusal_rate": 0.027,
    }
    return detect_quality_regression(
        before=before,
        after=after,
        before_n=12_000,
        after_n=6_400,
    )
Production Engineering Notes
Copy rate is your most reliable proxy metric. Most users never click thumbs up or thumbs down. But copy-to-clipboard is an unprompted, low-effort action - users copy text because they are about to use it somewhere. Track copy events on every response using a clipboard event listener. A copy rate above 35% is a strong signal of genuine utility. Below 20%, investigate.
Sample deterministically, not randomly. When sampling 10% of interactions for LLM judge evaluation, hash the message ID rather than rolling a random number. This ensures the same message is always in-scope or always out-of-scope across time windows. Without deterministic sampling, comparing "before/after" across windows becomes unreliable because the sampled populations differ.
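A minimal sketch of hash-based sampling, assuming message IDs are strings. The function name `in_judge_sample` and the choice of SHA-256 are illustrative assumptions, not part of the original system; any stable hash would serve.

```python
import hashlib


def in_judge_sample(message_id: str, sample_rate: float = 0.10) -> bool:
    """Return True if this message falls inside the judge sample.

    The decision depends only on the message ID, so the same message
    is always in scope (or always out of scope) on every run.
    """
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    ids = [f"msg-{i}" for i in range(10_000)]
    sampled = [m for m in ids if in_judge_sample(m)]
    print(f"sampled {len(sampled)} of {len(ids)} messages")
```

Python's built-in `hash()` is deliberately avoided here: it is salted per process for strings, so it would not give the same answer across runs.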
Separate judge models from generation models. If you generate responses with claude-opus-4-6, run routine evaluation with claude-haiku-4-5-20251001 (lower cost, fast). For critical calibration runs, do not let claude-opus-4-6 grade its own outputs; calibrate against human labels or a strong judge from a different model family. The rule is: never evaluate a model's outputs with itself. Judge models have a systematic preference for their own generation style, which inflates scores.
Build a golden evaluation set early. Before launch, assemble 200–500 (query, ideal response) pairs across your primary query categories. Have 2–3 humans label them. Run this set through your judge and measure correlation with human labels. This is your judge calibration baseline. Re-run it weekly. If judge-human correlation drops below 0.80, your judge is drifting.
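The calibration check can be sketched as follows. This assumes human labels and judge scores are both numeric and aligned by index, and it uses Pearson correlation; the text does not say which correlation measure is meant, so Spearman (rank) correlation may be a better fit for ordinal labels. The function names are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def judge_is_calibrated(
    human_scores: list[float],
    judge_scores: list[float],
    threshold: float = 0.80,  # drift threshold quoted in the text
) -> bool:
    """True if judge-human correlation is at or above the threshold."""
    return pearson(human_scores, judge_scores) >= threshold


if __name__ == "__main__":
    human = [0.9, 0.4, 0.7, 0.2, 0.8]
    judge = [0.85, 0.5, 0.65, 0.3, 0.9]
    print(round(pearson(human, judge), 3), judge_is_calibrated(human, judge))
```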
Quality degrades silently on query distribution shift. A model performs well on the query distribution it was tested with. As your user base grows, the query distribution shifts - new user types, new use cases, new phrasing patterns. Quality can drop 15% without any code or model change. Monitor quality per query category and alert when a category's score drops, even if overall quality holds steady.
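One way to sketch the per-category alert, assuming you already have a baseline and a current composite score for each query category. The function name and the 5% relative-drop threshold are illustrative assumptions; tune the threshold to your own noise level.

```python
from __future__ import annotations


def categories_with_drops(
    baseline: dict[str, float],
    current: dict[str, float],
    max_relative_drop: float = 0.05,  # hypothetical alert threshold
) -> list[tuple[str, float]]:
    """Return (category, relative_drop) pairs that exceed the threshold,
    worst first. Categories missing from `current` are skipped."""
    drops = []
    for category, base_score in baseline.items():
        now = current.get(category)
        if now is None or base_score <= 0:
            continue
        relative_drop = (base_score - now) / base_score
        if relative_drop > max_relative_drop:
            drops.append((category, round(relative_drop, 4)))
    return sorted(drops, key=lambda d: d[1], reverse=True)


if __name__ == "__main__":
    baseline = {"coding": 0.80, "summarization": 0.90}
    current = {"coding": 0.68, "summarization": 0.91}
    # Coding dropped 15% while summarization improved, so only
    # coding is reported even though the average barely moved.
    print(categories_with_drops(baseline, current))
```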
:::info The north star for AI product quality The single most important question is: "Did users accomplish what they came to do?" Everything else - thumbs up rates, judge scores, latency - is a proxy for that. Periodically run task completion studies: recruit 20 users, give them representative tasks, observe completion rates. When proxy metrics improve but task completion does not, your proxies have drifted from reality. :::
:::warning Do not conflate refusal rate with safety A low refusal rate does not mean your AI is safe. A high refusal rate does not mean your AI is safe. What matters is whether the refusals are appropriate. An AI that refuses 0.5% of requests is fine if those are all genuine policy violations. An AI that refuses 5% of requests is a problem if it is blocking legitimate queries. Measure refusal appropriateness by sampling refused requests and classifying whether the refusal was warranted. :::
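Refusal appropriateness can be estimated from a labeled sample of refused requests. The sketch below assumes each sampled refusal has been labeled `"warranted"` or `"unwarranted"` by a reviewer; the label names and function names are hypothetical.

```python
from collections import Counter


def over_refusal_rate(labels: list[str]) -> float:
    """Share of sampled refusals that reviewers judged unwarranted."""
    if not labels:
        raise ValueError("no labeled refusals")
    counts = Counter(labels)
    unexpected = set(counts) - {"warranted", "unwarranted"}
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return counts["unwarranted"] / len(labels)


def blocked_legitimate_share(refusal_rate: float, over_refusal: float) -> float:
    """Estimated share of ALL requests that were legitimate but refused."""
    return refusal_rate * over_refusal


if __name__ == "__main__":
    sample = ["warranted"] * 6 + ["unwarranted"] * 4
    rate = over_refusal_rate(sample)
    # With a 5% refusal rate, 40% unwarranted refusals means roughly
    # 2% of all requests are legitimate queries being blocked.
    print(rate, blocked_legitimate_share(0.05, rate))
```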
Building the Feedback UI
// frontend/components/ResponseFeedback.tsx
import React, { useState } from "react";

interface ResponseFeedbackProps {
  messageId: string;
  onFeedback: (type: "thumbs_up" | "thumbs_down", category?: string) => void;
}

const DOWNVOTE_CATEGORIES = [
  { value: "incorrect", label: "Incorrect or inaccurate" },
  { value: "unhelpful", label: "Did not help with my task" },
  { value: "too_long", label: "Too long or verbose" },
  { value: "too_short", label: "Too brief, needs more detail" },
  { value: "off_topic", label: "Did not address my question" },
  { value: "unsafe", label: "Inappropriate or harmful" },
];

export function ResponseFeedback({ messageId, onFeedback }: ResponseFeedbackProps) {
  const [voted, setVoted] = useState<"up" | "down" | null>(null);
  const [showCategories, setShowCategories] = useState(false);
  const [submitted, setSubmitted] = useState(false);

  const handleThumbsUp = () => {
    setVoted("up");
    setSubmitted(true);
    onFeedback("thumbs_up");
  };

  const handleThumbsDown = () => {
    setVoted("down");
    setShowCategories(true);
  };

  const handleCategory = (category: string) => {
    setShowCategories(false);
    setSubmitted(true);
    onFeedback("thumbs_down", category);
  };

  if (submitted) {
    return (
      <div className="feedback-thanks">
        {voted === "up"
          ? "Thanks for the feedback!"
          : "Thanks - we'll use this to improve."}
      </div>
    );
  }

  return (
    <div className="response-feedback">
      <span className="feedback-label">Was this helpful?</span>
      <button
        onClick={handleThumbsUp}
        className={`feedback-btn ${voted === "up" ? "active" : ""}`}
        aria-label="Thumbs up"
      >
        {/* Thumbs up icon */}
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor">
          <path d="M14 9V5a3 3 0 0 0-3-3l-4 9v11h11.28a2 2 0 0 0 2-1.7l1.38-9a2 2 0 0 0-2-2.3zM7 22H4a2 2 0 0 1-2-2v-7a2 2 0 0 1 2-2h3" />
        </svg>
      </button>
      <button
        onClick={handleThumbsDown}
        className={`feedback-btn ${voted === "down" ? "active" : ""}`}
        aria-label="Thumbs down"
      >
        {/* Thumbs down icon */}
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor">
          <path d="M10 15v4a3 3 0 0 0 3 3l4-9V2H5.72a2 2 0 0 0-2 1.7l-1.38 9a2 2 0 0 0 2 2.3zm7-13h2.67A2.31 2.31 0 0 1 22 4v7a2.31 2.31 0 0 1-2.33 2H17" />
        </svg>
      </button>
      {showCategories && (
        <div className="feedback-categories">
          <p className="categories-prompt">What was the issue?</p>
          {DOWNVOTE_CATEGORIES.map((cat) => (
            <button
              key={cat.value}
              onClick={() => handleCategory(cat.value)}
              className="category-btn"
            >
              {cat.label}
            </button>
          ))}
          <button
            onClick={() => handleCategory("other")}
            className="category-btn category-btn--skip"
          >
            Skip
          </button>
        </div>
      )}
    </div>
  );
}
:::tip Ask for downvote category, not free text Free text feedback from users requires manual review and is rarely actionable at scale. Structured downvote categories (incorrect, unhelpful, too long, off topic) aggregate into charts you can monitor. The most common downvote category tells you exactly where to focus your prompt improvement effort. Only collect free text for a small fraction of downvotes - for example, when users select "other." :::
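On the server side, the structured categories reduce to a simple count. A minimal sketch, assuming downvote events arrive as a list of category strings matching the `value` fields used by the feedback component; the function name is hypothetical.

```python
from collections import Counter


def downvote_breakdown(categories: list[str]) -> list[tuple[str, float]]:
    """Return (category, share_of_downvotes) pairs, most common first."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(category, n / total) for category, n in counts.most_common()]


if __name__ == "__main__":
    events = ["incorrect", "incorrect", "too_long", "incorrect"]
    for category, share in downvote_breakdown(events):
        print(f"{category}: {share:.0%}")
```

The first entry of the result is the category to prioritize when revising the prompt.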
Interview Questions
Q: What metrics would you use to measure the quality of an AI assistant, and how do they relate to each other?
A production quality measurement system uses four signal types. Explicit feedback - thumbs up/down and star ratings - gives the strongest per-interaction signal but only covers 2–5% of interactions (most users do not rate). Implicit behavioral signals - copy-to-clipboard rate, session abandonment rate, follow-up clarification rate, scroll-to-end rate - cover 100% of interactions and are highly correlated with quality even though they are indirect. Automated evaluation through LLM-as-judge provides structured dimension scores (helpfulness, accuracy, conciseness, safety) at scale on a sampled fraction. Business metrics - feature retention, support ticket deflection, task completion rates - are the ground truth but lag by days or weeks. A composite score weighted across explicit (40%), behavioral (30%), and automated (30%) signals gives a stable, actionable overall quality number; business metrics stay outside the composite and are used to validate it. No single metric is sufficient because each has different coverage, latency, and failure modes.
Q: How does LLM-as-judge work and what are its limitations?
LLM-as-judge sends a (system prompt, user query, AI response, optional reference answer) tuple to a separate evaluation model and asks it to score the response on structured dimensions with a required JSON output. The judge is typically a cheaper, faster model - claude-haiku-4-5-20251001 in production - to control cost when evaluating at a 10% sample rate. The core limitations are: (1) Length bias: judges systematically prefer longer responses even when shorter is more appropriate. Counter this with explicit instructions in the judge system prompt. (2) Model affinity: a Claude model judging Claude outputs will favor its own generation style. Use a different model family or apply cross-model calibration. (3) Inconsistency: the same input can receive different scores across runs. Mitigate by running 3× and averaging for critical evaluations. (4) Calibration drift: judge scores only mean something relative to human labels. Build a golden eval dataset of 200–500 human-labeled examples and measure judge-human correlation (target above 0.80) weekly.
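The run-three-times-and-average mitigation can be sketched without committing to a particular judge API. Here `judge` is any callable that returns a score for a (query, response) pair; that signature, and reporting the spread alongside the mean, are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Callable


def averaged_judge_score(
    judge: Callable[[str, str], float],
    query: str,
    response: str,
    runs: int = 3,
) -> tuple[float, float]:
    """Call the judge `runs` times; return (mean score, spread).

    A large spread signals that the judge is unstable on this input
    and the averaged score should be treated with caution.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [judge(query, response) for _ in range(runs)]
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Stand-in judge that returns slightly different scores per call.
    fake_scores = iter([0.7, 0.8, 0.9])
    avg, spread = averaged_judge_score(lambda q, r: next(fake_scores), "q", "r")
    print(round(avg, 3), round(spread, 3))
```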
Q: How do you detect a quality regression after a prompt change or model update?
The detection system has two layers. First, pre-production: run the change against a golden eval dataset of representative (query, ideal response) pairs and compare judge scores before and after. This catches obvious regressions before any user is affected. Second, post-deployment: run a canary or shadow deployment at 5–10% traffic for 24–48 hours, collecting quality metrics for both cohorts. Then run a two-proportion Z-test on the rate metrics (thumbs up rate, copy rate, hallucination rate, clarification rate) and a two-sample test on mean judge score, which is a continuous value rather than a proportion. Alert if any metric worsens by more than 5% with p < 0.05; for hallucination and clarification rates, "worse" means an increase. Critical regressions (>10% drop in accuracy or safety) trigger immediate rollback. Non-critical regressions trigger a rollout pause pending investigation. The key implementation detail is deterministic sampling by message ID hash so the same message is always in the same cohort across time windows.
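The two-proportion Z-test mentioned above can be sketched with the standard library alone. This is a generic textbook version using a pooled standard error and a two-sided p-value; the function name and the example counts are illustrative, not taken from the original regression detector.

```python
import math


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates b - a."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both samples must be non-empty")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both rates are identical at 0% or 100%
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 831 of 1,000 thumbs up before, 798 of 1,000 after.
    z, p = two_proportion_z_test(831, 1000, 798, 1000)
    print(f"z={z:.2f}, p={p:.3f}")
```

With these hypothetical counts the p-value lands slightly above 0.05: a drop of about 4% in thumbs up rate is not yet statistically significant at 1,000 ratings per cohort, which is why sparse explicit feedback often needs a longer collection window.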
Q: How do you handle the sparsity of explicit user feedback?
Sparsity is structural - most users will never click a rating button. The solution is not to force ratings but to use explicit feedback as a high-quality calibration signal while building complementary dense signals. Specifically: (1) Use implicit signals (copy rate, abandonment rate) as dense proxies that cover 100% of interactions. Validate that your implicit signals correlate with explicit ratings by running correlation analysis on the 2–5% that have both. (2) Use LLM-as-judge at 10% sample rate to produce structured quality scores. Calibrate against human labels, not against explicit feedback (which is sparse). (3) Actively solicit feedback at higher-value moments - after the user completes a complex multi-turn task, or when the AI response is unusually long. Contextual prompts get 3–5× the response rate of persistent rating widgets. (4) Weight explicit feedback by user tenure - power users' ratings (50+ interactions/month) are more reliable than first-time users'.
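Point (4), weighting by tenure, might look like the sketch below. The 50-interactions-per-month cutoff comes from the text; the 2× weight for power users and the function name are assumptions chosen purely for illustration.

```python
def tenure_weighted_thumbs_up_rate(
    ratings: list[tuple[bool, int]],
    power_user_threshold: int = 50,   # interactions/month, from the text
    power_user_weight: float = 2.0,   # hypothetical weight
) -> float:
    """Thumbs-up rate where power users' ratings count more.

    Each rating is (is_thumbs_up, rater's interactions in the last month).
    """
    weighted_up = 0.0
    total_weight = 0.0
    for is_up, monthly_interactions in ratings:
        is_power_user = monthly_interactions >= power_user_threshold
        weight = power_user_weight if is_power_user else 1.0
        total_weight += weight
        if is_up:
            weighted_up += weight
    if total_weight == 0:
        raise ValueError("no ratings to aggregate")
    return weighted_up / total_weight


if __name__ == "__main__":
    ratings = [(True, 60), (False, 3), (True, 2)]
    # Unweighted rate is 2/3; the power user's upvote lifts it to 3/4.
    print(tenure_weighted_thumbs_up_rate(ratings))
```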
Q: Your AI product has an overall quality score of 0.78, which looks healthy. What else would you examine?
An aggregate quality score of 0.78 can hide significant problems. First, segment by query category: an assistant averaging 0.78 might score 0.91 on summarization and 0.52 on coding. The coding segment needs immediate attention but is invisible in the aggregate. Second, segment by user tier: if free users score 0.85 and paid enterprise users score 0.65, you have a critical retention risk that aggregate metrics mask. Third, examine trend: is 0.78 stable, improving, or slowly declining? A score that dropped from 0.88 over 90 days is a serious regression even if the current value seems acceptable. Fourth, look at the worst-performing segment's top failure flags from the LLM judge: are they seeing high hallucination rates, off-topic responses, or formatting failures? Each failure mode has a different fix. Fifth, correlate quality score with business metrics: does quality variation explain retention differences between user cohorts? If users who receive high-quality responses retain at 3× the rate of users who receive low-quality ones, quality improvement becomes a revenue-level priority, not just an engineering concern.
Q: Walk me through building a hallucination detection system for a production AI assistant.
Hallucination detection works in layers. The simplest layer is LLM-as-judge flagging: include "hallucination" in the flag list in your judge prompt, and track the hallucination flag rate as a metric. This catches obvious fabrications at scale (10% sample rate). The second layer is claim verification for factual domains: after the AI generates a response, extract specific claims (named entities, dates, statistics, citations) using a structured extraction pass with claude-haiku-4-5-20251001, then verify each claim against your knowledge base or a trusted retrieval system. Flag mismatches. This is expensive - run it on 100% of high-stakes interactions (medical, legal, financial) and 5% otherwise. The third layer is consistency checking: for responses that include citations or source references, verify that the cited document exists and that the AI's claim is supported by the source text. Build a retrieval pipeline that fetches cited sources and runs a brief entailment check. Track hallucination rate per query category and model version. Alert when it exceeds 5% for any significant segment.
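The first half of the third layer, checking that each cited document exists, can be sketched as below. It assumes responses cite sources with a marker of the form `[doc:ID]` and that the knowledge base's document IDs are available as a set; both the citation format and the function name are hypothetical. The entailment check on the source text is a separate step and is not shown.

```python
from __future__ import annotations

import re

# Hypothetical citation format: [doc:some-id]
CITATION_PATTERN = re.compile(r"\[doc:([A-Za-z0-9_\-]+)\]")


def find_unknown_citations(response_text: str, known_doc_ids: set[str]) -> list[str]:
    """Return cited document IDs that do not exist in the knowledge base,
    in order of first appearance and without duplicates."""
    unknown: list[str] = []
    seen: set[str] = set()
    for doc_id in CITATION_PATTERN.findall(response_text):
        if doc_id not in known_doc_ids and doc_id not in seen:
            seen.add(doc_id)
            unknown.append(doc_id)
    return unknown


if __name__ == "__main__":
    text = "See [doc:reg-2021] and [doc:made-up-9] for the amended clause."
    # Any ID returned here is a candidate fabricated citation to flag.
    print(find_unknown_citations(text, {"reg-2021"}))
```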
Summary: The Quality Measurement System
Quality measurement for AI products is an ongoing engineering discipline, not a post-launch checkbox. The key elements of a production system are:
- Multi-signal architecture: combine explicit feedback, implicit behavioral signals, and LLM-as-judge. No single source is sufficient.
- Dense coverage: rely on implicit signals (copy rate, abandonment) as your primary dense signal layer because they cover 100% of interactions without user action.
- Deterministic sampling: hash-based sampling for LLM judge evaluation enables reliable before/after comparisons.
- Regression detection: automated comparison of key metrics between time windows, with statistical significance testing and auto-alerting at defined thresholds.
- Segmented measurement: segment quality by query category, user tier, prompt version, and model. Aggregate scores hide the problems you need to fix.
- Judge calibration: maintain a human-labeled golden eval dataset and measure judge-human correlation weekly. Do not trust judge scores without calibration.
- Composite scoring: a single composite quality score (explicit 40% + behavioral 30% + automated 30%) provides a stable, actionable health signal for engineering and product teams.
The infrastructure cost of this system - LLM judge calls at 10% sample rate - is typically 5–8% of total inference cost. The cost of not having it is building blind and discovering quality problems when users churn.
