RAG Evaluation and RAGAS

The Invisible Degradation

The team shipped a RAG-powered customer support system with an 87% user satisfaction score in beta. Three months into production, a product manager pulled the weekly dashboard and noticed satisfaction had dropped to 71%. Sixteen percentage points in twelve weeks. The team had no idea what had changed.

They had no evaluation metrics running. No automatic tests on a golden question set. No monitoring of retrieval quality over time. They had measured satisfaction at launch, seen a good number, and moved on. The system was a black box - it worked or it did not, and the only signal was lagged user feedback that arrived days after the failures.

Two weeks of investigation later, the team found two independent issues. A document ingestion bug introduced three weeks earlier had corrupted 15% of their chunks - stripping metadata, truncating content at random byte boundaries, and introducing garbled text that would embed poorly and retrieve even more poorly. The system had been serving answers grounded in garbage for three weeks, and nobody had known. Separately, a vocabulary shift in new user queries - the company had launched a new product feature that users talked about using new terminology - meant that retrieval was systematically missing relevant documentation because the new terms did not match the embedding neighborhoods of the existing docs.

Both issues would have been caught within hours by a continuous evaluation pipeline. The document corruption would have shown up immediately as a collapse in context precision - the retrieved chunks would have scored low on relevance because garbled chunks could not be relevant to any query. The vocabulary shift would have shown as a decline in context recall - the system was retrieving contextually correct chunks, but missing the chunks that actually answered the question. Two numbers, two actionable root causes, caught immediately rather than three weeks after the fact.

RAG evaluation is not optional for production systems. It is the difference between deploying a system you can maintain and debug versus deploying a system whose failures are invisible until users leave. This lesson covers the RAGAS framework in production depth: the four core metrics, how each catches a distinct failure mode, how to implement them using Claude as an LLM judge, and how to build a continuous evaluation pipeline that catches regressions automatically.

The mathematics are accessible. The implementation is straightforward. The hard part is the discipline: running evaluation continuously, acting on metric changes, and resisting the temptation to tune metrics instead of fixing the underlying system.


Why RAG Evaluation Is Hard

No Single Correct Answer

In classification tasks, evaluation is simple: prediction matches label or it does not. RAG evaluation is harder because there is no single correct answer. "What are the causes of API timeouts?" has many valid answers of varying completeness. Partial answers are better than empty answers but worse than complete answers. Answers from different retrieved context are all valid but not equivalent. There is no ground truth label to compare against.

Hallucination Is Invisible Without Active Detection

A RAG system can generate a confident, well-structured, plausible-sounding answer that is completely ungrounded in its retrieved context. This is the hallucination problem. Without active verification - checking whether each claim in the generated answer is actually present in the retrieved chunks - hallucination is undetectable from the answer text alone.

Users often cannot detect hallucination either. A response that sounds authoritative about a technical topic will be accepted by users who do not independently know the answer. The user satisfaction signal does not reliably detect hallucination, especially in the short term.

Retrieval Quality and Generation Quality Are Independent Failure Modes

A RAG system can fail in two independent ways: retrieving the wrong documents, or generating a wrong answer from correct documents. These require different fixes - retrieval failures require improving the retrieval pipeline, generation failures require improving the prompt or the LLM. Without separate retrieval metrics and generation metrics, you cannot tell which is failing.

The Coupled-but-Independent Problem

Retrieval quality and generation quality are coupled - poor retrieval causes poor generation - but they can also fail independently. A system can have excellent retrieval (it finds the right documents) but poor generation (it hallucinates beyond what the documents say). Or it can have poor retrieval (it misses key documents) but passable generation (it generates something plausible from what little it has). User satisfaction conflates both failure modes. RAGAS separates them into orthogonal metrics.


Historical Context

RAG evaluation research accelerated alongside RAG deployment. Early practitioners used ad-hoc metrics: ROUGE for surface-level answer similarity, human evaluation for quality, and classical information-retrieval ranking metrics like MAP and NDCG for retrieval quality. These approaches were expensive (human evaluation does not scale), slow (weekly evaluation cycles instead of continuous), and incomplete (ROUGE does not measure grounding).

RAGAS (Es et al., 2023) - "Ragas: Automated Evaluation of Retrieval Augmented Generation" - introduced a framework of complementary metrics designed specifically for RAG evaluation. The key innovation was making most of the evaluation reference-free: of the four metrics covered in this lesson, faithfulness, answer relevance, and context precision can be computed without a ground-truth answer, using an LLM as a judge. Only context recall needs a reference answer. This made continuous production monitoring feasible - you do not need a human to label every production query to detect quality degradation.

The paper demonstrated strong correlation between RAGAS metrics and human evaluation scores across multiple domains, validating the LLM-as-judge approach for RAG quality assessment.

LLM-as-judge as a general evaluation paradigm was validated by Zheng et al. (LMSYS, 2023) in "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," which showed that GPT-4-level models can serve as reliable evaluators for open-ended generation quality, with agreement rates comparable to agreement between human experts.

RAGAS has been widely adopted in production RAG systems as a standard evaluation baseline. The framework continues to evolve - RAGAS 2.0 added noise sensitivity, answer correctness, and context entity recall - but the four core metrics remain the primary diagnostic tools.


The Four RAGAS Metrics

Each metric measures a different failure mode. Together they form a complete diagnostic picture of why a RAG system is failing.

Metric 1: Faithfulness

What it measures: Are all claims in the generated answer actually supported by the retrieved context?

The failure mode it catches: Hallucination. When the LLM generates information that is not present in - or contradicts - the retrieved chunks.

Why it matters: Faithfulness failures are the most dangerous RAG failure mode. A system with low faithfulness is generating confident misinformation. Users cannot detect this without independent verification. In medical, legal, or financial applications, faithfulness failures can cause real harm.

Algorithm:

  1. Extract all factual claims from the generated answer
  2. For each claim, check whether it is supported by any retrieved chunk
  3. Faithfulness = (supported claims) / (total claims)
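
The arithmetic in step 3 can be sketched as follows. This is a minimal illustration rather than the evaluator itself: the claims and their verdicts are hypothetical and hard-coded, standing in for the output of an LLM judge, and `faithfulness_score` is a name chosen for this example.

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of extracted claims that the judge marked as supported."""
    if not claim_verdicts:
        # Assumption: an answer with no extractable claims scores 0.0.
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hypothetical judge verdicts for an answer containing four claims.
verdicts = {
    "The API rate limit is 100 requests per minute": True,
    "Tokens expire after 3600 seconds": True,
    "Timeouts are retried automatically": True,
    "Retries use a 30-second backoff": False,  # not found in any retrieved chunk
}

score = faithfulness_score(list(verdicts.values()))
print(f"faithfulness = {score:.2f}")  # 3 of 4 claims supported
```

One unsupported claim out of four is enough to pull the score into the "significant hallucination" band of the interpretation scale, which is why per-claim verdicts are worth logging alongside the aggregate.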

Score interpretation:

  • 1.0: Every claim in the answer is grounded in retrieved context
  • 0.8-0.99: Mostly grounded, some minor unsupported additions
  • 0.5-0.8: Significant hallucination, answer goes substantially beyond context
  • Below 0.5: Severely unfaithful - answer is largely hallucinated

Reference-free: Yes. Faithfulness evaluation requires only the answer and the retrieved context - no ground truth answer needed.

Metric 2: Answer Relevance

What it measures: Does the generated answer actually address the question that was asked?

The failure mode it catches: Off-topic answers. When the system generates a high-quality, well-grounded response to a subtly different question than the one asked.

Why it matters: A RAG system can be highly faithful (all claims grounded) while producing an irrelevant answer - if it retrieved the wrong documents and generated an accurate description of those documents rather than an answer to the question. Answer relevance catches this semantic mismatch.

Algorithm (counterintuitive but elegant):

  1. Generate N questions from the answer text (using an LLM)
  2. Measure the semantic similarity between each generated question and the original question
  3. Answer Relevance = mean similarity score across N generated questions

The insight: if the answer is relevant to the question, then questions generated from that answer should resemble the original question. If the answer drifts from the question, the generated questions will also drift.
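
The reverse-generation idea can be illustrated with a toy sketch. Two things here are assumptions made purely for the example: the "generated" questions are hard-coded instead of coming from an LLM, and a bag-of-words vector stands in for a real embedding model. Only the scoring step (mean cosine similarity against the original question) mirrors the algorithm above.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer_relevance(question: str, generated_questions: list[str]) -> float:
    """Mean similarity between the original question and questions derived from the answer."""
    if not generated_questions:
        return 0.0
    q_vec = bow_vector(question)
    sims = [cosine(q_vec, bow_vector(g)) for g in generated_questions]
    return sum(sims) / len(sims)


question = "What causes API timeouts?"
# Hypothetical questions an LLM might derive from an on-topic answer...
on_topic = ["What causes API timeouts?", "Why do API requests time out?"]
# ...and from an answer that drifted to a different subject.
off_topic = ["How do I reset my password?", "Where is the billing page?"]

print(answer_relevance(question, on_topic) > answer_relevance(question, off_topic))
```

Word overlap badly underestimates paraphrases ("time out" versus "timeouts"), which is exactly why the real metric relies on semantic embeddings.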

Score interpretation:

  • 1.0: Answer perfectly addresses the question
  • 0.8-0.99: Minor relevance drift, answer mostly addresses the question
  • 0.5-0.8: Partial relevance, answer addresses a related but different question
  • Below 0.5: Answer does not address the question

Reference-free: Yes. Requires only the question and the answer.

Metric 3: Context Precision

What it measures: Of the retrieved chunks, what fraction are actually relevant to the question?

The failure mode it catches: Retrieval noise. When the retriever returns plausible-sounding but actually irrelevant documents. These noisy chunks dilute the signal-to-noise ratio of the context window, making it harder for the LLM to generate accurate answers.

Why it matters: Context window real estate is limited. Noisy chunks take up space that could be used for relevant information. They also confuse the LLM - it may try to reconcile irrelevant information with the question, producing off-topic or hallucinated answers.

Algorithm:

  1. For each retrieved chunk at rank $k$, assign a binary relevance label: is this chunk relevant to the question?
  2. Compute precision at each rank: $P@k = \frac{\text{relevant chunks in top } k}{k}$
  3. Context Precision = average precision (considers rank ordering, rewards precision-at-rank)

The rank-aware version rewards systems that put relevant chunks at the top of the ranked list. A system that retrieves the right chunk but ranks it 10th scores lower than one that ranks it 1st.
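
A small worked example makes the rank sensitivity concrete. It uses the standard average-precision definition, $AP = \frac{1}{R}\sum_{k=1}^{K} P@k \cdot rel(k)$, where $R$ is the number of relevant chunks and $rel(k)$ is 1 when the chunk at rank $k$ is relevant. The relevance labels are hard-coded, hypothetical judge outputs, and the function name is chosen for illustration.

```python
def average_precision(relevance: list[bool]) -> float:
    """AP = (1/R) * sum over ranks k of P@k * rel(k), with R = number of relevant chunks."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # P@k, counted only at relevant ranks
    return precision_sum / num_relevant


ranked_first = [True] + [False] * 9  # the one relevant chunk sits at rank 1
ranked_tenth = [False] * 9 + [True]  # the same chunk sits at rank 10
interleaved = [True, False, True]    # relevant chunks at ranks 1 and 3

for labels in (ranked_first, ranked_tenth, interleaved):
    print(round(average_precision(labels), 4))
```

The same single relevant chunk yields a tenfold difference in score depending only on where it is ranked, so a drop in this metric can signal a reranking problem even when the right documents are still being retrieved.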

Score interpretation:

  • 1.0: Every relevant chunk is ranked above every irrelevant one (ideally, all retrieved chunks are relevant)
  • 0.7-0.99: Most chunks relevant, some noise
  • 0.5-0.7: Half the retrieved context is noise
  • Below 0.5: More noise than signal - retrieval is fundamentally broken

Reference-free: Yes. Requires only the question and the retrieved chunks.

Metric 4: Context Recall

What it measures: Did the retrieval system find all the information necessary to answer the question?

The failure mode it catches: Retrieval misses. When the relevant documents exist in the corpus but the retrieval system failed to find them. The generator cannot produce a complete answer from incomplete context, no matter how good it is.

Why it matters: Context recall measures whether your retrieval is finding what matters. High context recall means you are not leaving relevant information on the table. Low context recall means the answer will be incomplete even with a perfect generator.

Algorithm (requires ground truth answer):

  1. Extract factual statements from the ground truth answer
  2. For each statement, check whether it is supported by any retrieved chunk
  3. Context Recall = (statements supported by retrieved context) / (total statements in ground truth)
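
As a sketch of step 3, the snippet below computes recall from per-statement verdicts and also returns the statements that were not found, since those are what point to the missing documents. The statements and verdicts are hypothetical stand-ins for LLM-judge output, and `context_recall` is an illustrative helper, not a library function.

```python
from typing import Optional


def context_recall(statement_supported: dict[str, bool]) -> tuple[Optional[float], list[str]]:
    """Return (recall, ground-truth statements missing from the retrieved context)."""
    if not statement_supported:
        return None, []  # no reference statements, so nothing to measure against
    missing = [s for s, ok in statement_supported.items() if not ok]
    recall = (len(statement_supported) - len(missing)) / len(statement_supported)
    return recall, missing


# Hypothetical verdicts: was each ground-truth statement found in any retrieved chunk?
verdicts = {
    "Timeouts occur when the upstream service exceeds 30 seconds": True,
    "The client retries up to three times": True,
    "Batch endpoints have a separate 120-second limit": False,
}

recall, missing = context_recall(verdicts)
print(f"context recall = {recall:.2f}")
print("missing:", missing)
```

Note the direction of the check: faithfulness tests answer claims against the context, while context recall tests ground-truth statements against the context, so the two can move independently.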

Score interpretation:

  • 1.0: All information needed to answer the question was retrieved
  • 0.8-0.99: Nearly all information retrieved, minor gaps
  • 0.5-0.8: Significant information missing from retrieved context
  • Below 0.5: Retrieval is missing most of what matters

Requires ground truth: Yes. Context recall requires a reference answer to measure against. This limits it to your evaluation dataset rather than production monitoring.


Metric Relationships: Diagnosing Failure Patterns

| Faithfulness | Answer Relevance | Context Precision | Context Recall | Diagnosis |
|---|---|---|---|---|
| Low | Any | Any | Any | LLM is hallucinating beyond retrieved context - fix prompt, model, or post-processing |
| High | Low | Any | Any | Answer is grounded but off-topic - retrieval found wrong domain, or question is ambiguous |
| High | High | Low | High | Retrieval adds noise but still finds what matters - tighten reranking, reduce K |
| High | High | High | Low | Retrieval misses key documents - fix embedding, chunking, or indexing |
| High | High | Low | Low | Retrieval fundamentally broken - noisy and missing simultaneously |
| All high | - | - | - | Healthy system - monitor for drift |
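
The diagnosis table reads naturally as a top-to-bottom decision procedure, which the sketch below encodes. The cut-offs for "high" are assumptions for illustration (they reuse the values of `METRIC_THRESHOLDS` from the production code in this lesson) and should be tuned per system; `diagnose` is a hypothetical helper name.

```python
# Illustrative cut-offs for what counts as "high"; assumptions to tune per system.
HIGH_CUTOFFS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}


def diagnose(scores: dict[str, float]) -> str:
    """Map the four metric scores to the first matching row of the diagnosis table."""
    high = {metric: scores[metric] >= cut for metric, cut in HIGH_CUTOFFS.items()}
    if not high["faithfulness"]:
        return "hallucination: fix prompt, model, or post-processing"
    if not high["answer_relevance"]:
        return "grounded but off-topic: wrong domain retrieved or ambiguous question"
    if high["context_precision"] and high["context_recall"]:
        return "healthy: monitor for drift"
    if high["context_recall"]:
        return "retrieval noise: tighten reranking, reduce K"
    if high["context_precision"]:
        return "retrieval misses: fix embedding, chunking, or indexing"
    return "retrieval broken: noisy and missing simultaneously"


print(diagnose({
    "faithfulness": 0.93,
    "answer_relevance": 0.88,
    "context_precision": 0.81,
    "context_recall": 0.52,
}))
```

Checking faithfulness first matches the table's ordering: when the generator is hallucinating, the retrieval metrics say little about what the user actually saw.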

Production Code

import anthropic
import asyncio
import json
import re
import statistics
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()


# ─────────────────────────────────────────────
# Data Structures
# ─────────────────────────────────────────────

@dataclass
class RAGASDatapoint:
    """A single RAG evaluation data point."""
    question: str
    answer: str
    context: list[str]  # Retrieved chunks as raw text strings
    ground_truth: Optional[str] = None  # Reference answer (for context recall)
    metadata: dict = field(default_factory=dict)


@dataclass
class RAGASMetrics:
    """Computed RAGAS metrics for a single datapoint."""
    faithfulness: Optional[float] = None  # 0-1, reference-free
    answer_relevance: Optional[float] = None  # 0-1, reference-free
    context_precision: Optional[float] = None  # 0-1, reference-free
    context_recall: Optional[float] = None  # 0-1, requires ground_truth

    def overall_score(self) -> float:
        """Mean of all available metrics."""
        available = [v for v in [
            self.faithfulness, self.answer_relevance,
            self.context_precision, self.context_recall
        ] if v is not None]
        return statistics.mean(available) if available else 0.0

    def to_dict(self) -> dict:
        return {
            "faithfulness": self.faithfulness,
            "answer_relevance": self.answer_relevance,
            "context_precision": self.context_precision,
            "context_recall": self.context_recall,
            "overall": self.overall_score(),
        }


@dataclass
class EvaluationReport:
    """Aggregated evaluation report across a batch of datapoints."""
    num_datapoints: int
    mean_faithfulness: float
    mean_answer_relevance: float
    mean_context_precision: float
    mean_context_recall: float
    mean_overall: float
    per_metric_std: dict
    below_threshold: dict  # metric → count of datapoints below threshold
    timestamp: str = ""

    def to_dict(self) -> dict:
        return {
            "n": self.num_datapoints,
            "faithfulness": round(self.mean_faithfulness, 4),
            "answer_relevance": round(self.mean_answer_relevance, 4),
            "context_precision": round(self.mean_context_precision, 4),
            "context_recall": round(self.mean_context_recall, 4),
            "overall": round(self.mean_overall, 4),
            "std": {k: round(v, 4) for k, v in self.per_metric_std.items()},
            "below_threshold": self.below_threshold,
        }


@dataclass
class RegressionAlert:
    """Alert generated when a metric drops significantly from baseline."""
    metric: str
    baseline_value: float
    current_value: float
    drop_magnitude: float
    is_significant: bool
    action_required: str


# ─────────────────────────────────────────────
# Metric Thresholds for Alerting
# ─────────────────────────────────────────────

METRIC_THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

REGRESSION_THRESHOLDS = {
    "faithfulness": 0.05,  # Alert if drops by 5 points
    "answer_relevance": 0.05,
    "context_precision": 0.08,
    "context_recall": 0.08,
}


# ─────────────────────────────────────────────
# Faithfulness Evaluator
# ─────────────────────────────────────────────

class FaithfulnessEvaluator:
    """
    Measures whether all claims in the answer are supported by retrieved context.

    Algorithm:
    1. Extract all factual claims from the answer (haiku)
    2. For each claim, verify whether it appears in the retrieved context (haiku)
    3. Score = supported_claims / total_claims
    """

    _EXTRACT_CLAIMS_PROMPT = """Extract all distinct factual claims from this answer.

A claim is a specific factual assertion that could be independently verified.
Return each claim as a single concise sentence.
Ignore claims about what the answer "does not know" or hedging language.

Answer: {answer}

Return ONLY a JSON array of claim strings. No preamble, no numbering.
Example: ["The API rate limit is 100 requests per minute", "Tokens expire after 3600 seconds"]

JSON array:"""

    _VERIFY_CLAIM_PROMPT = """Does the following retrieved context support this specific claim?

Claim: {claim}

Retrieved context:
{context}

Respond with YES if the context explicitly supports this claim.
Respond with NO if the context does not contain information supporting this claim.

Respond with exactly one word: YES or NO"""

    def extract_claims(self, answer: str) -> list[str]:
        """Extract factual claims from the answer using claude-haiku-4-5-20251001."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{
                "role": "user",
                "content": self._EXTRACT_CLAIMS_PROMPT.format(answer=answer)
            }]
        )
        raw = message.content[0].text.strip()
        try:
            # Greedy match: a ']' inside a claim must not cut the array short
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            return json.loads(match.group() if match else raw)
        except (json.JSONDecodeError, AttributeError):
            # Fallback: split by newline, clean up
            lines = [line.strip().strip('"').strip('-').strip() for line in raw.split('\n') if line.strip()]
            return [line for line in lines if len(line) > 10]

    def verify_claim(self, claim: str, context: list[str]) -> bool:
        """Check whether a claim is supported by any retrieved chunk."""
        context_text = "\n\n---\n\n".join(context[:5])  # Limit to top 5 chunks
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._VERIFY_CLAIM_PROMPT.format(
                    claim=claim,
                    context=context_text[:2000],
                )
            }]
        )
        return message.content[0].text.strip().upper().startswith("YES")

    async def verify_claim_async(self, claim: str, context: list[str]) -> bool:
        """Async version for parallel claim verification."""
        context_text = "\n\n---\n\n".join(context[:5])
        message = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._VERIFY_CLAIM_PROMPT.format(
                    claim=claim,
                    context=context_text[:2000],
                )
            }]
        )
        return message.content[0].text.strip().upper().startswith("YES")

    async def score_async(self, datapoint: RAGASDatapoint) -> float:
        """
        Score faithfulness asynchronously - verify all claims in parallel.
        Returns 0.0 if no claims extracted (defensive default).
        """
        # extract_claims uses the synchronous client; run it in a worker
        # thread so it does not block the event loop
        claims = await asyncio.to_thread(self.extract_claims, datapoint.answer)
        if not claims:
            return 0.0

        # Verify all claims in parallel
        tasks = [
            self.verify_claim_async(claim, datapoint.context)
            for claim in claims
        ]
        results = await asyncio.gather(*tasks)

        supported = sum(1 for r in results if r)
        return supported / len(claims)

    def score(self, datapoint: RAGASDatapoint) -> float:
        """Synchronous faithfulness score (for single datapoints)."""
        return asyncio.run(self.score_async(datapoint))


# ─────────────────────────────────────────────
# Answer Relevance Evaluator
# ─────────────────────────────────────────────

class AnswerRelevanceEvaluator:
    """
    Measures whether the answer actually addresses the question.

    Algorithm (reverse generation):
    1. Generate N questions from the answer text
    2. Compute semantic similarity between each generated question and original question
    3. Answer Relevance = mean similarity score

    Insight: if the answer is relevant, questions generated from it
    should closely resemble the original question. Drift in the generated
    questions signals drift in the answer from the question.
    """

    _GENERATE_QUESTIONS_PROMPT = """Given this answer, generate {n} different questions that this
answer could plausibly be answering.

The questions should be diverse - different phrasings, different angles -
but all questions should be legitimately answerable by the given answer.

Return ONLY a JSON array of question strings.

Answer: {answer}

JSON array of {n} questions:"""

    def generate_questions_from_answer(self, answer: str, n: int = 3) -> list[str]:
        """Generate n questions from the answer - the reverse generation step."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": self._GENERATE_QUESTIONS_PROMPT.format(
                    answer=answer[:1500],
                    n=n,
                )
            }]
        )
        raw = message.content[0].text.strip()
        try:
            # Greedy match: a ']' inside a question must not cut the array short
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            questions = json.loads(match.group() if match else raw)
            return questions[:n]
        except (json.JSONDecodeError, AttributeError):
            lines = [line.strip().strip('"').strip('-').strip() for line in raw.split('\n') if line.strip()]
            return [line for line in lines if '?' in line or len(line) > 15][:n]

    def get_embedding(self, text: str) -> np.ndarray:
        """
        Get embedding for a text string.
        In production: call your embedding model (OpenAI, Cohere, etc.)
        Here: use a deterministic mock based on word overlap for demonstration.
        """
        # Mock: deterministic embedding from character-level hashing
        # Replace with: openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        words = set(text.lower().split())
        rng = np.random.RandomState(sum(ord(c) for c in text[:50]) % (2**31))
        base = rng.randn(1536)
        # Add word-based signal for realistic similarity behavior
        word_signal = np.zeros(1536)
        # sorted() keeps the 20-word selection stable across runs;
        # set iteration order varies with string hash randomization
        for w in sorted(words)[:20]:
            word_hash = sum(ord(c) for c in w) % 1536
            word_signal[word_hash] += 0.5
        embedding = base + word_signal
        # Normalize
        norm = np.linalg.norm(embedding)
        return (embedding / norm).astype(np.float32) if norm > 0 else embedding.astype(np.float32)

    def cosine_similarity(self, v1: np.ndarray, v2: np.ndarray) -> float:
        """Cosine similarity between two normalized vectors."""
        return float(np.dot(v1, v2))

    def score(self, datapoint: RAGASDatapoint, n_questions: int = 3) -> float:
        """
        Compute answer relevance score.

        1. Generate n questions from the answer
        2. Embed original question and each generated question
        3. Return mean cosine similarity
        """
        generated_questions = self.generate_questions_from_answer(datapoint.answer, n=n_questions)
        if not generated_questions:
            return 0.0

        original_embedding = self.get_embedding(datapoint.question)
        similarities = []

        for gen_q in generated_questions:
            gen_embedding = self.get_embedding(gen_q)
            sim = self.cosine_similarity(original_embedding, gen_embedding)
            similarities.append(sim)

        # Convert from cosine similarity range [-1,1] to [0,1]
        mean_sim = statistics.mean(similarities)
        return max(0.0, min(1.0, (mean_sim + 1) / 2))


# ─────────────────────────────────────────────
# Context Precision Evaluator
# ─────────────────────────────────────────────

class ContextPrecisionEvaluator:
    """
    Measures the fraction of retrieved chunks that are relevant to the question.

    Algorithm:
    1. For each retrieved chunk at rank k, score binary relevance (relevant/not)
    2. Compute average precision (AP) - rank-aware precision metric
    3. Context Precision = AP score

    AP rewards systems that place relevant chunks at the top of the ranked list.
    """

    _RELEVANCE_PROMPT = """Is this retrieved passage relevant to answering the question?

Question: {question}
Passage: {passage}

A passage is relevant if it contains information that helps answer the question.
A passage is not relevant if it discusses related topics but does not help answer this specific question.

Respond with exactly one word: RELEVANT or NOT_RELEVANT"""

    def score_chunk_relevance(self, question: str, chunk: str) -> bool:
        """Binary relevance classification for a single chunk."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._RELEVANCE_PROMPT.format(
                    question=question,
                    passage=chunk[:800],
                )
            }]
        )
        # Prefix check: a substring test would also match "NOT_RELEVANT"
        return message.content[0].text.strip().upper().startswith("RELEVANT")

    async def score_chunk_relevance_async(self, question: str, chunk: str) -> bool:
        """Async version for parallel evaluation."""
        message = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._RELEVANCE_PROMPT.format(
                    question=question,
                    passage=chunk[:800],
                )
            }]
        )
        # Prefix check: a substring test would also match "NOT_RELEVANT"
        return message.content[0].text.strip().upper().startswith("RELEVANT")

    def compute_average_precision(self, relevance_labels: list[bool]) -> float:
        """
        Compute Average Precision from binary relevance labels.

        AP = (1/R) * sum_{k=1}^{K} P@k * rel(k)

        where R is total relevant chunks, P@k is precision at rank k,
        and rel(k) is 1 if chunk at rank k is relevant.
        """
        if not relevance_labels:
            return 0.0

        num_relevant = sum(relevance_labels)
        if num_relevant == 0:
            return 0.0

        precision_at_k_sum = 0.0
        running_relevant = 0

        for k, is_relevant in enumerate(relevance_labels, start=1):
            if is_relevant:
                running_relevant += 1
                precision_at_k = running_relevant / k
                precision_at_k_sum += precision_at_k

        return precision_at_k_sum / num_relevant

    async def score_async(self, datapoint: RAGASDatapoint) -> float:
        """Score context precision asynchronously - evaluate all chunks in parallel."""
        if not datapoint.context:
            return 0.0

        tasks = [
            self.score_chunk_relevance_async(datapoint.question, chunk)
            for chunk in datapoint.context
        ]
        relevance_labels = list(await asyncio.gather(*tasks))
        return self.compute_average_precision(relevance_labels)

    def score(self, datapoint: RAGASDatapoint) -> float:
        """Synchronous context precision score."""
        return asyncio.run(self.score_async(datapoint))


# ─────────────────────────────────────────────
# Context Recall Evaluator
# ─────────────────────────────────────────────

class ContextRecallEvaluator:
    """
    Measures whether the retrieval found all information needed to answer the question.

    Requires ground_truth in the RAGASDatapoint.

    Algorithm:
    1. Extract factual statements from the ground truth answer
    2. For each statement, check whether it appears in any retrieved chunk
    3. Context Recall = supported_statements / total_statements
    """

    _EXTRACT_STATEMENTS_PROMPT = """Extract all distinct factual statements from this reference answer.

Each statement should be a single, specific, independently verifiable fact.
Keep statements concise and precise.

Reference answer: {ground_truth}

Return ONLY a JSON array of statement strings.
Example: ["The rate limit is 100 requests per minute", "Tokens expire after 3600 seconds"]

JSON array:"""

    _CHECK_STATEMENT_PROMPT = """Is this statement supported by the retrieved context?

Statement: {statement}

Retrieved context:
{context}

Respond YES if the retrieved context explicitly contains information supporting this statement.
Respond NO if this statement's information is absent from the retrieved context.

Respond with exactly one word: YES or NO"""

    def extract_ground_truth_statements(self, ground_truth: str) -> list[str]:
        """Extract factual statements from the ground truth answer."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{
                "role": "user",
                "content": self._EXTRACT_STATEMENTS_PROMPT.format(ground_truth=ground_truth)
            }]
        )
        raw = message.content[0].text.strip()
        try:
            # Greedy match: a ']' inside a statement must not cut the array short
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            return json.loads(match.group() if match else raw)
        except (json.JSONDecodeError, AttributeError):
            lines = [line.strip().strip('"').strip('-').strip() for line in raw.split('\n') if line.strip()]
            return [line for line in lines if len(line) > 10]

    async def check_statement_async(self, statement: str, context: list[str]) -> bool:
        """Check whether a statement is supported by the retrieved context."""
        context_text = "\n\n---\n\n".join(context[:6])
        message = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._CHECK_STATEMENT_PROMPT.format(
                    statement=statement,
                    context=context_text[:2000],
                )
            }]
        )
        return message.content[0].text.strip().upper().startswith("YES")

    async def score_async(self, datapoint: RAGASDatapoint) -> Optional[float]:
        """Score context recall asynchronously. Returns None if no ground truth."""
        if not datapoint.ground_truth:
            return None

        # The extraction step uses the synchronous client; run it in a
        # worker thread so it does not block the event loop
        statements = await asyncio.to_thread(
            self.extract_ground_truth_statements, datapoint.ground_truth
        )
        if not statements:
            return None

        tasks = [
            self.check_statement_async(stmt, datapoint.context)
            for stmt in statements
        ]
        results = list(await asyncio.gather(*tasks))
        supported = sum(1 for r in results if r)
        return supported / len(statements)

    def score(self, datapoint: RAGASDatapoint) -> Optional[float]:
        """Synchronous context recall score."""
        return asyncio.run(self.score_async(datapoint))


# ─────────────────────────────────────────────
# RAGAS Evaluator: Combines All Metrics
# ─────────────────────────────────────────────

class RAGASEvaluator:
"""
Orchestrates all four RAGAS metrics for comprehensive RAG evaluation.

Runs all metrics in parallel where possible. Uses haiku for all
sub-evaluations to minimize cost. Reference-free metrics run without
ground truth; context recall runs only when ground truth is available.
"""

def __init__(self):
self.faithfulness = FaithfulnessEvaluator()
self.answer_relevance = AnswerRelevanceEvaluator()
self.context_precision = ContextPrecisionEvaluator()
self.context_recall = ContextRecallEvaluator()

async def evaluate_async(self, datapoint: RAGASDatapoint) -> RAGASMetrics:
"""Evaluate a single datapoint - run independent metrics in parallel."""
# Reference-free metrics can run in parallel
faith_task = self.faithfulness.score_async(datapoint)
prec_task = self.context_precision.score_async(datapoint)
recall_task = self.context_recall.score_async(datapoint)

# Answer relevance uses synchronous embedding (not async)
# Run it as a coroutine to fit the gather pattern
async def answer_relevance_wrapper():
return self.answer_relevance.score(datapoint)

faith_score, prec_score, recall_score, rel_score = await asyncio.gather(
faith_task,
prec_task,
recall_task,
answer_relevance_wrapper(),
)

return RAGASMetrics(
faithfulness=faith_score,
answer_relevance=rel_score,
context_precision=prec_score,
context_recall=recall_score,
)

def evaluate(self, datapoint: RAGASDatapoint) -> RAGASMetrics:
"""Synchronous evaluation for a single datapoint."""
return asyncio.run(self.evaluate_async(datapoint))

    async def evaluate_batch_async(
        self,
        datapoints: list[RAGASDatapoint],
        max_concurrent: int = 5,
    ) -> list[RAGASMetrics]:
        """
        Evaluate a batch of datapoints with controlled concurrency.

        max_concurrent limits simultaneous API calls to avoid rate limiting.
        """
        semaphore = asyncio.Semaphore(max_concurrent)

        async def eval_with_semaphore(dp: RAGASDatapoint) -> RAGASMetrics:
            async with semaphore:
                return await self.evaluate_async(dp)

        tasks = [eval_with_semaphore(dp) for dp in datapoints]
        return list(await asyncio.gather(*tasks))

    def evaluate_batch(
        self,
        datapoints: list[RAGASDatapoint],
        max_concurrent: int = 5,
    ) -> list[RAGASMetrics]:
        """Synchronous batch evaluation."""
        return asyncio.run(self.evaluate_batch_async(datapoints, max_concurrent))

    def aggregate_report(
        self,
        results: list[RAGASMetrics],
        datapoints: list[RAGASDatapoint],
    ) -> EvaluationReport:
        """
        Aggregate individual metric scores into a summary report.
        Computes means, standard deviations, and threshold violation counts.
        """
        from datetime import datetime, timezone

        # Metrics that could not be computed are None and are excluded per metric
        faith_scores = [r.faithfulness for r in results if r.faithfulness is not None]
        rel_scores = [r.answer_relevance for r in results if r.answer_relevance is not None]
        prec_scores = [r.context_precision for r in results if r.context_precision is not None]
        recall_scores = [r.context_recall for r in results if r.context_recall is not None]
        overall_scores = [r.overall_score() for r in results]

        def safe_mean(scores: list[float]) -> float:
            return statistics.mean(scores) if scores else 0.0

        def safe_std(scores: list[float]) -> float:
            return statistics.stdev(scores) if len(scores) > 1 else 0.0

        # Build the lookup once rather than on every loop iteration
        scores_map = {
            "faithfulness": faith_scores,
            "answer_relevance": rel_scores,
            "context_precision": prec_scores,
            "context_recall": recall_scores,
        }
        below_threshold = {
            metric: sum(1 for s in scores_map.get(metric, []) if s < threshold)
            for metric, threshold in METRIC_THRESHOLDS.items()
        }

        return EvaluationReport(
            num_datapoints=len(results),
            mean_faithfulness=safe_mean(faith_scores),
            mean_answer_relevance=safe_mean(rel_scores),
            mean_context_precision=safe_mean(prec_scores),
            mean_context_recall=safe_mean(recall_scores),
            mean_overall=safe_mean(overall_scores),
            per_metric_std={
                "faithfulness": safe_std(faith_scores),
                "answer_relevance": safe_std(rel_scores),
                "context_precision": safe_std(prec_scores),
                "context_recall": safe_std(recall_scores),
            },
            below_threshold=below_threshold,
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
            timestamp=datetime.now(timezone.utc).isoformat(),
        )


# ─────────────────────────────────────────────
# Test Set Generator
# ─────────────────────────────────────────────

class TestSetGenerator:
    """
    Generates a diverse evaluation test set from the document corpus.

    Creates questions of different types to ensure comprehensive coverage:
    - Factual: specific, directly answerable from a single chunk
    - Multi-hop: requires combining information from multiple chunks
    - Abstractive: requires synthesizing across the full topic
    - Comparative: compares two concepts or options
    """

    _GENERATE_QUESTIONS_PROMPT = """Given this document passage, generate {n} high-quality evaluation questions.

Generate a mix of question types:
- FACTUAL: asks for a specific fact directly stated in the passage
- ABSTRACTIVE: requires understanding and synthesis, not just copying
- MULTI_HOP: requires information from multiple parts of the passage

For each question, also provide the ground_truth answer from the passage.

Return a JSON array where each element is:
{{"question": "...", "ground_truth": "...", "type": "FACTUAL|ABSTRACTIVE|MULTI_HOP"}}

Passage: {passage}

JSON array of {n} questions with answers:"""

    def generate_questions(
        self,
        corpus_chunks: list[str],
        n_per_chunk: int = 3,
        max_chunks: int = 50,
    ) -> list[RAGASDatapoint]:
        """
        Generate evaluation questions from corpus chunks.

        For a production test set: sample diverse chunks to cover the
        full corpus, not just the most common topics.
        """
        test_datapoints = []
        selected_chunks = corpus_chunks[:max_chunks]

        for chunk in selected_chunks:
            if len(chunk) < 100:
                continue  # Skip very short chunks

            message = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=800,
                messages=[{
                    "role": "user",
                    "content": self._GENERATE_QUESTIONS_PROMPT.format(
                        passage=chunk[:1500],
                        n=n_per_chunk,
                    )
                }]
            )

            raw = message.content[0].text.strip()
            try:
                # Greedy match from the first "[" to the last "]" so that a "]"
                # inside a question or answer does not truncate the array
                match = re.search(r'\[.*\]', raw, re.DOTALL)
                questions_data = json.loads(match.group() if match else raw)
            except json.JSONDecodeError:
                continue  # Unparseable model output: skip this chunk
            if not isinstance(questions_data, list):
                continue  # Expected a JSON array of question objects

            for item in questions_data:
                if isinstance(item, dict) and "question" in item:
                    datapoint = RAGASDatapoint(
                        question=item.get("question", ""),
                        answer="",  # To be filled by your RAG system
                        context=[],  # To be filled by retrieval
                        ground_truth=item.get("ground_truth"),
                        metadata={
                            "question_type": item.get("type", "UNKNOWN"),
                            "source_chunk": chunk[:100],
                        }
                    )
                    test_datapoints.append(datapoint)

        return test_datapoints

    def generate_from_qa_pairs(
        self,
        qa_pairs: list[dict],
        rag_system_fn,
    ) -> list[RAGASDatapoint]:
        """
        Create RAGASDatapoints from (question, ground_truth) pairs by running
        your RAG system to generate answers and retrieve context.

        Args:
            qa_pairs: list of {"question": "...", "ground_truth": "..."} dicts
            rag_system_fn: callable that takes question -> (answer, context_chunks)
        """
        datapoints = []
        for pair in qa_pairs:
            question = pair["question"]
            ground_truth = pair.get("ground_truth")

            answer, context = rag_system_fn(question)
            datapoints.append(RAGASDatapoint(
                question=question,
                answer=answer,
                context=context,
                ground_truth=ground_truth,
            ))

        return datapoints


# ─────────────────────────────────────────────
# Continuous Evaluation Pipeline
# ─────────────────────────────────────────────

class ContinuousEvalPipeline:
    """
    Production continuous evaluation pipeline.

    Runs RAGAS evaluation on a sampled subset of production queries
    and on a fixed golden test set. Detects regressions by comparing
    current metrics to a stored baseline.

    Deployment pattern:
    - Run on a schedule (hourly/daily)
    - Store results in a time-series database
    - Alert on metric drops exceeding REGRESSION_THRESHOLDS
    - Automatically create tickets for below-threshold metrics
    """

    def __init__(self, rag_system_fn=None, baseline: Optional[EvaluationReport] = None):
        """
        Args:
            rag_system_fn: callable(question: str) -> (answer: str, context: list[str])
            baseline: stored baseline report for regression comparison
        """
        self.evaluator = RAGASEvaluator()
        self.rag_system_fn = rag_system_fn
        self.baseline = baseline
        self._production_query_log: list[dict] = []  # Mock production log

    def _populate_mock_query_log(self):
        """Mock production query log for demonstration."""
        self._production_query_log = [
            {
                "question": "What causes HTTP 504 gateway timeout errors?",
                "answer": "HTTP 504 Gateway Timeout errors occur when the upstream server does not respond within the configured timeout window. Common causes include high server load, slow database queries blocking request processing, and network congestion between services. The default timeout threshold is typically 30 seconds.",
                "context": [
                    "Gateway timeout errors (HTTP 504) occur when the upstream server fails to respond within the configured timeout window. Common causes include high server load, slow database queries, and network congestion.",
                    "Request timeout thresholds can be configured using the timeout_seconds parameter. For high-load scenarios, consider increasing to 60-120 seconds.",
                ],
                "ground_truth": "HTTP 504 Gateway Timeout occurs when the upstream server fails to respond within the timeout window. Causes include high server load, slow database queries, and network congestion. Default timeout is 30 seconds.",
            },
            {
                "question": "How do I configure connection pooling?",
                "answer": "Connection pool settings are configured via the max_pool_size parameter, which defaults to 10. Under high load, when all connections are active, new requests queue. You should monitor pool_active and pool_idle metrics to detect pool saturation, and increase max_pool_size proportional to your concurrent request rate.",
                "context": [
                    "Connection pool exhaustion is a frequent cause of slowdowns under load. Configure max_connections appropriately for your concurrency level. Monitor pool_active and pool_idle metrics to detect pool starvation.",
                ],
                "ground_truth": "Configure connection pooling via max_pool_size (default 10). Monitor pool_active and pool_idle metrics. Increase max_pool_size proportional to concurrent request rate.",
            },
            {
                "question": "What is the JWT token expiration time?",
                "answer": "JWT tokens expire after 3600 seconds (1 hour). Refresh tokens remain valid for 30 days. When a token expires, the API returns HTTP 401 Unauthorized. Implement automatic token refresh in your client to avoid authentication errors.",
                "context": [
                    "API authentication uses Bearer token scheme with JWT. Tokens expire after 3600 seconds. Refresh tokens are valid for 30 days. When a token expires, the API returns HTTP 401 Unauthorized.",
                ],
                "ground_truth": "JWT tokens expire after 3600 seconds. Refresh tokens are valid for 30 days. Expired tokens return HTTP 401.",
            },
        ]

    def sample_production_queries(self, n: int = 50) -> list[RAGASDatapoint]:
        """
        Sample n queries from the production query log.
        In production: pull from your request logging system.
        """
        self._populate_mock_query_log()

        datapoints = []
        for entry in self._production_query_log[:n]:
            datapoints.append(RAGASDatapoint(
                question=entry["question"],
                answer=entry["answer"],
                context=entry["context"],
                ground_truth=entry.get("ground_truth"),
                metadata={"source": "production_sample"},
            ))

        return datapoints

    def run_evaluation(
        self,
        datapoints: Optional[list[RAGASDatapoint]] = None,
        n_sample: int = 50,
    ) -> EvaluationReport:
        """
        Run RAGAS evaluation on provided datapoints or sampled production queries.
        """
        if datapoints is None:
            datapoints = self.sample_production_queries(n=n_sample)

        print(f"Evaluating {len(datapoints)} datapoints...")
        metrics_list = self.evaluator.evaluate_batch(datapoints, max_concurrent=5)
        report = self.evaluator.aggregate_report(metrics_list, datapoints)

        print("Evaluation complete:")
        print(f"  Faithfulness:      {report.mean_faithfulness:.3f}")
        print(f"  Answer Relevance:  {report.mean_answer_relevance:.3f}")
        print(f"  Context Precision: {report.mean_context_precision:.3f}")
        print(f"  Context Recall:    {report.mean_context_recall:.3f}")
        print(f"  Overall:           {report.mean_overall:.3f}")

        return report

    def detect_regression(
        self,
        current: EvaluationReport,
        baseline: Optional[EvaluationReport] = None,
    ) -> list[RegressionAlert]:
        """
        Compare current metrics to baseline. Generate alerts for drops.

        Uses an absolute threshold for detection - if a metric drops by more
        than REGRESSION_THRESHOLDS[metric], the alert is marked significant.

        Note: is_significant means "exceeds the configured threshold"; no
        statistical test is run here. Choose thresholds that sit above the
        sampling noise expected for your sample size and observed standard
        deviation.
        """
        baseline = baseline or self.baseline
        if not baseline:
            return []

        # Suggested first investigation step for each metric
        action_map = {
            "faithfulness": "Check for prompt changes or model drift causing hallucination",
            "answer_relevance": "Check for query distribution shift or retrieval returning wrong-domain docs",
            "context_precision": "Check for document ingestion issues, corpus corruption, or embedding degradation",
            "context_recall": "Check for new vocabulary in user queries not covered by existing document terms",
        }

        alerts = []
        metric_pairs = [
            ("faithfulness", current.mean_faithfulness, baseline.mean_faithfulness),
            ("answer_relevance", current.mean_answer_relevance, baseline.mean_answer_relevance),
            ("context_precision", current.mean_context_precision, baseline.mean_context_precision),
            ("context_recall", current.mean_context_recall, baseline.mean_context_recall),
        ]

        for metric, current_val, baseline_val in metric_pairs:
            drop = baseline_val - current_val
            threshold = REGRESSION_THRESHOLDS.get(metric, 0.05)
            is_significant = drop > threshold

            if drop > 0:  # Any drop is worth recording
                alerts.append(RegressionAlert(
                    metric=metric,
                    baseline_value=baseline_val,
                    current_value=current_val,
                    drop_magnitude=drop,
                    is_significant=is_significant,
                    action_required=action_map.get(metric, "Investigate metric drop"),
                ))

        # Sort by severity (largest significant drops first)
        alerts.sort(key=lambda a: (a.is_significant, a.drop_magnitude), reverse=True)
        return alerts

    def generate_dashboard_data(self, report: EvaluationReport) -> dict:
        """
        Format evaluation results for a monitoring dashboard.
        Compatible with common time-series monitoring systems (Grafana, etc.)
        """
        means = {
            "faithfulness": report.mean_faithfulness,
            "answer_relevance": report.mean_answer_relevance,
            "context_precision": report.mean_context_precision,
            "context_recall": report.mean_context_recall,
        }
        return {
            "timestamp": report.timestamp,
            # One entry per metric, all with the same shape
            "metrics": {
                name: {
                    "value": value,
                    "std": report.per_metric_std.get(name, 0.0),
                    "threshold": METRIC_THRESHOLDS[name],
                    "status": "ok" if value >= METRIC_THRESHOLDS[name] else "alert",
                    "below_threshold_count": report.below_threshold.get(name, 0),
                }
                for name, value in means.items()
            },
            "summary": {
                "overall_score": report.mean_overall,
                "n_evaluated": report.num_datapoints,
                "any_below_threshold": any(
                    v > 0 for v in report.below_threshold.values()
                ),
            },
        }


# ─────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────

def run_demo():
    """Demonstrate the full RAGAS evaluation pipeline."""
    pipeline = ContinuousEvalPipeline()

    # Sample production queries and evaluate
    datapoints = pipeline.sample_production_queries(n=3)
    print(f"Evaluating {len(datapoints)} production sample queries...\n")

    report = pipeline.run_evaluation(datapoints)

    # Check against thresholds
    print("\nThreshold checks:")
    for metric, threshold in METRIC_THRESHOLDS.items():
        count = report.below_threshold.get(metric, 0)
        if count > 0:
            print(f"  ALERT: {metric} - {count}/{report.num_datapoints} datapoints below {threshold}")
        else:
            print(f"  OK: {metric}")

    # Generate dashboard data
    dashboard = pipeline.generate_dashboard_data(report)
    print(f"\nDashboard data: {json.dumps(dashboard, indent=2)}")


if __name__ == "__main__":
    run_demo()

RAGAS Metrics and What They Catch


Continuous Evaluation Pipeline Architecture


Metric Diagnostic Guide


Production Engineering Notes

Evaluation Cost at Scale

Running RAGAS on 50 production queries per evaluation cycle:

Per datapoint (1 question, 5 context chunks, 1 answer):
  Faithfulness:
    - Extract claims:       ~200 tokens haiku     = $0.000025
    - Verify 5 claims:      5 × 50 tokens haiku   = $0.000031
  Answer Relevance:
    - Generate 3 questions: ~300 tokens haiku     = $0.000038
    - Embedding calls:      ~$0.00003 (OpenAI text-embedding-3-small)
  Context Precision:
    - Score 5 chunks:       5 × 50 tokens haiku   = $0.000031
  Context Recall:
    - Extract statements:   ~200 tokens haiku     = $0.000025
    - Check 5 statements:   5 × 50 tokens haiku   = $0.000031
  Total per datapoint: ~$0.000211 (LLM calls ~$0.000181 + embeddings ~$0.00003)

50 datapoints per cycle: ~$0.011 per evaluation run
Hourly evaluation: ~$0.25/day
Daily evaluation: ~$0.011/day
RAGAS is exceptionally cheap to run continuously. The evaluation cost is orders of magnitude less than the cost of a production incident caused by undetected quality degradation.
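The arithmetic behind the breakdown can be reproduced with a small calculator. This is a sketch under stated assumptions: the token counts and the embedding cost are copied from the breakdown, `PRICE_PER_MTOK` is the per-token rate implied by its line items (not a quoted price), and `cost_per_datapoint` and `cost_per_day` are hypothetical helper names. Substitute your provider's current rates before relying on the totals.

```python
# Token counts per datapoint, copied from the cost breakdown.
LLM_TOKENS = {
    "faithfulness_extract_claims": 200,
    "faithfulness_verify_claims": 5 * 50,
    "answer_relevance_generate_questions": 300,
    "context_precision_score_chunks": 5 * 50,
    "context_recall_extract_statements": 200,
    "context_recall_check_statements": 5 * 50,
}
EMBEDDING_COST = 0.00003  # USD per datapoint, copied from the breakdown
PRICE_PER_MTOK = 0.125    # ASSUMPTION: USD per million tokens implied by the line items


def cost_per_datapoint(price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """LLM token cost plus embedding cost for one evaluated datapoint."""
    llm_cost = sum(LLM_TOKENS.values()) * price_per_mtok / 1_000_000
    return llm_cost + EMBEDDING_COST


def cost_per_day(n_datapoints: int, runs_per_day: int,
                 price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Daily cost for a schedule of runs_per_day cycles of n_datapoints each."""
    return cost_per_datapoint(price_per_mtok) * n_datapoints * runs_per_day


if __name__ == "__main__":
    print(f"per datapoint:      ${cost_per_datapoint():.6f}")
    print(f"hourly, 50 queries: ${cost_per_day(50, 24):.3f}/day")
    print(f"daily, 50 queries:  ${cost_per_day(50, 1):.3f}/day")
```

Keeping the price as a parameter makes it easy to see how the totals scale with a more expensive judge model: the cost is linear in the per-token rate.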

Statistical Significance for Regression Detection

A drop from 0.85 to 0.82 faithfulness could be noise or could be a real regression. With N=50 samples in each run (baseline and current, hence 98 degrees of freedom) and a typical faithfulness standard deviation of 0.12, the minimum detectable difference at p<0.05 with a one-tailed t-test is approximately:

$$\Delta_{\min} = t_{0.05,\,98} \times \sigma \times \sqrt{2/N} \approx 1.66 \times 0.12 \times \sqrt{0.04} \approx 0.040$$

This means drops larger than 0.04 are statistically significant at p<0.05 with N=50. The REGRESSION_THRESHOLDS in the code (0.05 for faithfulness) are calibrated to be slightly above this minimum detectable difference - avoiding false alerts from sampling noise while catching real regressions.

For higher sensitivity: increase N. For fewer false alarms: increase the threshold. The tradeoff is detection speed vs. alert precision.
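The tradeoff between N and the threshold can be explored numerically. The sketch below uses only the standard library and therefore swaps the t critical value for a normal approximation (about 1.645 instead of 1.66 at 98 degrees of freedom), which is close for N near 50. `min_detectable_drop` and `samples_needed` are hypothetical helpers, not part of the pipeline code, and like the formula above they control only the false-alarm rate, not statistical power.

```python
import math
from statistics import NormalDist


def min_detectable_drop(sigma: float, n: int, alpha: float = 0.05) -> float:
    """Smallest drop in the mean that a one-tailed two-sample test flags at
    level alpha, with n samples per run (normal approximation to the t-test)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return z * sigma * math.sqrt(2 / n)


def samples_needed(sigma: float, drop: float, alpha: float = 0.05) -> int:
    """Samples per run needed so that a drop of the given size is flagged."""
    z = NormalDist().inv_cdf(1 - alpha)
    return math.ceil(2 * (z * sigma / drop) ** 2)


if __name__ == "__main__":
    for n in (50, 100, 300):
        print(f"N={n}: min detectable drop = {min_detectable_drop(0.12, n):.3f}")
    print(f"N needed for a 0.02 drop: {samples_needed(0.12, 0.02)}")
```

Because the detectable drop shrinks with the square root of N, halving it requires roughly four times as many samples per run.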

Building the Golden Test Set

A golden test set is a fixed set of (question, context, ground_truth) triples that does not change between evaluation runs. It provides a stable reference point:

Composition: 100-300 questions covering the full distribution of query types (factual, conceptual, procedural, comparative, troubleshooting) and the full topic coverage of your corpus.

Quality: Every (question, ground_truth) pair should be manually verified by a domain expert before inclusion. Test set quality directly determines evaluation quality.

Maintenance: Review and update the golden test set when:

  • New documents are added to the corpus
  • New query types are observed in production
  • A question's ground truth changes due to updated documentation

Stratification: Maintain separate sub-sets by query type and topic. This allows you to detect that faithfulness degraded only for technical procedural questions - pointing to a specific chunk type or a specific document section, rather than system-wide degradation.
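Stratified reporting amounts to grouping per-datapoint scores by a label before averaging. A minimal sketch, assuming each datapoint carries a stratum label such as the `question_type` stored in its metadata; `mean_by_stratum` is a hypothetical helper and the scores below are made up for illustration:

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


def mean_by_stratum(
    scores: list[Optional[float]],
    strata: list[str],
) -> dict[str, float]:
    """Average a metric separately for each stratum, skipping missing scores."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for score, stratum in zip(scores, strata):
        if score is not None:
            grouped[stratum].append(score)
    return {stratum: mean(values) for stratum, values in grouped.items()}


if __name__ == "__main__":
    # Illustrative faithfulness scores and their query types
    faithfulness = [0.92, 0.55, 0.90, 0.60, None]
    query_types = ["factual", "procedural", "factual", "procedural", "factual"]
    for stratum, value in mean_by_stratum(faithfulness, query_types).items():
        print(f"{stratum}: {value:.2f}")
```

A system-wide mean over these illustrative scores would hide the fact that only the procedural questions are scoring poorly, which is the signal stratification is meant to surface.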

Alert Thresholds by Use Case

Default METRIC_THRESHOLDS are appropriate for general customer support. Adjust for high-stakes domains:

# Customer support / general knowledge
THRESHOLDS_GENERAL = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

# Medical / legal / financial - higher stakes, tighter thresholds
THRESHOLDS_HIGH_STAKES = {
    "faithfulness": 0.95,  # Near-zero hallucination tolerance
    "answer_relevance": 0.90,
    "context_precision": 0.85,
    "context_recall": 0.85,  # Cannot miss relevant information
}

# Internal tooling / developer documentation - slightly relaxed
THRESHOLDS_DEVELOPER = {
    "faithfulness": 0.80,
    "answer_relevance": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.65,
}

:::tip Start with Faithfulness and Context Precision

If you are implementing RAGAS for the first time and need to prioritize, start with faithfulness and context precision. Faithfulness catches hallucination - the most dangerous failure mode. Context precision catches retrieval noise - the most common infrastructure failure (document ingestion bugs, corrupted chunks, embedding quality issues). Both are reference-free, so you do not need to build a labeled dataset to start.

Add answer relevance next (reference-free, catches query distribution shift). Add context recall last - it requires building the golden test set with ground truth answers, which takes time but provides the most complete picture of retrieval quality.

:::

:::warning LLM-as-Judge Bias: The Self-Evaluation Problem

Using Claude to evaluate Claude introduces evaluator bias. The same model family tends to agree with itself - if claude-opus-4-6 generated the answer, claude-haiku-4-5-20251001 may score it more generously than an independent human evaluator would.

Mitigation strategies:

  1. Use claim-level verification (faithfulness) rather than holistic quality judgment - more objective
  2. Periodically validate your RAGAS scores against human evaluation on a sample of datapoints
  3. Track correlation between RAGAS scores and downstream user satisfaction metrics - if they diverge, the evaluator is biased
  4. Consider using a different model family for evaluation than for generation when model diversity is available

:::

:::danger Metric Gaming: Optimizing Numbers Instead of Quality

As soon as a metric becomes a target, teams find ways to improve it without improving the underlying system. Common RAGAS gaming patterns:

  • Faithfulness gaming: shorten answers to reduce claims, making all claims verifiable but the answer incomplete
  • Context precision gaming: reduce K (number of retrieved chunks) to retrieve fewer but higher-precision chunks, degrading recall
  • Answer relevance gaming: make answers more generic so generated questions always resemble the original

Guard against gaming by:

  1. Tracking all metrics simultaneously - gaming one typically degrades another
  2. Tracking downstream user metrics alongside RAGAS metrics - gaming should not improve user satisfaction
  3. Auditing metric improvements with human evaluation - verify that a metric improvement reflects a real quality improvement
  4. Setting minimum thresholds for all metrics, not maximizing any single one

:::
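The first guard, tracking all metrics together, can be partly automated by flagging runs where one metric improves while another degrades. This is a rough sketch: `gaming_suspects` is a hypothetical helper, and the `min_change` cutoff of 0.03 is an arbitrary assumption that should be set above your sampling noise.

```python
def gaming_suspects(
    baseline: dict[str, float],
    current: dict[str, float],
    min_change: float = 0.03,
) -> list[tuple[str, str]]:
    """Return (improved_metric, degraded_metric) pairs between two runs.

    A pair is only a prompt for human review: a genuine improvement can
    also trade one metric against another.
    """
    improved = [m for m in baseline if current[m] - baseline[m] >= min_change]
    degraded = [m for m in baseline if baseline[m] - current[m] >= min_change]
    return [(up, down) for up in improved for down in degraded]


if __name__ == "__main__":
    # Illustrative values: K was reduced, so precision rose while recall fell
    baseline = {"faithfulness": 0.86, "context_precision": 0.75, "context_recall": 0.72}
    current = {"faithfulness": 0.86, "context_precision": 0.88, "context_recall": 0.60}
    for up, down in gaming_suspects(baseline, current):
        print(f"review: {up} improved while {down} degraded")
```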


Interview Q&A

Q1: What are the four RAGAS metrics, and which failure mode does each one catch?

RAGAS defines four complementary metrics, each designed to catch a different RAG failure mode:

Faithfulness catches hallucination. It extracts factual claims from the generated answer and verifies each against the retrieved context. A low faithfulness score means the LLM is generating content that goes beyond - or contradicts - what the retrieved documents actually say. The algorithm is reference-free: it requires only the answer and the retrieved context, no ground truth.

Answer Relevance catches off-topic answers. It generates N questions from the answer using reverse generation, then measures how similar those questions are to the original question. If the answer drifts from the question, the reverse-generated questions will also drift. A low answer relevance score means the system retrieved plausible but wrong-domain documents and generated an accurate but irrelevant answer. Also reference-free.

Context Precision catches retrieval noise. It scores each retrieved chunk for binary relevance to the question and computes average precision (rank-aware). A low context precision score means the retriever is returning plausible-sounding but irrelevant chunks that dilute the context window and confuse the generator. Reference-free.

Context Recall catches retrieval misses. It extracts statements from the ground truth answer and checks whether each is supported by any retrieved chunk. A low context recall score means relevant documents exist in the corpus but were not retrieved. Requires ground truth.
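Once the LLM judge has produced binary verdicts, the scores themselves are simple ratios. The sketch below shows one common formulation, assuming the verdicts are already available as lists of booleans; the function names are illustrative and the evaluator classes in this tutorial may differ in detail.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Share of items judged supported by the retrieved context.

    With claims extracted from the answer this is a faithfulness score;
    with statements extracted from the ground truth it is a context recall score.
    """
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def context_precision(relevance: list[bool]) -> float:
    """Rank-aware average precision over per-chunk relevance verdicts,
    listed in retrieval order (best-ranked chunk first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Same two relevant chunks, different positions in the ranking
    print(context_precision([True, True, False]))
    print(context_precision([False, True, True]))
    # Four of five answer claims supported by the context
    print(supported_fraction([True, True, True, True, False]))
```

The two `context_precision` calls contain the same number of relevant chunks, but the ranking that places them first scores higher, which is what "rank-aware" means here.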

Q2: Why is the Answer Relevance metric computed by reverse-generating questions rather than by directly comparing the answer to the question?

Directly comparing the answer text to the question text for semantic similarity is unreliable. A relevant answer and an irrelevant answer can have very similar surface-level similarity to the question - both may contain question-related vocabulary - while differing fundamentally in whether they actually address the question.

The reverse generation approach is more discriminative. An answer that correctly addresses "What are the causes of API timeouts?" will generate questions like "Why do API timeouts occur?" and "What triggers gateway timeouts?" - close to the original. An answer that instead describes the API authentication flow will generate questions like "How does OAuth work with this API?" and "What is the JWT token structure?" - far from the original question.

The cosine similarity between the original question embedding and the reverse-generated question embeddings effectively measures how much the answer "talks about" the original question's topic, which is a better proxy for relevance than direct answer-question similarity.
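The scoring step can be sketched as the mean cosine similarity between the original question's embedding and the embeddings of the reverse-generated questions. The vectors below are hand-made two-dimensional toys standing in for real embeddings, and `answer_relevance` is an illustrative function rather than the evaluator class used in this tutorial.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec: list[float],
                     generated_vecs: list[list[float]]) -> float:
    """Mean similarity between the original question and each question
    reverse-generated from the answer."""
    if not generated_vecs:
        return 0.0
    return sum(cosine(question_vec, g) for g in generated_vecs) / len(generated_vecs)


if __name__ == "__main__":
    # Toy embedding space: axis 0 ~ "timeouts", axis 1 ~ "authentication"
    original_question = [1.0, 0.0]
    from_on_topic_answer = [[1.0, 0.1], [0.9, 0.2]]
    from_off_topic_answer = [[0.1, 1.0], [0.0, 1.0]]
    print(f"on-topic:  {answer_relevance(original_question, from_on_topic_answer):.2f}")
    print(f"off-topic: {answer_relevance(original_question, from_off_topic_answer):.2f}")
```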

Q3: When would Context Precision be high but Context Recall be low? What does this pattern indicate and how do you fix it?

This pattern - high precision, low recall - means the retrieval system is finding some relevant documents, and finding them accurately (low noise), but missing other relevant documents that also exist in the corpus.

Concrete scenario: a question asks about three distinct aspects of a topic. The retriever retrieves K=5 chunks, all of which are relevant to one aspect (high precision). But the corpus contains chunks about the other two aspects that were not retrieved (low recall). The user gets an accurate but incomplete answer.

Root causes and fixes:

  • Narrow embedding neighborhood: The query embedding only retrieves chunks that are very similar to the query surface form. Fix: use HyDE or multi-query retrieval to cover more of the relevant embedding space.
  • Low K: Retrieving only 3-5 chunks may miss tail-relevant documents. Fix: increase K, then use a reranker to filter to the best 5 from a larger candidate set.
  • Vocabulary gap: Relevant documents exist but use different terminology. The query embedding does not land near them. Fix: add synonym expansion or step-back prompting to bridge the vocabulary gap.
  • Chunking artifacts: Relevant information is split across chunk boundaries and neither half chunk is individually relevant enough to be retrieved. Fix: re-chunk with larger overlap or use parent-child retrieval.
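Two of these fixes, multi-query retrieval and "retrieve wide, then rerank", combine naturally. The sketch below is a toy under stated assumptions: `search` and `rerank_score` are hypothetical callables standing in for your retriever and reranker, and the keyword-overlap search exists only to make the example runnable.

```python
from typing import Callable


def retrieve_wide_then_rerank(
    query_variants: list[str],
    search: Callable[[str, int], list[str]],
    rerank_score: Callable[[str, str], float],
    candidate_k: int = 20,
    final_k: int = 5,
) -> list[str]:
    """Union the candidates from every query variant, then keep the best
    final_k according to a reranker scored against the original query."""
    candidates: dict[str, None] = {}  # dict keeps insertion order and dedupes
    for variant in query_variants:
        for chunk in search(variant, candidate_k):
            candidates.setdefault(chunk, None)
    original = query_variants[0]
    ranked = sorted(candidates, key=lambda c: rerank_score(original, c), reverse=True)
    return ranked[:final_k]


def word_overlap(a: str, b: str) -> int:
    """Toy relevance signal: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


CORPUS = [
    "timeout errors come from slow upstream servers",
    "gateway latency spikes under heavy load",
    "jwt tokens expire after one hour",
]


def toy_search(query: str, k: int) -> list[str]:
    """Toy retriever: chunks sharing at least one word with the query."""
    scored = sorted(((word_overlap(query, c), c) for c in CORPUS), key=lambda p: -p[0])
    return [chunk for score, chunk in scored if score > 0][:k]


if __name__ == "__main__":
    variants = ["why do timeout errors happen", "what causes gateway latency"]
    print(toy_search(variants[0], 5))  # single query: misses the latency chunk
    print(retrieve_wide_then_rerank(variants, toy_search, word_overlap))
```

The second variant uses different vocabulary from the original query, so it reaches a chunk the first query cannot, which is the vocabulary-gap case described above.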

Q4: How would you build and maintain a golden test set for continuous RAG evaluation?

A golden test set is a static, manually validated collection of (question, context, ground_truth) triples. Construction:

  1. Initial generation: Use Claude to generate diverse questions from representative corpus chunks, covering all major topic areas and question types (factual, conceptual, procedural, comparative). Generate more than you need - typically 2-3x your target size.

  2. Manual curation: Have domain experts review each (question, ground_truth) pair. Discard questions with ambiguous answers, outdated information, or low difficulty. Annotate question types and topic areas for stratified analysis.

  3. Coverage validation: Verify the test set covers the full distribution of production queries. Sample 100 real production queries and check that similar questions exist in the golden set. Gaps indicate under-represented topics.

  4. Maintenance: Review the golden set when: (a) new documents are added to the corpus (add questions for the new content), (b) existing documents are updated (verify ground truths are still correct), (c) a new query type is observed frequently in production (add representative questions for that type). Never edit the golden set to make metrics look better - it must evolve to reflect real knowledge updates, not metric optimization.

Size recommendations: 100 questions minimum for early evaluation. 300+ for reliable regression detection (sufficient statistical power to detect 0.05 drops with p<0.05). 1000+ for production systems with multiple sub-domains that need separate monitoring.

Q5: A faithfulness metric drops from 0.88 to 0.71 overnight. How do you diagnose the root cause?

A 17-point drop overnight is almost certainly a system change, not natural drift. Systematic diagnostic approach:

Step 1: Scope the problem. Run faithfulness evaluation broken down by question type and topic area. If the drop is uniform across all types, it is a system-wide issue. If it is concentrated in one topic, it is a topic-specific issue (likely a document ingestion problem for that topic's documents).

Step 2: Check system changes. Review the deployment log for the previous 24 hours. Prompt changes, model version changes, and document ingestion runs are the most common causes. A prompt that became more "creative" will produce answers that go beyond what the context says.

Step 3: Inspect failing examples. Pull the 10 lowest-faithfulness datapoints from the evaluation run. Manually read the question, the answer, and the retrieved context. Is the LLM generating information that contradicts the context? Or information that is absent from the context? These have different causes - contradiction suggests a prompt issue; absent information suggests retrieval degradation that confused the generator.

Step 4: Check document ingestion. If the drop correlates with a document ingestion run, inspect the recently ingested chunks. Look for: truncated content (byte-boundary corruption), garbled encoding, incorrect metadata, or duplicate chunks with conflicting information. Run a sample of the affected chunks through the embedding + retrieval pipeline to verify they return coherently.

Step 5: Check context precision correlation. If context precision also dropped, the retrieval is returning lower-quality chunks, which the LLM is then forced to work with poorly - leading to hallucination to fill gaps. Fix the retrieval first; faithfulness may recover without changing the generation prompt.
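Steps 1 and 3 can share a small triage helper: pull the lowest-scoring datapoints, then count which topics they come from. A minimal sketch, assuming each evaluated datapoint has been flattened into a dict with `question`, `topic`, and metric keys; that record shape and both function names are assumptions for illustration.

```python
from collections import Counter


def lowest_scoring(records: list[dict], metric: str = "faithfulness",
                   k: int = 10) -> list[dict]:
    """Return the k records with the lowest value for the given metric,
    ignoring records where the metric is missing."""
    scored = [r for r in records if r.get(metric) is not None]
    return sorted(scored, key=lambda r: r[metric])[:k]


def topic_concentration(records: list[dict]) -> Counter:
    """Count how many of the given records fall under each topic."""
    return Counter(r.get("topic", "unknown") for r in records)


if __name__ == "__main__":
    # Illustrative evaluation records
    run = [
        {"question": "How do refunds work?", "topic": "billing", "faithfulness": 0.91},
        {"question": "How do I rotate API keys?", "topic": "auth", "faithfulness": 0.32},
        {"question": "When do tokens expire?", "topic": "auth", "faithfulness": 0.41},
        {"question": "What plans exist?", "topic": "billing", "faithfulness": None},
    ]
    worst = lowest_scoring(run, k=2)
    for record in worst:
        print(f"{record['faithfulness']:.2f}  {record['question']}")
    print(topic_concentration(worst))
```

If the worst examples cluster in one topic, start with that topic's documents and their most recent ingestion run before touching the generation prompt.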

Q6: How do you validate that RAGAS metrics are actually correlated with user satisfaction in your specific application?

RAGAS metrics are validated on general benchmarks but their correlation with your specific application's user satisfaction must be verified empirically.

Concurrent measurement: For 4-6 weeks, run RAGAS evaluation on sampled queries and collect user satisfaction ratings (thumbs up/down, CSAT scores, task completion rates) on the same queries. Build a correlation matrix between each RAGAS metric and each user satisfaction signal.

Expected correlations: Faithfulness should correlate with user trust and with users' tendency to report misinformation. Context recall should correlate with answer completeness ratings. Answer relevance should correlate with task completion rates.

Calibration signals: If faithfulness scores 0.9 but users frequently report wrong answers, your evaluator is too permissive - tighten the claim verification prompt. If faithfulness scores 0.7 and users are happy, your evaluator may be too strict on hedging language or on claims that are implicitly supported by context.

Action: Run this validation quarterly. User query distributions shift, corpus content evolves, and the relationship between proxy metrics and ground truth satisfaction can drift. A metric that was well-calibrated six months ago may need recalibration after a significant product change.


Summary

RAG evaluation is an operational discipline, not a one-time check. The team that shipped without evaluation found a 16-point satisfaction drop three months later - with no idea what changed, spending two weeks diagnosing what continuous evaluation would have caught in hours.

The four RAGAS metrics cover all major RAG failure modes:

| Metric | Failure Mode | Reference-Free | Fix Direction |
|---|---|---|---|
| Faithfulness | Hallucination | Yes | Tighten prompt, improve grounding instructions |
| Answer Relevance | Off-topic answers | Yes | Fix retrieval routing, address query distribution shift |
| Context Precision | Retrieval noise | Yes | Fix document ingestion, improve reranking, reduce K |
| Context Recall | Retrieval misses | No (needs ground truth) | Fix embeddings, chunking, query expansion |

Deploy continuous evaluation from day one. Run on sampled production queries daily. Compare against a baseline. Alert on drops exceeding the regression threshold. Build the golden test set incrementally - start with 50 questions, grow to 300. Instrument with dashboards. Review alerts within 24 hours.

The cost of continuous RAGAS evaluation is approximately $0.01 per 50-query evaluation run. The cost of a month of silent degradation is measured in user churn, support tickets, and the engineering time to diagnose a problem that monitoring would have flagged on day one.

© 2026 EngineersOfAI. All rights reserved.