What is RAG evaluation?

Build a continuous RAG evaluation pipeline using the RAGAS framework - faithfulness, answer relevance, context precision, and context recall - with full production implementations using the Anthropic SDK and automated regression detection.

RAG Evaluation and RAGAS

The Invisible Degradation

The team shipped a RAG-powered customer support system with an 87% user satisfaction score in beta. Three months into production, a product manager pulled the weekly dashboard and noticed satisfaction had dropped to 71%. Sixteen percentage points in twelve weeks. The team had no idea what had changed.

They had no evaluation metrics running. No automatic tests on a golden question set. No monitoring of retrieval quality over time. They had measured satisfaction at launch, seen a good number, and moved on. The system was a black box - it worked or it did not, and the only signal was lagged user feedback that arrived days after the failures.

Two weeks of investigation later, the team found two independent issues. A document ingestion bug introduced three weeks earlier had corrupted 15% of their chunks - stripping metadata, truncating content at random byte boundaries, and introducing garbled text that would embed poorly and retrieve even more poorly. The system had been serving answers grounded in garbage for three weeks, and nobody had known. Separately, a vocabulary shift in new user queries - the company had launched a new product feature that users talked about using new terminology - meant that retrieval was systematically missing relevant documentation because the new terms did not match the embedding neighborhoods of the existing docs.

Both issues would have been caught within hours by a continuous evaluation pipeline. The document corruption would have shown up immediately as a collapse in context precision - the retrieved chunks would have scored low on relevance because garbled chunks could not be relevant to any query. The vocabulary shift would have shown as a decline in context recall - the system was retrieving topically related chunks but missing the chunks that actually answered the question. Two numbers, two actionable root causes, caught within hours rather than three weeks after the fact.

RAG evaluation is not optional for production systems. It is the difference between deploying a system you can maintain and debug versus deploying a system whose failures are invisible until users leave. This lesson covers the RAGAS framework in production depth: the four core metrics, how each catches a distinct failure mode, how to implement them using Claude as an LLM judge, and how to build a continuous evaluation pipeline that catches regressions automatically.

The mathematics are accessible. The implementation is straightforward. The hard part is the discipline: running evaluation continuously, acting on metric changes, and resisting the temptation to tune metrics instead of fixing the underlying system.

Why RAG Evaluation Is Hard

No Single Correct Answer

In classification tasks, evaluation is simple: the prediction either matches the label or it does not. RAG evaluation is harder because there is no single correct answer. "What are the causes of API timeouts?" has many valid answers of varying completeness. Partial answers are better than empty answers but worse than complete answers. Answers built from different retrieved contexts can each be valid without being equivalent. For most production queries there is no ground-truth label to compare against.

Hallucination Is Invisible Without Active Detection

A RAG system can generate a confident, well-structured, plausible-sounding answer that is completely ungrounded in its retrieved context. This is the hallucination problem. Without active verification - checking whether each claim in the generated answer is actually present in the retrieved chunks - hallucination is undetectable from the answer text alone.

Users often cannot detect hallucination either. A response that sounds authoritative about a technical topic will be accepted by users who do not independently know the answer. The user satisfaction signal does not reliably detect hallucination, especially in the short term.

Retrieval Quality and Generation Quality Are Independent Failure Modes

A RAG system can fail in two independent ways: retrieving the wrong documents, or generating a wrong answer from correct documents. These require different fixes - retrieval failures require improving the retrieval pipeline, generation failures require improving the prompt or the LLM. Without separate retrieval metrics and generation metrics, you cannot tell which is failing.

The Coupled-but-Independent Problem

Retrieval quality and generation quality are coupled - poor retrieval causes poor generation - but they can also fail independently. A system can have excellent retrieval (it finds the right documents) but poor generation (it hallucinates beyond what the documents say). Or it can have poor retrieval (it misses key documents) but passable generation (it generates something plausible from what little it has). User satisfaction conflates both failure modes. RAGAS separates them into orthogonal metrics.

Historical Context

RAG evaluation research accelerated alongside RAG deployment. Early practitioners used ad-hoc metrics: ROUGE for surface-level answer similarity, human evaluation for quality, and classical information-retrieval metrics like MAP and NDCG for retrieval quality. These approaches were expensive (human evaluation does not scale), slow (weekly evaluation cycles instead of continuous), and incomplete (ROUGE does not measure grounding).

RAGAS (Es et al., 2023) - "Ragas: Automated Evaluation of Retrieval Augmented Generation" - introduced a framework of complementary metrics designed specifically for RAG evaluation. The key innovation was making most of the metrics reference-free: as defined in this lesson, faithfulness, answer relevance, and context precision can be computed without a ground-truth answer, using an LLM as a judge. This made continuous production monitoring feasible - you do not need a human to label every production query to detect quality degradation.

The paper demonstrated strong correlation between RAGAS metrics and human evaluation scores across multiple domains, validating the LLM-as-judge approach for RAG quality assessment.

LLM-as-judge as a general evaluation paradigm was validated by Zheng et al. (LMSYS, 2023) in "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," which showed that GPT-4-level models can serve as reliable evaluators for open-ended generation quality, with agreement rates comparable to the agreement between human evaluators.

RAGAS has been widely adopted in production RAG systems as a standard evaluation baseline. The framework continues to evolve - later releases added metrics such as noise sensitivity, answer correctness, and context entity recall - but the four core metrics remain the primary diagnostic tools.

The Four RAGAS Metrics

Each metric measures a different failure mode. Together they form a complete diagnostic picture of why a RAG system is failing.

Metric 1: Faithfulness

What it measures: Are all claims in the generated answer actually supported by the retrieved context?

The failure mode it catches: Hallucination. When the LLM generates information that is not present in - or contradicts - the retrieved chunks.

Why it matters: Faithfulness failures are the most dangerous RAG failure mode. A system with low faithfulness is generating confident misinformation. Users cannot detect this without independent verification. In medical, legal, or financial applications, faithfulness failures can cause real harm.

Algorithm:

1. Extract all factual claims from the generated answer
2. For each claim, check whether it is supported by any retrieved chunk
3. Faithfulness = (supported claims) / (total claims)
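
Once every claim has a verdict, the steps above reduce to a simple ratio. The following is a minimal sketch, assuming the claims have already been extracted. `faithfulness_score` and `naive_judge` are hypothetical names, and the substring check only stands in for an LLM judge so the example runs without an API call.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: list[str],
    is_supported: Callable[[str, list[str]], bool],
) -> float:
    """Fraction of claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0  # assumption: an answer with no extractable claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: list[str]) -> bool:
    """Illustrative stand-in for an LLM judge: case-insensitive substring match."""
    return any(claim.lower() in chunk.lower() for chunk in context)


context = [
    "The API rate limit is 100 requests per minute. Tokens expire after 3600 seconds."
]
claims = [
    "the api rate limit is 100 requests per minute",
    "tokens expire after 3600 seconds",
    "refresh tokens last 30 days",  # not present in the context
]

score = faithfulness_score(claims, context, naive_judge)
print(round(score, 3))  # two of the three claims are supported
```

A real judge has to handle paraphrase and partial support, which is why the verification step is usually delegated to an LLM rather than to string matching.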

Score interpretation:

1.0: Every claim in the answer is grounded in retrieved context
0.8-0.99: Mostly grounded, some minor unsupported additions
0.5-0.8: Significant hallucination, answer goes substantially beyond context
Below 0.5: Severely unfaithful - answer is largely hallucinated

Reference-free: Yes. Faithfulness evaluation requires only the answer and the retrieved context - no ground truth answer needed.

Metric 2: Answer Relevance

What it measures: Does the generated answer actually address the question that was asked?

The failure mode it catches: Off-topic answers. When the system generates a high-quality, well-grounded response to a subtly different question than the one asked.

Why it matters: A RAG system can be highly faithful (all claims grounded) while producing an irrelevant answer - if it retrieved the wrong documents and generated an accurate description of those documents rather than an answer to the question. Answer relevance catches this semantic mismatch.

Algorithm (counterintuitive but elegant):

1. Generate N questions from the answer text (using an LLM)
2. Measure the semantic similarity between each generated question and the original question
3. Answer Relevance = mean similarity score across N generated questions

The insight: if the answer is relevant to the question, then questions generated from that answer should resemble the original question. If the answer drifts from the question, the generated questions will also drift.
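
The reverse-generation idea can be illustrated without an LLM or an embedding model. In this sketch the generated questions are supplied by hand, and a bag-of-words vector stands in for a real embedding. Both are assumptions made for illustration, as are the helper names `bow_vector`, `cosine`, and `answer_relevance`.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevance(question: str, generated_questions: list[str]) -> float:
    """Mean similarity between the original question and the generated ones."""
    if not generated_questions:
        return 0.0
    q_vec = bow_vector(question)
    sims = [cosine(q_vec, bow_vector(g)) for g in generated_questions]
    return sum(sims) / len(sims)


question = "What causes API timeouts?"
# Questions an LLM might generate from an on-topic answer (hand-written here)
on_topic = ["What causes API timeouts?", "Why do API timeouts happen?"]
# Questions an LLM might generate from an answer that drifted
off_topic = ["How do I reset my password?", "Where is the billing page?"]

on_score = answer_relevance(question, on_topic)
off_score = answer_relevance(question, off_topic)
print(on_score > off_score)  # True
```

Bag-of-words similarity misses paraphrases that share no vocabulary, so a production version would swap in a real embedding model. The scoring logic stays the same.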

Score interpretation:

1.0: Answer perfectly addresses the question
0.8-0.99: Minor relevance drift, answer mostly addresses the question
0.5-0.8: Partial relevance, answer addresses a related but different question
Below 0.5: Answer does not address the question

Reference-free: Yes. Requires only the question and the answer.

Metric 3: Context Precision

What it measures: Of the retrieved chunks, what fraction are actually relevant to the question?

The failure mode it catches: Retrieval noise. When the retriever returns plausible-sounding but actually irrelevant documents. These noisy chunks dilute the signal-to-noise ratio of the context window, making it harder for the LLM to generate accurate answers.

Why it matters: Context window real estate is limited. Noisy chunks take up space that could be used for relevant information. They also confuse the LLM - it may try to reconcile irrelevant information with the question, producing off-topic or hallucinated answers.

Algorithm:

1. For each retrieved chunk at rank $k$, assign a binary relevance label $\mathrm{rel}(k)$: is this chunk relevant to the question?
2. Compute precision at each rank: $P@k = \frac{\text{relevant chunks in top } k}{k}$
3. Context Precision = average precision: $AP = \frac{1}{R} \sum_{k=1}^{K} P@k \cdot \mathrm{rel}(k)$, where $R$ is the number of relevant chunks among the $K$ retrieved

The rank-aware version rewards systems that put relevant chunks at the top of the ranked list. A system that retrieves the right chunk but ranks it 10th scores lower than one that ranks it 1st.
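
To make the rank sensitivity concrete, here is a small sketch of average precision over hand-assigned relevance labels. The labels are illustrative, not output from a real judge.

```python
def average_precision(relevance: list[bool]) -> float:
    """AP = (1/R) * sum over ranks k of P@k * rel(k), with R = number of relevant chunks."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # P@k, counted only at relevant ranks
    return precision_sum / total_relevant


ranked_first = [True] + [False] * 9   # the one relevant chunk is at rank 1
ranked_tenth = [False] * 9 + [True]   # the same chunk is at rank 10

print(average_precision(ranked_first))  # 1.0
print(average_precision(ranked_tenth))  # 0.1
```

Under this definition, irrelevant chunks ranked after the last relevant one do not lower the score. AP penalizes noise that appears above relevant chunks, not trailing noise.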

Score interpretation:

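1.0: Every relevant chunk is ranked above every irrelevant chunk
0.7-0.99: Relevant chunks are mostly at the top, with some noise ranked above them
0.5-0.7: Substantial noise is ranked above or between the relevant chunks
Below 0.5: More noise than signal near the top of the ranking - retrieval is fundamentally broken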

Reference-free: Yes. Requires only the question and the retrieved chunks.

Metric 4: Context Recall

What it measures: Did the retrieval system find all the information necessary to answer the question?

The failure mode it catches: Retrieval misses. When the relevant documents exist in the corpus but the retrieval system failed to find them. The generator cannot produce a complete answer from incomplete context, no matter how good it is.

Why it matters: Context recall measures whether your retrieval is finding what matters. High context recall means you are not leaving relevant information on the table. Low context recall means the answer will be incomplete even with a perfect generator.

Algorithm (requires ground truth answer):

1. Extract factual statements from the ground truth answer
2. For each statement, check whether it is supported by any retrieved chunk
3. Context Recall = (statements supported by retrieved context) / (total statements in ground truth)
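
The sketch below shows how a retrieval miss turns into a lower recall score. It assumes the ground-truth statements are already extracted, and it uses a hypothetical `keyword_judge` (every word of the statement must appear in a single chunk) as a stand-in for the LLM check.

```python
from typing import Callable, Optional


def context_recall(
    gt_statements: list[str],
    context: list[str],
    is_supported: Callable[[str, list[str]], bool],
) -> Optional[float]:
    """Fraction of ground-truth statements supported by the retrieved context."""
    if not gt_statements:
        return None  # nothing to measure against
    supported = sum(1 for s in gt_statements if is_supported(s, context))
    return supported / len(gt_statements)


def keyword_judge(statement: str, context: list[str]) -> bool:
    """Illustrative stand-in for an LLM judge: all statement words occur in one chunk."""
    words = set(statement.lower().split())
    return any(
        words <= set(chunk.lower().replace(".", "").split()) for chunk in context
    )


gt_statements = [
    "timeouts default to 30 seconds",
    "retries use exponential backoff",
]
partial_context = ["Request timeouts default to 30 seconds."]
full_context = partial_context + ["Retries use exponential backoff."]

partial_recall = context_recall(gt_statements, partial_context, keyword_judge)
full_recall = context_recall(gt_statements, full_context, keyword_judge)
print(partial_recall, full_recall)  # 0.5 1.0
```

The generated answer never enters the calculation: context recall is a property of retrieval alone, measured against the reference answer.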

Score interpretation:

1.0: All information needed to answer the question was retrieved
0.8-0.99: Nearly all information retrieved, minor gaps
0.5-0.8: Significant information missing from retrieved context
Below 0.5: Retrieval is missing most of what matters

Requires ground truth: Yes. Context recall requires a reference answer to measure against. This limits it to your evaluation dataset rather than production monitoring.

Metric Relationships: Diagnosing Failure Patterns

| Faithfulness | Answer Relevance | Context Precision | Context Recall | Diagnosis |
| --- | --- | --- | --- | --- |
| Low | Any | Any | Any | LLM is hallucinating beyond retrieved context - fix prompt, model, or post-processing |
| High | Low | Any | Any | Answer is grounded but off-topic - retrieval found wrong domain, or question is ambiguous |
| High | High | Low | High | Retrieval adds noise but still finds what matters - tighten reranking, reduce K |
| High | High | High | Low | Retrieval misses key documents - fix embedding, chunking, or indexing |
| High | High | Low | Low | Retrieval fundamentally broken - noisy and missing simultaneously |
| High | High | High | High | Healthy system - monitor for drift |
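
The diagnosis table can be read as a decision procedure, checked top to bottom. The sketch below is one possible encoding. The `diagnose` function is hypothetical, and the cut-offs separating "low" from "high" are assumed values that should be tuned per system.

```python
# Assumed cut-offs between "low" and "high"; tune these for your own system.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}


def diagnose(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> str:
    """Map four metric scores to a diagnosis, checking table rows top to bottom."""
    low = {name: scores[name] < thresholds[name] for name in thresholds}
    if low["faithfulness"]:
        return "hallucination: fix prompt, model, or post-processing"
    if low["answer_relevance"]:
        return "grounded but off-topic: wrong domain retrieved or ambiguous question"
    if low["context_precision"] and low["context_recall"]:
        return "retrieval broken: noisy and missing simultaneously"
    if low["context_precision"]:
        return "retrieval noise: tighten reranking, reduce K"
    if low["context_recall"]:
        return "retrieval misses: fix embedding, chunking, or indexing"
    return "healthy: monitor for drift"


example = {
    "faithfulness": 0.93,
    "answer_relevance": 0.88,
    "context_precision": 0.81,
    "context_recall": 0.52,
}
print(diagnose(example))  # retrieval misses: fix embedding, chunking, or indexing
```

Checking faithfulness first mirrors the table: a hallucinating generator makes the downstream rows unreliable, so it is the first thing to rule out.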

Production Code

import anthropic
import asyncio
import json
import re
import statistics
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()


# ─────────────────────────────────────────────
# Data Structures
# ─────────────────────────────────────────────

@dataclass
class RAGASDatapoint:
    """A single RAG evaluation data point."""
    question: str
    answer: str
    context: list[str]               # Retrieved chunks as raw text strings
    ground_truth: Optional[str] = None  # Reference answer (for context recall)
    metadata: dict = field(default_factory=dict)


@dataclass
class RAGASMetrics:
    """Computed RAGAS metrics for a single datapoint."""
    faithfulness: Optional[float] = None        # 0-1, reference-free
    answer_relevance: Optional[float] = None    # 0-1, reference-free
    context_precision: Optional[float] = None   # 0-1, reference-free
    context_recall: Optional[float] = None      # 0-1, requires ground_truth

    def overall_score(self) -> float:
        """Mean of all available metrics."""
        available = [v for v in [
            self.faithfulness, self.answer_relevance,
            self.context_precision, self.context_recall
        ] if v is not None]
        return statistics.mean(available) if available else 0.0

    def to_dict(self) -> dict:
        return {
            "faithfulness": self.faithfulness,
            "answer_relevance": self.answer_relevance,
            "context_precision": self.context_precision,
            "context_recall": self.context_recall,
            "overall": self.overall_score(),
        }


@dataclass
class EvaluationReport:
    """Aggregated evaluation report across a batch of datapoints."""
    num_datapoints: int
    mean_faithfulness: float
    mean_answer_relevance: float
    mean_context_precision: float
    mean_context_recall: float
    mean_overall: float
    per_metric_std: dict
    below_threshold: dict  # metric → count of datapoints below threshold
    timestamp: str = ""

    def to_dict(self) -> dict:
        return {
            "n": self.num_datapoints,
            "faithfulness": round(self.mean_faithfulness, 4),
            "answer_relevance": round(self.mean_answer_relevance, 4),
            "context_precision": round(self.mean_context_precision, 4),
            "context_recall": round(self.mean_context_recall, 4),
            "overall": round(self.mean_overall, 4),
            "std": {k: round(v, 4) for k, v in self.per_metric_std.items()},
            "below_threshold": self.below_threshold,
        }


@dataclass
class RegressionAlert:
    """Alert generated when a metric drops significantly from baseline."""
    metric: str
    baseline_value: float
    current_value: float
    drop_magnitude: float
    is_significant: bool
    action_required: str


# ─────────────────────────────────────────────
# Metric Thresholds for Alerting
# ─────────────────────────────────────────────

METRIC_THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

REGRESSION_THRESHOLDS = {
    "faithfulness": 0.05,      # Alert if drops by 5 points
    "answer_relevance": 0.05,
    "context_precision": 0.08,
    "context_recall": 0.08,
}


# ─────────────────────────────────────────────
# Faithfulness Evaluator
# ─────────────────────────────────────────────

class FaithfulnessEvaluator:
    """
    Measures whether all claims in the answer are supported by retrieved context.

    Algorithm:
    1. Extract all factual claims from the answer (haiku)
    2. For each claim, verify whether it appears in the retrieved context (haiku)
    3. Score = supported_claims / total_claims
    """

    _EXTRACT_CLAIMS_PROMPT = """Extract all distinct factual claims from this answer.

A claim is a specific factual assertion that could be independently verified.
Return each claim as a single concise sentence.
Ignore claims about what the answer "does not know" or hedging language.

Answer: {answer}

Return ONLY a JSON array of claim strings. No preamble, no numbering.
Example: ["The API rate limit is 100 requests per minute", "Tokens expire after 3600 seconds"]

JSON array:"""

    _VERIFY_CLAIM_PROMPT = """Does the following retrieved context support this specific claim?

Claim: {claim}

Retrieved context:
{context}

Respond with YES if the context explicitly supports this claim.
Respond with NO if the context does not contain information supporting this claim.

Respond with exactly one word: YES or NO"""

    def extract_claims(self, answer: str) -> list[str]:
        """Extract factual claims from the answer using claude-haiku-4-5-20251001."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{
                "role": "user",
                "content": self._EXTRACT_CLAIMS_PROMPT.format(answer=answer)
            }]
        )
        raw = message.content[0].text.strip()
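        try:
            # Greedy match: a lazy `.*?` would stop at the first "]", even when
            # that bracket sits inside one of the claim strings.
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            parsed = json.loads(match.group() if match else raw)
            if not isinstance(parsed, list):
                raise ValueError("expected a JSON array")
            return [str(claim) for claim in parsed]
        except (json.JSONDecodeError, ValueError):
            # Fallback: split by newline, clean up
            lines = [ln.strip().strip('"').strip('-').strip() for ln in raw.split('\n') if ln.strip()]
            return [ln for ln in lines if len(ln) > 10]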

    def verify_claim(self, claim: str, context: list[str]) -> bool:
        """Check whether a claim is supported by any retrieved chunk."""
        context_text = "\n\n---\n\n".join(context[:5])  # Limit to top 5 chunks
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._VERIFY_CLAIM_PROMPT.format(
                    claim=claim,
                    context=context_text[:2000],
                )
            }]
        )
        return message.content[0].text.strip().upper().startswith("YES")

    async def verify_claim_async(self, claim: str, context: list[str]) -> bool:
        """Async version for parallel claim verification."""
        context_text = "\n\n---\n\n".join(context[:5])
        message = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._VERIFY_CLAIM_PROMPT.format(
                    claim=claim,
                    context=context_text[:2000],
                )
            }]
        )
        return message.content[0].text.strip().upper().startswith("YES")

    async def score_async(self, datapoint: RAGASDatapoint) -> float:
        """
        Score faithfulness asynchronously - verify all claims in parallel.
        Returns 0.0 if no claims extracted (defensive default).
        """
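        # extract_claims uses the blocking client; run it in a worker thread
        # so other metrics sharing this event loop are not stalled.
        claims = await asyncio.to_thread(self.extract_claims, datapoint.answer)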
        if not claims:
            return 0.0

        # Verify all claims in parallel
        tasks = [
            self.verify_claim_async(claim, datapoint.context)
            for claim in claims
        ]
        results = await asyncio.gather(*tasks)

        supported = sum(1 for r in results if r)
        return supported / len(claims)

    def score(self, datapoint: RAGASDatapoint) -> float:
        """Synchronous faithfulness score (for single datapoints)."""
        return asyncio.run(self.score_async(datapoint))


# ─────────────────────────────────────────────
# Answer Relevance Evaluator
# ─────────────────────────────────────────────

class AnswerRelevanceEvaluator:
    """
    Measures whether the answer actually addresses the question.

    Algorithm (reverse generation):
    1. Generate N questions from the answer text
    2. Compute semantic similarity between each generated question and original question
    3. Answer Relevance = mean similarity score

    Insight: if the answer is relevant, questions generated from it
    should closely resemble the original question. Drift in the generated
    questions signals drift in the answer from the question.
    """

    _GENERATE_QUESTIONS_PROMPT = """Given this answer, generate {n} different questions that this
answer could plausibly be answering.

The questions should be diverse - different phrasings, different angles -
but all questions should be legitimately answerable by the given answer.

Return ONLY a JSON array of question strings.

Answer: {answer}

JSON array of {n} questions:"""

    def generate_questions_from_answer(self, answer: str, n: int = 3) -> list[str]:
        """Generate n questions from the answer - the reverse generation step."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": self._GENERATE_QUESTIONS_PROMPT.format(
                    answer=answer[:1500],
                    n=n,
                )
            }]
        )
        raw = message.content[0].text.strip()
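        try:
            # Greedy match so a "]" inside a question string does not truncate the array
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            questions = json.loads(match.group() if match else raw)
            if not isinstance(questions, list):
                raise ValueError("expected a JSON array")
            return [str(q) for q in questions][:n]
        except (json.JSONDecodeError, ValueError):
            # Fallback: treat each non-empty line as a candidate question
            lines = [ln.strip().strip('"').strip('-').strip() for ln in raw.split('\n') if ln.strip()]
            return [ln for ln in lines if '?' in ln or len(ln) > 15][:n]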

    def get_embedding(self, text: str) -> np.ndarray:
        """
        Get embedding for a text string.
        In production: call your embedding model (OpenAI, Cohere, etc.)
        Here: use a deterministic mock based on word overlap for demonstration.
        """
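        # Mock: pseudo-random base vector seeded from the first 50 characters,
        # plus a weak hashed bag-of-words signal. Scores from this mock are
        # placeholders only and carry no semantic meaning.
        # Replace with: openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        words = set(text.lower().split())
        rng = np.random.RandomState(sum(ord(c) for c in text[:50]) % (2**31))
        base = rng.randn(1536)
        # Add a weak word-overlap signal on top of the base vector
        word_signal = np.zeros(1536)
        # sorted() keeps the 20-word cut deterministic; set iteration order
        # varies between processes because of string hash randomization.
        for w in sorted(words)[:20]:
            word_hash = sum(ord(c) for c in w) % 1536
            word_signal[word_hash] += 0.5
        embedding = base + word_signal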
        # Normalize
        norm = np.linalg.norm(embedding)
        return (embedding / norm).astype(np.float32) if norm > 0 else embedding.astype(np.float32)

    def cosine_similarity(self, v1: np.ndarray, v2: np.ndarray) -> float:
        """Cosine similarity between two normalized vectors."""
        return float(np.dot(v1, v2))

    def score(self, datapoint: RAGASDatapoint, n_questions: int = 3) -> float:
        """
        Compute answer relevance score.

        1. Generate n questions from the answer
        2. Embed original question and each generated question
        3. Return mean cosine similarity
        """
        generated_questions = self.generate_questions_from_answer(datapoint.answer, n=n_questions)
        if not generated_questions:
            return 0.0

        original_embedding = self.get_embedding(datapoint.question)
        similarities = []

        for gen_q in generated_questions:
            gen_embedding = self.get_embedding(gen_q)
            sim = self.cosine_similarity(original_embedding, gen_embedding)
            similarities.append(sim)

        # Convert from cosine similarity range [-1,1] to [0,1]
        mean_sim = statistics.mean(similarities)
        return max(0.0, min(1.0, (mean_sim + 1) / 2))


# ─────────────────────────────────────────────
# Context Precision Evaluator
# ─────────────────────────────────────────────

class ContextPrecisionEvaluator:
    """
    Measures the fraction of retrieved chunks that are relevant to the question.

    Algorithm:
    1. For each retrieved chunk at rank k, score binary relevance (relevant/not)
    2. Compute average precision (AP) - rank-aware precision metric
    3. Context Precision = AP score

    AP rewards systems that place relevant chunks at the top of the ranked list.
    """

    _RELEVANCE_PROMPT = """Is this retrieved passage relevant to answering the question?

Question: {question}
Passage: {passage}

A passage is relevant if it contains information that helps answer the question.
A passage is not relevant if it discusses related topics but does not help answer this specific question.

Respond with exactly one word: RELEVANT or NOT_RELEVANT"""

    def score_chunk_relevance(self, question: str, chunk: str) -> bool:
        """Binary relevance classification for a single chunk."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._RELEVANCE_PROMPT.format(
                    question=question,
                    passage=chunk[:800],
                )
            }]
        )
        # startswith, not `in`: "NOT_RELEVANT" contains the substring "RELEVANT"
        return message.content[0].text.strip().upper().startswith("RELEVANT")

    async def score_chunk_relevance_async(self, question: str, chunk: str) -> bool:
        """Async version for parallel evaluation."""
        message = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._RELEVANCE_PROMPT.format(
                    question=question,
                    passage=chunk[:800],
                )
            }]
        )
        # startswith, not `in`: "NOT_RELEVANT" contains the substring "RELEVANT"
        return message.content[0].text.strip().upper().startswith("RELEVANT")

    def compute_average_precision(self, relevance_labels: list[bool]) -> float:
        """
        Compute Average Precision from binary relevance labels.

        AP = (1/R) * sum_{k=1}^{K} P@k * rel(k)

        where R is total relevant chunks, P@k is precision at rank k,
        and rel(k) is 1 if chunk at rank k is relevant.
        """
        if not relevance_labels:
            return 0.0

        num_relevant = sum(relevance_labels)
        if num_relevant == 0:
            return 0.0

        precision_at_k_sum = 0.0
        running_relevant = 0

        for k, is_relevant in enumerate(relevance_labels, start=1):
            if is_relevant:
                running_relevant += 1
                precision_at_k = running_relevant / k
                precision_at_k_sum += precision_at_k

        return precision_at_k_sum / num_relevant

    async def score_async(self, datapoint: RAGASDatapoint) -> float:
        """Score context precision asynchronously - evaluate all chunks in parallel."""
        if not datapoint.context:
            return 0.0

        tasks = [
            self.score_chunk_relevance_async(datapoint.question, chunk)
            for chunk in datapoint.context
        ]
        relevance_labels = list(await asyncio.gather(*tasks))
        return self.compute_average_precision(relevance_labels)

    def score(self, datapoint: RAGASDatapoint) -> float:
        """Synchronous context precision score."""
        return asyncio.run(self.score_async(datapoint))


# ─────────────────────────────────────────────
# Context Recall Evaluator
# ─────────────────────────────────────────────

class ContextRecallEvaluator:
    """
    Measures whether the retrieval found all information needed to answer the question.

    Requires ground_truth in the RAGASDatapoint.

    Algorithm:
    1. Extract factual statements from the ground truth answer
    2. For each statement, check whether it appears in any retrieved chunk
    3. Context Recall = supported_statements / total_statements
    """

    _EXTRACT_STATEMENTS_PROMPT = """Extract all distinct factual statements from this reference answer.

Each statement should be a single, specific, independently verifiable fact.
Keep statements concise and precise.

Reference answer: {ground_truth}

Return ONLY a JSON array of statement strings.
Example: ["The rate limit is 100 requests per minute", "Tokens expire after 3600 seconds"]

JSON array:"""

    _CHECK_STATEMENT_PROMPT = """Is this statement supported by the retrieved context?

Statement: {statement}

Retrieved context:
{context}

Respond YES if the retrieved context explicitly contains information supporting this statement.
Respond NO if this statement's information is absent from the retrieved context.

Respond with exactly one word: YES or NO"""

    def extract_ground_truth_statements(self, ground_truth: str) -> list[str]:
        """Extract factual statements from the ground truth answer."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{
                "role": "user",
                "content": self._EXTRACT_STATEMENTS_PROMPT.format(ground_truth=ground_truth)
            }]
        )
        raw = message.content[0].text.strip()
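        try:
            # Greedy match so a "]" inside a statement string does not truncate the array
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            parsed = json.loads(match.group() if match else raw)
            if not isinstance(parsed, list):
                raise ValueError("expected a JSON array")
            return [str(stmt) for stmt in parsed]
        except (json.JSONDecodeError, ValueError):
            # Fallback: split by newline, clean up
            lines = [ln.strip().strip('"').strip('-').strip() for ln in raw.split('\n') if ln.strip()]
            return [ln for ln in lines if len(ln) > 10]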

    async def check_statement_async(self, statement: str, context: list[str]) -> bool:
        """Check whether a statement is supported by the retrieved context."""
        context_text = "\n\n---\n\n".join(context[:6])
        message = await async_client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._CHECK_STATEMENT_PROMPT.format(
                    statement=statement,
                    context=context_text[:2000],
                )
            }]
        )
        return message.content[0].text.strip().upper().startswith("YES")

    async def score_async(self, datapoint: RAGASDatapoint) -> Optional[float]:
        """Score context recall asynchronously. Returns None if no ground truth."""
        if not datapoint.ground_truth:
            return None

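        # The extraction call uses the blocking client; run it in a worker
        # thread so it does not stall the event loop.
        statements = await asyncio.to_thread(
            self.extract_ground_truth_statements, datapoint.ground_truth
        )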
        if not statements:
            return None

        tasks = [
            self.check_statement_async(stmt, datapoint.context)
            for stmt in statements
        ]
        results = list(await asyncio.gather(*tasks))
        supported = sum(1 for r in results if r)
        return supported / len(statements)

    def score(self, datapoint: RAGASDatapoint) -> Optional[float]:
        """Synchronous context recall score."""
        return asyncio.run(self.score_async(datapoint))


# ─────────────────────────────────────────────
# RAGAS Evaluator: Combines All Metrics
# ─────────────────────────────────────────────

class RAGASEvaluator:
    """
    Orchestrates all four RAGAS metrics for comprehensive RAG evaluation.

    Runs all metrics in parallel where possible. Uses haiku for all
    sub-evaluations to minimize cost. Reference-free metrics run without
    ground truth; context recall runs only when ground truth is available.
    """

    def __init__(self):
        self.faithfulness = FaithfulnessEvaluator()
        self.answer_relevance = AnswerRelevanceEvaluator()
        self.context_precision = ContextPrecisionEvaluator()
        self.context_recall = ContextRecallEvaluator()

    async def evaluate_async(self, datapoint: RAGASDatapoint) -> RAGASMetrics:
        """Evaluate a single datapoint - run independent metrics in parallel."""
        # The three LLM-judged metrics are coroutines and run concurrently.
        # Context recall returns None when the datapoint has no ground truth.
        faith_task = self.faithfulness.score_async(datapoint)
        prec_task = self.context_precision.score_async(datapoint)
        recall_task = self.context_recall.score_async(datapoint)

        # Answer relevance is synchronous (blocking embedding calls), so run it
        # in a worker thread; calling it directly would stall the event loop.
        async def answer_relevance_wrapper():
            return await asyncio.to_thread(self.answer_relevance.score, datapoint)

        faith_score, prec_score, recall_score, rel_score = await asyncio.gather(
            faith_task,
            prec_task,
            recall_task,
            answer_relevance_wrapper(),
        )

        return RAGASMetrics(
            faithfulness=faith_score,
            answer_relevance=rel_score,
            context_precision=prec_score,
            context_recall=recall_score,
        )

    def evaluate(self, datapoint: RAGASDatapoint) -> RAGASMetrics:
        """Synchronous evaluation for a single datapoint."""
        return asyncio.run(self.evaluate_async(datapoint))

    async def evaluate_batch_async(
        self,
        datapoints: list[RAGASDatapoint],
        max_concurrent: int = 5,
    ) -> list[RAGASMetrics]:
        """
        Evaluate a batch of datapoints with controlled concurrency.

        max_concurrent limits simultaneous API calls to avoid rate limiting.
        """
        semaphore = asyncio.Semaphore(max_concurrent)

        async def eval_with_semaphore(dp: RAGASDatapoint) -> RAGASMetrics:
            async with semaphore:
                return await self.evaluate_async(dp)

        tasks = [eval_with_semaphore(dp) for dp in datapoints]
        return list(await asyncio.gather(*tasks))

    def evaluate_batch(
        self,
        datapoints: list[RAGASDatapoint],
        max_concurrent: int = 5,
    ) -> list[RAGASMetrics]:
        """Synchronous batch evaluation."""
        return asyncio.run(self.evaluate_batch_async(datapoints, max_concurrent))

    def aggregate_report(
        self,
        results: list[RAGASMetrics],
        datapoints: list[RAGASDatapoint],
    ) -> EvaluationReport:
        """
        Aggregate individual metric scores into a summary report.
        Computes means, standard deviations, and threshold violation counts.
        """
        from datetime import datetime

        faith_scores = [r.faithfulness for r in results if r.faithfulness is not None]
        rel_scores = [r.answer_relevance for r in results if r.answer_relevance is not None]
        prec_scores = [r.context_precision for r in results if r.context_precision is not None]
        recall_scores = [r.context_recall for r in results if r.context_recall is not None]
        overall_scores = [r.overall_score() for r in results]

        def safe_mean(scores: list[float]) -> float:
            return statistics.mean(scores) if scores else 0.0

        def safe_std(scores: list[float]) -> float:
            return statistics.stdev(scores) if len(scores) > 1 else 0.0

        # Build the metric-to-scores lookup once rather than on every iteration.
        scores_map = {
            "faithfulness": faith_scores,
            "answer_relevance": rel_scores,
            "context_precision": prec_scores,
            "context_recall": recall_scores,
        }
        below_threshold = {}
        for metric, threshold in METRIC_THRESHOLDS.items():
            scores = scores_map[metric]
            below_threshold[metric] = sum(1 for s in scores if s < threshold)

        return EvaluationReport(
            num_datapoints=len(results),
            mean_faithfulness=safe_mean(faith_scores),
            mean_answer_relevance=safe_mean(rel_scores),
            mean_context_precision=safe_mean(prec_scores),
            mean_context_recall=safe_mean(recall_scores),
            mean_overall=safe_mean(overall_scores),
            per_metric_std={
                "faithfulness": safe_std(faith_scores),
                "answer_relevance": safe_std(rel_scores),
                "context_precision": safe_std(prec_scores),
                "context_recall": safe_std(recall_scores),
            },
            below_threshold=below_threshold,
            timestamp=datetime.utcnow().isoformat(),
        )


# ─────────────────────────────────────────────
# Test Set Generator
# ─────────────────────────────────────────────

class TestSetGenerator:
    """
    Generates a diverse evaluation test set from the document corpus.

    Creates questions of different types to ensure comprehensive coverage:
    - Factual: specific, directly answerable from a single chunk
    - Multi-hop: requires combining information from multiple chunks
    - Abstractive: requires synthesizing across the full topic
    - Comparative: compares two concepts or options
    """

    _GENERATE_QUESTIONS_PROMPT = """Given this document passage, generate {n} high-quality evaluation questions.

Generate a mix of question types:
- FACTUAL: asks for a specific fact directly stated in the passage
- ABSTRACTIVE: requires understanding and synthesis, not just copying
- MULTI_HOP: requires information from multiple parts of the passage

For each question, also provide the ground_truth answer from the passage.

Return a JSON array where each element is:
{{"question": "...", "ground_truth": "...", "type": "FACTUAL|ABSTRACTIVE|MULTI_HOP"}}

Passage: {passage}

JSON array of {n} questions with answers:"""

    def generate_questions(
        self,
        corpus_chunks: list[str],
        n_per_chunk: int = 3,
        max_chunks: int = 50,
    ) -> list[RAGASDatapoint]:
        """
        Generate evaluation questions from corpus chunks.

        For a production test set: sample diverse chunks to cover the
        full corpus, not just the most common topics.
        """
        test_datapoints = []
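        # Takes the first max_chunks chunks as-is. For real coverage, shuffle
        # or stratify corpus_chunks before calling this method.
        selected_chunks = corpus_chunks[:max_chunks]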

        for chunk in selected_chunks:
            if len(chunk) < 100:
                continue  # Skip very short chunks

            message = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=800,
                messages=[{
                    "role": "user",
                    "content": self._GENERATE_QUESTIONS_PROMPT.format(
                        passage=chunk[:1500],
                        n=n_per_chunk,
                    )
                }]
            )

            raw = message.content[0].text.strip()
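            try:
                # Greedy match so a ']' inside a question or answer does not
                # cut the JSON array short.
                match = re.search(r'\[.*\]', raw, re.DOTALL)
                questions_data = json.loads(match.group() if match else raw)
            except (json.JSONDecodeError, AttributeError):
                continue  # Skip chunks whose response is not valid JSON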

            for item in questions_data:
                if isinstance(item, dict) and "question" in item:
                    datapoint = RAGASDatapoint(
                        question=item.get("question", ""),
                        answer="",          # To be filled by your RAG system
                        context=[],         # To be filled by retrieval
                        ground_truth=item.get("ground_truth"),
                        metadata={
                            "question_type": item.get("type", "UNKNOWN"),
                            "source_chunk": chunk[:100],
                        }
                    )
                    test_datapoints.append(datapoint)

        return test_datapoints

    def generate_from_qa_pairs(
        self,
        qa_pairs: list[dict],
        rag_system_fn,
    ) -> list[RAGASDatapoint]:
        """
        Create RAGASDatapoints from (question, ground_truth) pairs by running
        your RAG system to generate answers and retrieve context.

        Args:
            qa_pairs: list of {"question": "...", "ground_truth": "..."} dicts
            rag_system_fn: callable that takes question → (answer, context_chunks)
        """
        datapoints = []
        for pair in qa_pairs:
            question = pair["question"]
            ground_truth = pair.get("ground_truth")

            answer, context = rag_system_fn(question)
            datapoints.append(RAGASDatapoint(
                question=question,
                answer=answer,
                context=context,
                ground_truth=ground_truth,
            ))

        return datapoints


# ─────────────────────────────────────────────
# Continuous Evaluation Pipeline
# ─────────────────────────────────────────────

class ContinuousEvalPipeline:
    """
    Production continuous evaluation pipeline.

    Runs RAGAS evaluation on a sampled subset of production queries
    and on a fixed golden test set. Detects regressions by comparing
    current metrics to a stored baseline.

    Deployment pattern:
    - Run on a schedule (hourly/daily)
    - Store results in a time-series database
    - Alert on metric drops exceeding REGRESSION_THRESHOLDS
    - Automatically create tickets for below-threshold metrics
    """

    def __init__(self, rag_system_fn=None, baseline: Optional[EvaluationReport] = None):
        """
        Args:
            rag_system_fn: callable(question: str) → (answer: str, context: list[str])
            baseline: stored baseline report for regression comparison
        """
        self.evaluator = RAGASEvaluator()
        self.rag_system_fn = rag_system_fn
        self.baseline = baseline
        self._production_query_log: list[dict] = []  # Mock production log

    def _populate_mock_query_log(self):
        """Mock production query log for demonstration."""
        self._production_query_log = [
            {
                "question": "What causes HTTP 504 gateway timeout errors?",
                "answer": "HTTP 504 Gateway Timeout errors occur when the upstream server does not respond within the configured timeout window. Common causes include high server load, slow database queries blocking request processing, and network congestion between services. The default timeout threshold is typically 30 seconds.",
                "context": [
                    "Gateway timeout errors (HTTP 504) occur when the upstream server fails to respond within the configured timeout window. Common causes include high server load, slow database queries, and network congestion.",
                    "Request timeout thresholds can be configured using the timeout_seconds parameter. For high-load scenarios, consider increasing to 60-120 seconds.",
                ],
                "ground_truth": "HTTP 504 Gateway Timeout occurs when the upstream server fails to respond within the timeout window. Causes include high server load, slow database queries, and network congestion. Default timeout is 30 seconds.",
            },
            {
                "question": "How do I configure connection pooling?",
                "answer": "Connection pool settings are configured via the max_pool_size parameter, which defaults to 10. Under high load, when all connections are active, new requests queue. You should monitor pool_active and pool_idle metrics to detect pool saturation, and increase max_pool_size proportional to your concurrent request rate.",
                "context": [
                    "Connection pool exhaustion is a frequent cause of slowdowns under load. Configure max_connections appropriately for your concurrency level. Monitor pool_active and pool_idle metrics to detect pool starvation.",
                ],
                "ground_truth": "Configure connection pooling via max_pool_size (default 10). Monitor pool_active and pool_idle metrics. Increase max_pool_size proportional to concurrent request rate.",
            },
            {
                "question": "What is the JWT token expiration time?",
                "answer": "JWT tokens expire after 3600 seconds (1 hour). Refresh tokens remain valid for 30 days. When a token expires, the API returns HTTP 401 Unauthorized. Implement automatic token refresh in your client to avoid authentication errors.",
                "context": [
                    "API authentication uses Bearer token scheme with JWT. Tokens expire after 3600 seconds. Refresh tokens are valid for 30 days. When a token expires, the API returns HTTP 401 Unauthorized.",
                ],
                "ground_truth": "JWT tokens expire after 3600 seconds. Refresh tokens are valid for 30 days. Expired tokens return HTTP 401.",
            },
        ]

    def sample_production_queries(self, n: int = 50) -> list[RAGASDatapoint]:
        """
        Sample n queries from the production query log.
        In production: pull from your request logging system.
        """
        self._populate_mock_query_log()

        datapoints = []
        for entry in self._production_query_log[:n]:
            datapoints.append(RAGASDatapoint(
                question=entry["question"],
                answer=entry["answer"],
                context=entry["context"],
                ground_truth=entry.get("ground_truth"),
                metadata={"source": "production_sample"},
            ))

        return datapoints

    def run_evaluation(
        self,
        datapoints: Optional[list[RAGASDatapoint]] = None,
        n_sample: int = 50,
    ) -> EvaluationReport:
        """
        Run RAGAS evaluation on provided datapoints or sampled production queries.
        """
        if datapoints is None:
            datapoints = self.sample_production_queries(n=n_sample)

        print(f"Evaluating {len(datapoints)} datapoints...")
        metrics_list = self.evaluator.evaluate_batch(datapoints, max_concurrent=5)
        report = self.evaluator.aggregate_report(metrics_list, datapoints)

        print("Evaluation complete:")
        print(f"  Faithfulness:      {report.mean_faithfulness:.3f}")
        print(f"  Answer Relevance:  {report.mean_answer_relevance:.3f}")
        print(f"  Context Precision: {report.mean_context_precision:.3f}")
        print(f"  Context Recall:    {report.mean_context_recall:.3f}")
        print(f"  Overall:           {report.mean_overall:.3f}")

        return report

    def detect_regression(
        self,
        current: EvaluationReport,
        baseline: Optional[EvaluationReport] = None,
    ) -> list[RegressionAlert]:
        """
        Compare current metrics to baseline. Generate alerts for significant drops.

        Detection uses a fixed absolute threshold: a metric is flagged as
        significant when it drops by more than REGRESSION_THRESHOLDS[metric].

        No significance test is run here. Choose each threshold above the
        sampling noise for your sample size and observed standard deviation
        (see the section on statistical significance for regression detection).
        """
        baseline = baseline or self.baseline
        if not baseline:
            return []

        alerts = []
        metric_pairs = [
            ("faithfulness", current.mean_faithfulness, baseline.mean_faithfulness),
            ("answer_relevance", current.mean_answer_relevance, baseline.mean_answer_relevance),
            ("context_precision", current.mean_context_precision, baseline.mean_context_precision),
            ("context_recall", current.mean_context_recall, baseline.mean_context_recall),
        ]

        for metric, current_val, baseline_val in metric_pairs:
            drop = baseline_val - current_val
            threshold = REGRESSION_THRESHOLDS.get(metric, 0.05)
            is_significant = drop > threshold

            if drop > 0:  # Any drop is worth recording
                action_map = {
                    "faithfulness": "Check for prompt changes or model drift causing hallucination",
                    "answer_relevance": "Check for query distribution shift or retrieval returning wrong-domain docs",
                    "context_precision": "Check for document ingestion issues, corpus corruption, or embedding degradation",
                    "context_recall": "Check for new vocabulary in user queries not covered by existing document terms",
                }

                alerts.append(RegressionAlert(
                    metric=metric,
                    baseline_value=baseline_val,
                    current_value=current_val,
                    drop_magnitude=drop,
                    is_significant=is_significant,
                    action_required=action_map.get(metric, "Investigate metric drop"),
                ))

        # Sort by severity (largest significant drops first)
        alerts.sort(key=lambda a: (a.is_significant, a.drop_magnitude), reverse=True)
        return alerts

    def generate_dashboard_data(self, report: EvaluationReport) -> dict:
        """
        Format evaluation results for a monitoring dashboard.
        Compatible with common time-series monitoring systems (Grafana, etc.)
        """
        return {
            "timestamp": report.timestamp,
            "metrics": {
                "faithfulness": {
                    "value": report.mean_faithfulness,
                    "std": report.per_metric_std.get("faithfulness", 0.0),
                    "threshold": METRIC_THRESHOLDS["faithfulness"],
                    "status": "ok" if report.mean_faithfulness >= METRIC_THRESHOLDS["faithfulness"] else "alert",
                    "below_threshold_count": report.below_threshold.get("faithfulness", 0),
                },
                "answer_relevance": {
                    "value": report.mean_answer_relevance,
                    "std": report.per_metric_std.get("answer_relevance", 0.0),
                    "threshold": METRIC_THRESHOLDS["answer_relevance"],
                    "status": "ok" if report.mean_answer_relevance >= METRIC_THRESHOLDS["answer_relevance"] else "alert",
                    "below_threshold_count": report.below_threshold.get("answer_relevance", 0),
                },
                "context_precision": {
                    "value": report.mean_context_precision,
                    "std": report.per_metric_std.get("context_precision", 0.0),
                    "threshold": METRIC_THRESHOLDS["context_precision"],
                    "status": "ok" if report.mean_context_precision >= METRIC_THRESHOLDS["context_precision"] else "alert",
                    "below_threshold_count": report.below_threshold.get("context_precision", 0),
                },
                "context_recall": {
                    "value": report.mean_context_recall,
                    "std": report.per_metric_std.get("context_recall", 0.0),
                    "threshold": METRIC_THRESHOLDS["context_recall"],
                    "status": "ok" if report.mean_context_recall >= METRIC_THRESHOLDS["context_recall"] else "alert",
                    "below_threshold_count": report.below_threshold.get("context_recall", 0),
                },
            },
            "summary": {
                "overall_score": report.mean_overall,
                "n_evaluated": report.num_datapoints,
                "any_below_threshold": any(
                    v > 0 for v in report.below_threshold.values()
                ),
            },
        }


# ─────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────

def run_demo():
    """Demonstrate the full RAGAS evaluation pipeline."""
    pipeline = ContinuousEvalPipeline()

    # Sample production queries and evaluate
    datapoints = pipeline.sample_production_queries(n=3)
    print(f"Evaluating {len(datapoints)} production sample queries...\n")

    report = pipeline.run_evaluation(datapoints)

    # Check against thresholds
    print("\nThreshold checks:")
    for metric, threshold in METRIC_THRESHOLDS.items():
        count = report.below_threshold.get(metric, 0)
        if count > 0:
            print(f"  ALERT: {metric} - {count}/{report.num_datapoints} datapoints below {threshold}")
        else:
            print(f"  OK:    {metric}")

    # Generate dashboard data
    dashboard = pipeline.generate_dashboard_data(report)
    print(f"\nDashboard data: {json.dumps(dashboard, indent=2)}")


if __name__ == "__main__":
    run_demo()

RAGAS Metrics and What They Catch

Continuous Evaluation Pipeline Architecture

Metric Diagnostic Guide

Production Engineering Notes

Evaluation Cost at Scale

Running RAGAS on 50 production queries per evaluation cycle:

Per datapoint (1 question, 5 context chunks, 1 answer):
  Faithfulness:
    - Extract claims: ~200 tokens haiku = $0.000025
    - Verify 5 claims: 5 × 50 tokens haiku = $0.000031
  Answer Relevance:
    - Generate 3 questions: ~300 tokens haiku = $0.000038
    - Embedding calls: ~$0.00003 (OpenAI text-embedding-3-small)
  Context Precision:
    - Score 5 chunks: 5 × 50 tokens haiku = $0.000031
  Context Recall:
    - Extract statements: ~200 tokens haiku = $0.000025
    - Check 5 statements: 5 × 50 tokens haiku = $0.000031
  Total per datapoint: ~$0.000211

50 datapoints per cycle:  ~$0.011 per evaluation run
Hourly evaluation:        ~$0.25/day
Daily evaluation:         ~$0.011/day

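RAGAS is cheap to run continuously under these assumptions. The line items imply an effective rate of about $0.125 per million tokens and count only short prompts; the verification prompts in the code above also carry up to 2,000 characters of context each, so recompute with your model's current pricing and measured token usage. Even at many times this estimate, the evaluation cost is far below the cost of a production incident caused by undetected quality degradation.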
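The arithmetic behind the estimate can be re-run whenever prices or token counts change. The sketch below sums the per-datapoint line items listed above and scales them to a schedule. The dictionary keys and function names are illustrative assumptions, not part of the lesson's pipeline, and the dollar values are copied from the estimate rather than measured.

```python
# Per-datapoint line items copied from the estimate above (USD).
# Replace these with figures measured from your own usage logs.
LINE_ITEMS_USD = {
    "faithfulness_extract_claims": 0.000025,
    "faithfulness_verify_claims": 0.000031,
    "answer_relevance_generate_questions": 0.000038,
    "answer_relevance_embeddings": 0.000030,
    "context_precision_score_chunks": 0.000031,
    "context_recall_extract_statements": 0.000025,
    "context_recall_check_statements": 0.000031,
}


def cost_per_datapoint(line_items: dict[str, float]) -> float:
    """Sum the per-call costs for one evaluated datapoint."""
    return sum(line_items.values())


def cost_per_day(
    line_items: dict[str, float],
    datapoints_per_cycle: int,
    cycles_per_day: int,
) -> float:
    """Daily cost = per-datapoint cost x datapoints per cycle x cycles per day."""
    return cost_per_datapoint(line_items) * datapoints_per_cycle * cycles_per_day


per_dp = cost_per_datapoint(LINE_ITEMS_USD)
print(f"per datapoint:          ${per_dp:.6f}")
print(f"per 50-datapoint cycle: ${per_dp * 50:.4f}")
print(f"hourly schedule:        ${cost_per_day(LINE_ITEMS_USD, 50, 24):.3f}/day")
print(f"daily schedule:         ${cost_per_day(LINE_ITEMS_USD, 50, 1):.4f}/day")
```

Keeping the line items in one dictionary makes it easy to swap in measured token usage per call, which is the number most likely to differ from this estimate.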

Statistical Significance for Regression Detection

A drop from 0.85 to 0.82 faithfulness could be noise or could be a real regression. When the baseline and the current run each contain N=50 samples (98 degrees of freedom) and the faithfulness standard deviation is a typical 0.12, the minimum detectable difference at p<0.05 with a one-tailed two-sample t-test is approximately:

$\Delta_{\min} = t_{0.05, 98} \times \sigma \times \sqrt{2/N} \approx 1.66 \times 0.12 \times \sqrt{0.04} \approx 0.040$

This means drops larger than 0.04 are statistically significant at p<0.05 with N=50. The REGRESSION_THRESHOLDS in the code (0.05 for faithfulness) are calibrated to be slightly above this minimum detectable difference - avoiding false alerts from sampling noise while catching real regressions.

For higher sensitivity: increase N. For fewer false alarms: increase the threshold. The tradeoff is detection speed vs. alert precision.
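The formula above is easy to evaluate for other sample sizes. This is a minimal sketch that assumes the baseline and current runs have the same size and standard deviation, and that holds the critical value fixed at the 1.66 used in the text; the true critical value shrinks slightly as N grows, so results for larger N are marginally conservative. The function name is an illustrative assumption.

```python
import math


def min_detectable_drop(sigma: float, n: int, t_crit: float = 1.66) -> float:
    """Smallest drop in the mean that a one-tailed two-sample t-test flags.

    Assumes the baseline and the current run each have n samples with the
    same standard deviation sigma. t_crit=1.66 is the value used in the text
    for p<0.05 with 98 degrees of freedom.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return t_crit * sigma * math.sqrt(2.0 / n)


# Larger evaluation sets detect smaller regressions.
for n in (50, 100, 300):
    print(f"N={n}: minimum detectable drop = {min_detectable_drop(0.12, n):.4f}")
```

Running this before setting REGRESSION_THRESHOLDS shows whether a chosen threshold sits above or below the noise floor for your sample size.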

Building the Golden Test Set

A golden test set is a fixed set of (question, context, ground_truth) triples that does not change between evaluation runs. It provides a stable reference point:

Composition: 100-300 questions covering the full distribution of query types (factual, conceptual, procedural, comparative, troubleshooting) and the full topic coverage of your corpus.

Quality: Every (question, ground_truth) pair should be manually verified by a domain expert before inclusion. Test set quality directly determines evaluation quality.

Maintenance: Review and update the golden test set when:

New documents are added to the corpus
New query types are observed in production
A question's ground truth changes due to updated documentation

Stratification: Maintain separate sub-sets by query type and topic. This allows you to detect that faithfulness degraded only for technical procedural questions - pointing to a specific chunk type or a specific document section, rather than system-wide degradation.
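Stratified reporting can be sketched as a group-by over per-datapoint scores. The record layout below (a dict carrying a `question_type` field and one score per metric) is an assumption for illustration; in the pipeline above the equivalent information lives in each datapoint's metadata and its metrics object.

```python
from collections import defaultdict
from statistics import mean


def mean_by_stratum(
    records: list[dict],
    metric: str,
    key: str = "question_type",
) -> dict[str, float]:
    """Group per-datapoint scores by a metadata field and average each group.

    Records whose score for `metric` is None are skipped.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for record in records:
        score = record.get(metric)
        if score is not None:
            groups[record.get(key, "UNKNOWN")].append(score)
    return {stratum: mean(scores) for stratum, scores in groups.items()}


# Hypothetical per-datapoint results, for illustration only.
records = [
    {"question_type": "factual", "faithfulness": 0.95},
    {"question_type": "factual", "faithfulness": 0.90},
    {"question_type": "procedural", "faithfulness": 0.60},
    {"question_type": "procedural", "faithfulness": 0.70},
    {"question_type": "comparative", "faithfulness": None},
]

by_type = mean_by_stratum(records, "faithfulness")
print(by_type)
```

In this made-up data the overall mean hides the fact that procedural questions score far lower than factual ones, which is the kind of localized degradation stratification is meant to expose.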

Alert Thresholds by Use Case

Default METRIC_THRESHOLDS are appropriate for general customer support. Adjust for high-stakes domains:

# Customer support / general knowledge
THRESHOLDS_GENERAL = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

# Medical / legal / financial - higher stakes, tighter thresholds
THRESHOLDS_HIGH_STAKES = {
    "faithfulness": 0.95,        # Near-zero hallucination tolerance
    "answer_relevance": 0.90,
    "context_precision": 0.85,
    "context_recall": 0.85,      # Cannot miss relevant information
}

# Internal tooling / developer documentation - slightly relaxed
THRESHOLDS_DEVELOPER = {
    "faithfulness": 0.80,
    "answer_relevance": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.65,
}
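One way to apply these profiles is to look one up by name and report each metric's shortfall. The profile names and the `failing_metrics` helper are hypothetical; the threshold values are the ones listed above.

```python
# Profiles mirror the three dictionaries above; the profile names are hypothetical.
THRESHOLD_PROFILES = {
    "general": {
        "faithfulness": 0.85, "answer_relevance": 0.80,
        "context_precision": 0.75, "context_recall": 0.70,
    },
    "high_stakes": {
        "faithfulness": 0.95, "answer_relevance": 0.90,
        "context_precision": 0.85, "context_recall": 0.85,
    },
    "developer": {
        "faithfulness": 0.80, "answer_relevance": 0.75,
        "context_precision": 0.70, "context_recall": 0.65,
    },
}


def failing_metrics(mean_scores: dict[str, float], profile: str) -> dict[str, float]:
    """Return {metric: shortfall} for every metric below the profile's threshold."""
    thresholds = THRESHOLD_PROFILES[profile]
    return {
        metric: round(threshold - mean_scores[metric], 4)
        for metric, threshold in thresholds.items()
        if metric in mean_scores and mean_scores[metric] < threshold
    }


# Hypothetical mean scores from one evaluation run.
scores = {
    "faithfulness": 0.88, "answer_relevance": 0.82,
    "context_precision": 0.78, "context_recall": 0.72,
}
print(failing_metrics(scores, "general"))      # empty: every general threshold is met
print(failing_metrics(scores, "high_stakes"))  # all four metrics fall short
```

The same run can pass one profile and fail another, so the profile choice belongs in configuration next to the deployment it describes.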

:::tip Start with Faithfulness and Context Precision

If you are implementing RAGAS for the first time and need to prioritize, start with faithfulness and context precision. Faithfulness catches hallucination - the most dangerous failure mode. Context precision catches retrieval noise - the most common infrastructure failure (document ingestion bugs, corrupted chunks, embedding quality issues). Both are reference-free, so you do not need to build a labeled dataset to start.

Add answer relevance next (reference-free, catches query distribution shift). Add context recall last - it requires building the golden test set with ground truth answers, which takes time but provides the most complete picture of retrieval quality.

:::

:::warning LLM-as-Judge Bias: The Self-Evaluation Problem

Using Claude to evaluate Claude introduces evaluator bias. The same model family tends to agree with itself - if claude-opus-4-6 generated the answer, claude-haiku-4-5-20251001 may score it more generously than an independent human evaluator would.

Mitigation strategies:

Use claim-level verification (faithfulness) rather than holistic quality judgment - more objective
Periodically validate your RAGAS scores against human evaluation on a sample of datapoints
Track correlation between RAGAS scores and downstream user satisfaction metrics - if they diverge, the evaluator is biased
Consider using a different model family for evaluation than for generation when model diversity is available

:::
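Validating the judge against human ratings can start with two numbers: how well the scores correlate, and whether the judge is systematically more generous. The sketch below computes both with a hand-written Pearson coefficient. The paired scores are invented for illustration, and what counts as an acceptable correlation or gap is a judgment call this lesson does not specify.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired scores for the same six datapoints (illustration only).
judge_scores = [0.95, 0.90, 0.60, 0.85, 0.40, 0.75]
human_scores = [0.90, 0.85, 0.50, 0.90, 0.30, 0.70]

r = pearson(judge_scores, human_scores)
# A positive gap means the LLM judge scores more generously than the humans.
mean_gap = sum(j - h for j, h in zip(judge_scores, human_scores)) / len(judge_scores)
print(f"correlation={r:.3f}, mean judge-minus-human gap={mean_gap:+.3f}")
```

A high correlation with a persistent positive gap suggests the judge ranks answers sensibly but inflates absolute scores, in which case thresholds should be calibrated against the human scale rather than the judge's.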

:::danger Metric Gaming: Optimizing Numbers Instead of Quality

As soon as a metric becomes a target, teams find ways to improve it without improving the underlying system. Common RAGAS gaming patterns:

Faithfulness gaming: shorten answers to reduce claims, making all claims verifiable but the answer incomplete
Context precision gaming: reduce K (number of retrieved chunks) to retrieve fewer but higher-precision chunks, degrading recall
Answer relevance gaming: make answers more generic so generated questions always resemble the original

Guard against gaming by:

Tracking all metrics simultaneously - gaming one typically degrades another
Tracking downstream user metrics alongside RAGAS metrics - gaming should not improve user satisfaction
Auditing metric improvements with human evaluation - verify that a metric improvement reflects a real quality improvement
Setting minimum thresholds for all metrics, not maximizing any single one

:::
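Tracking all metrics simultaneously can be partly automated by flagging runs where one metric improves while another degrades. This is a rough sketch: the function name and the 0.03 tolerance are assumptions, and a flagged pair is a prompt for human review, not proof of gaming.

```python
def suspicious_tradeoffs(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.03,
) -> list[tuple[str, str]]:
    """Return (improved_metric, degraded_metric) pairs between two runs.

    A pair is reported when one metric rose by more than `tolerance` while
    another fell by more than `tolerance`, which is the signature of
    optimizing one number at the expense of another.
    """
    deltas = {m: after[m] - before[m] for m in before if m in after}
    improved = [m for m, d in deltas.items() if d > tolerance]
    degraded = [m for m, d in deltas.items() if d < -tolerance]
    return [(up, down) for up in improved for down in degraded]


# Hypothetical runs: K was reduced, so precision rose while recall fell.
before = {"faithfulness": 0.86, "context_precision": 0.76, "context_recall": 0.74}
after = {"faithfulness": 0.87, "context_precision": 0.88, "context_recall": 0.61}
print(suspicious_tradeoffs(before, after))
```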

Interview Q&A

Q1: What are the four RAGAS metrics, and which failure mode does each one catch?

RAGAS defines four complementary metrics, each designed to catch a different RAG failure mode:

Faithfulness catches hallucination. It extracts factual claims from the generated answer and verifies each against the retrieved context. A low faithfulness score means the LLM is generating content that goes beyond - or contradicts - what the retrieved documents actually say. The algorithm is reference-free: it requires only the answer and the retrieved context, no ground truth.

Answer Relevance catches off-topic answers. It generates N questions from the answer using reverse generation, then measures how similar those questions are to the original question. If the answer drifts from the question, the reverse-generated questions drift with it. A low answer relevance score usually means the system retrieved plausible but wrong-domain documents and generated an answer that is faithful to them but does not address the question. Also reference-free.

Context Precision catches retrieval noise. It scores each retrieved chunk for binary relevance to the question and computes average precision (rank-aware). A low context precision score means the retriever is returning plausible-sounding but irrelevant chunks that dilute the context window and confuse the generator. Reference-free.

Context Recall catches retrieval misses. It extracts statements from the ground truth answer and checks whether each is supported by any retrieved chunk. A low context recall score means relevant documents exist in the corpus but were not retrieved. Requires ground truth.
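The rank-aware part of context precision is easiest to see in code. The sketch below implements the standard average precision formula over a ranked list of binary relevance verdicts; it assumes the per-chunk verdicts have already been produced by the judge, and it is not necessarily identical to the evaluator class used in this lesson.

```python
def average_precision(relevance: list[bool]) -> float:
    """Rank-aware average precision over a ranked list of relevance verdicts.

    For each rank k that holds a relevant chunk, take precision@k (relevant
    chunks in the top k divided by k), then average over the relevant chunks.
    Returns 0.0 when nothing is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0


# Same two relevant chunks, different ranks: earlier placement scores higher.
print(round(average_precision([True, True, False, False, False]), 3))  # 1.0
print(round(average_precision([False, False, False, True, True]), 3))  # 0.325
```

Both rankings contain two relevant chunks out of five, so plain precision would be 0.4 in each case; the rank-aware score separates a retriever that puts relevant chunks first from one that buries them.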

Q2: Why is the Answer Relevance metric computed by reverse-generating questions rather than by directly comparing the answer to the question?

Directly comparing the answer text to the question text for semantic similarity is unreliable. A relevant answer and an irrelevant answer can have very similar surface-level similarity to the question - both may contain question-related vocabulary - while differing fundamentally in whether they actually address the question.

The reverse generation approach is more discriminative. An answer that correctly addresses "What are the causes of API timeouts?" will generate questions like "Why do API timeouts occur?" and "What triggers gateway timeouts?" - close to the original. An answer that instead describes the API authentication flow will generate questions like "How does OAuth work with this API?" and "What is the JWT token structure?" - far from the original question.

The cosine similarity between the original question embedding and the reverse-generated question embeddings effectively measures how much the answer "talks about" the original question's topic, which is a better proxy for relevance than direct answer-question similarity.
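The scoring step described here reduces to a mean of cosine similarities. The sketch below uses toy three-dimensional vectors in place of real embeddings, which are an assumption made purely so the example runs without an embedding service; the function names are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding and
    the embeddings of the questions reverse-generated from the answer."""
    if not generated_vecs:
        return 0.0
    return sum(cosine(question_vec, v) for v in generated_vecs) / len(generated_vecs)


# Toy "embeddings" (hypothetical): axis 0 = timeouts, axis 1 = authentication.
question = [1.0, 0.0, 0.0]
on_topic = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]   # questions about timeouts
off_topic = [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]  # questions about authentication

print(f"on-topic answer:  {answer_relevance(question, on_topic):.3f}")
print(f"off-topic answer: {answer_relevance(question, off_topic):.3f}")
```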

Q3: When would Context Precision be high but Context Recall be low? What does this pattern indicate and how do you fix it?

This pattern - high precision, low recall - means the retrieval system is finding some relevant documents, and finding them accurately (low noise), but missing other relevant documents that also exist in the corpus.

Concrete scenario: a question asks about three distinct aspects of a topic. The retriever retrieves K=5 chunks, all of which are relevant to one aspect (high precision). But the corpus contains chunks about the other two aspects that were not retrieved (low recall). The user gets an accurate but incomplete answer.

Root causes and fixes:

Narrow embedding neighborhood: The query embedding only retrieves chunks that are very similar to the query surface form. Fix: use HyDE or multi-query retrieval to cover more of the relevant embedding space.
Low K: Retrieving only 3-5 chunks may miss tail-relevant documents. Fix: increase K, then use a reranker to filter to the best 5 from a larger candidate set.
Vocabulary gap: Relevant documents exist but use different terminology. The query embedding does not land near them. Fix: add synonym expansion or step-back prompting to bridge the vocabulary gap.
Chunking artifacts: Relevant information is split across chunk boundaries and neither half chunk is individually relevant enough to be retrieved. Fix: re-chunk with larger overlap or use parent-child retrieval.
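The concrete scenario above can be worked through numerically. This is a simplified, aspect-level illustration: each retrieved chunk is tagged with the aspect it covers, which is an assumption made for clarity and not how the evaluators in this lesson score chunks. The aspect names are hypothetical.

```python
def precision_and_recall(
    retrieved_aspects: list[str],
    required_aspects: set[str],
) -> tuple[float, float]:
    """Toy aspect-level precision and recall.

    precision: share of retrieved chunks that cover a required aspect.
    recall: share of required aspects covered by at least one retrieved chunk.
    """
    if not retrieved_aspects or not required_aspects:
        return 0.0, 0.0
    relevant = [a for a in retrieved_aspects if a in required_aspects]
    precision = len(relevant) / len(retrieved_aspects)
    recall = len(set(relevant)) / len(required_aspects)
    return precision, recall


required = {"causes", "diagnosis", "mitigation"}  # hypothetical aspects

# K=5 chunks, all about one aspect: perfect precision, one third recall.
narrow = ["causes"] * 5
print(precision_and_recall(narrow, required))

# After a broader retrieval: one off-topic chunk, but every aspect is covered.
broad = ["causes", "causes", "diagnosis", "mitigation", "pricing"]
print(precision_and_recall(broad, required))
```

The second retrieval trades a little precision for full recall, which is usually the better position for the generator as long as a reranker keeps the noise bounded.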

Q4: How would you build and maintain a golden test set for continuous RAG evaluation?

A golden test set is a static, manually validated collection of (question, context, ground_truth) triples. Construction:

Initial generation: Use Claude to generate diverse questions from representative corpus chunks, covering all major topic areas and question types (factual, conceptual, procedural, comparative). Generate more than you need - typically 2-3x your target size.
Manual curation: Have domain experts review each (question, ground_truth) pair. Discard questions with ambiguous answers, outdated information, or low difficulty. Annotate question types and topic areas for stratified analysis.
Coverage validation: Verify the test set covers the full distribution of production queries. Sample 100 real production queries and check that similar questions exist in the golden set. Gaps indicate under-represented topics.
Maintenance: Review the golden set when: (a) new documents are added to the corpus (add questions for the new content), (b) existing documents are updated (verify ground truths are still correct), (c) a new query type is observed frequently in production (add representative questions for that type). Never edit the golden set to make metrics look better - it must evolve to reflect real knowledge updates, not metric optimization.

Size recommendations: 100 questions minimum for early evaluation. 300+ for reliable regression detection (sufficient statistical power to detect 0.05 drops with p<0.05). 1000+ for production systems with multiple sub-domains that need separate monitoring.
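The size guidance can be sanity-checked with a standard power calculation. This sketch assumes a two-sided, two-sample z-test on mean metric scores with 80% power and treats the baseline and current runs as independent samples; the per-question standard deviations (`sigma`) are assumed values, so measure your own before relying on the result.

```python
import math
from statistics import NormalDist


def questions_needed(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Questions per run needed to detect a mean drop of `delta`,
    given a per-question score standard deviation of `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Assumed standard deviations of per-question metric scores.
n_low_variance = questions_needed(delta=0.05, sigma=0.15)
n_high_variance = questions_needed(delta=0.05, sigma=0.22)
print(n_low_variance, n_high_variance)
```

Under the assumed sigma of 0.22 the requirement lands close to the 300-question guidance. Noisier metrics need more questions, and a paired comparison on the same golden set would typically need fewer.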

Q5: A faithfulness metric drops from 0.88 to 0.71 overnight. How do you diagnose the root cause?

A 17-point drop overnight is almost certainly a system change, not natural drift. Systematic diagnostic approach:

Step 1: Scope the problem. Run faithfulness evaluation broken down by question type and topic area. If the drop is uniform across all types → system-wide issue. If it is concentrated in one topic → topic-specific issue (likely a document ingestion problem for that topic's documents).
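A small sketch of the Step 1 breakdown, assuming each evaluation row already carries a `topic` annotation and a `faithfulness` score. The field names and sample rows are illustrative, not taken from a specific library.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows: list[dict], key: str) -> dict[str, float]:
    """Average faithfulness per group (e.g. per topic or question type)."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["faithfulness"])
    return {group: mean(scores) for group, scores in groups.items()}


def drops_by_group(baseline: list[dict], current: list[dict], key: str) -> dict[str, float]:
    """Baseline mean minus current mean for every group present in both runs."""
    before = mean_by_group(baseline, key)
    after = mean_by_group(current, key)
    return {g: before[g] - after[g] for g in before if g in after}


# Hypothetical evaluation rows from yesterday's and today's runs.
baseline_rows = [
    {"topic": "billing", "faithfulness": 0.90},
    {"topic": "billing", "faithfulness": 0.88},
    {"topic": "auth", "faithfulness": 0.87},
    {"topic": "auth", "faithfulness": 0.89},
]
current_rows = [
    {"topic": "billing", "faithfulness": 0.50},
    {"topic": "billing", "faithfulness": 0.52},
    {"topic": "auth", "faithfulness": 0.88},
    {"topic": "auth", "faithfulness": 0.86},
]

drops = drops_by_group(baseline_rows, current_rows, key="topic")
for topic, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{topic}: drop of {drop:.2f}")
```

Here the drop is concentrated in one topic, which points toward that topic's documents rather than a system-wide prompt or model change.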

Step 2: Check system changes. Review the deployment log for the previous 24 hours. Prompt changes, model version changes, and document ingestion runs are the most common causes. A prompt that became more "creative" will produce answers that go beyond what the context says.

Step 3: Inspect failing examples. Pull the 10 lowest-faithfulness datapoints from the evaluation run. Manually read the question, the answer, and the retrieved context. Is the LLM generating information that contradicts the context? Or information that is absent from the context? These have different causes - contradiction suggests a prompt issue; absent information suggests retrieval degradation that confused the generator.

Step 4: Check document ingestion. If the drop correlates with a document ingestion run, inspect the recently ingested chunks. Look for: truncated content (byte-boundary corruption), garbled encoding, incorrect metadata, or duplicate chunks with conflicting information. Run a sample of the affected chunks through the embedding + retrieval pipeline to verify that they are retrieved for the queries they should answer and that the retrieved text is coherent.
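The chunk inspection in Step 4 can be partly automated. The checks below are heuristics under assumptions: the chunk schema (`id`, `text`, `metadata`), the required metadata keys, and the minimum length are all hypothetical, and the duplicate check only finds chunks that are identical after whitespace and case normalization, so conflicting near-duplicates still need manual review.

```python
import hashlib

REPLACEMENT_CHAR = "\ufffd"  # appears when bytes fail to decode cleanly
SENTENCE_ENDINGS = (".", "!", "?", ":", ")", "]", '"', "'")


def audit_chunk(chunk: dict, required_metadata=("source", "title"), min_chars: int = 40) -> list[str]:
    """Return a list of heuristic problem labels for one chunk."""
    problems = []
    text = chunk.get("text", "").strip()
    if len(text) < min_chars:
        problems.append("too_short")
    if REPLACEMENT_CHAR in text:
        problems.append("garbled_encoding")
    if not text.endswith(SENTENCE_ENDINGS):
        problems.append("possibly_truncated")
    metadata = chunk.get("metadata") or {}
    missing = [key for key in required_metadata if not metadata.get(key)]
    if missing:
        problems.append("missing_metadata:" + ",".join(missing))
    return problems


def find_duplicates(chunks: list[dict]) -> list[tuple[str, str]]:
    """Pairs of chunk IDs whose normalized text is identical."""
    seen: dict[str, str] = {}
    duplicates = []
    for chunk in chunks:
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((seen[digest], chunk["id"]))
        else:
            seen[digest] = chunk["id"]
    return duplicates


good = {"id": "c1", "text": "Refunds are processed within five business days of approval.",
        "metadata": {"source": "billing.md", "title": "Refunds"}}
bad = {"id": "c2", "text": "Refunds are proc\ufffdssed within five business da",
       "metadata": {"source": "billing.md"}}
dupe = {"id": "c3", "text": "Refunds  are processed within five business days of approval.",
        "metadata": {"source": "billing.md", "title": "Refunds"}}

bad_problems = audit_chunk(bad)
duplicate_pairs = find_duplicates([good, bad, dupe])
print(bad_problems, duplicate_pairs)
```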

Step 5: Check context precision correlation. If context precision also dropped, the retriever is returning lower-quality chunks, and the LLM is hallucinating to fill the gaps those chunks leave. Fix the retrieval first; faithfulness may recover without changing the generation prompt.

Q6: How do you validate that RAGAS metrics are actually correlated with user satisfaction in your specific application?

RAGAS metrics are validated on general benchmarks, but their correlation with user satisfaction in your specific application must be verified empirically.

Concurrent measurement: For 4-6 weeks, run RAGAS evaluation on sampled queries and collect user satisfaction ratings (thumbs up/down, CSAT scores, task completion rates) on the same queries. Build a correlation matrix between each RAGAS metric and each user satisfaction signal.
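A minimal sketch of the correlation matrix, assuming each sampled query has been joined into one record holding its RAGAS scores and its user signals. The field names and the six sample records are made up for illustration; Pearson correlation against a binary thumbs-up signal is used here as a simple first look, not as the only valid statistic.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(records: list[dict], metrics: list[str], signals: list[str]) -> dict:
    """Map (metric, signal) -> Pearson correlation across all records."""
    return {
        (m, s): pearson([r[m] for r in records], [r[s] for r in records])
        for m in metrics
        for s in signals
    }


# Hypothetical joined records: RAGAS scores plus a thumbs-up flag per query.
records = [
    {"faithfulness": 0.95, "answer_relevance": 0.7, "thumbs_up": 1},
    {"faithfulness": 0.90, "answer_relevance": 0.6, "thumbs_up": 1},
    {"faithfulness": 0.85, "answer_relevance": 0.8, "thumbs_up": 1},
    {"faithfulness": 0.40, "answer_relevance": 0.7, "thumbs_up": 0},
    {"faithfulness": 0.35, "answer_relevance": 0.6, "thumbs_up": 0},
    {"faithfulness": 0.30, "answer_relevance": 0.8, "thumbs_up": 0},
]

matrix = correlation_matrix(records, ["faithfulness", "answer_relevance"], ["thumbs_up"])
for (metric, signal), r in matrix.items():
    print(f"{metric} vs {signal}: r={r:.2f}")
```

In this toy data faithfulness tracks thumbs-up closely while answer relevance does not. On real data, a metric with near-zero correlation to every satisfaction signal is a candidate for recalibration.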

Expected correlations: Faithfulness should correlate with user trust and with users' tendency to report misinformation. Context recall should correlate with answer completeness ratings. Answer relevance should correlate with task completion rates.

Calibration signals: If faithfulness scores 0.9 but users frequently report wrong answers, your evaluator is too permissive - tighten the claim verification prompt. If faithfulness scores 0.7 and users are happy, your evaluator may be too strict on hedging language or on claims that are implicitly supported by context.

Action: Run this validation quarterly. User query distributions shift, corpus content evolves, and the relationship between proxy metrics and ground truth satisfaction can drift. A metric that was well-calibrated six months ago may need recalibration after a significant product change.

Summary

RAG evaluation is an operational discipline, not a one-time check. The team that shipped without evaluation discovered a 16-point satisfaction drop three months later with no idea what had changed, then spent two weeks diagnosing what continuous evaluation would have caught in hours.

The four RAGAS metrics cover all major RAG failure modes:

Metric	Failure Mode	Reference-Free	Fix Direction
Faithfulness	Hallucination	Yes	Tighten prompt, improve grounding instructions
Answer Relevance	Off-topic answers	Yes	Fix retrieval routing, address query distribution shift
Context Precision	Retrieval noise	Yes	Fix document ingestion, improve reranking, reduce K
Context Recall	Retrieval misses	No (needs ground truth)	Fix embeddings, chunking, query expansion

Deploy continuous evaluation from day one. Run on sampled production queries daily. Compare against a baseline. Alert on drops exceeding the regression threshold. Build the golden test set incrementally - start with 50 questions, grow to 300. Instrument with dashboards. Review alerts within 24 hours.
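The "compare against a baseline, alert on drops" step can be sketched in a few lines. The 0.05 threshold and the metric values are assumptions for illustration; a production version would load the baseline from storage and route alerts to a paging or chat system.

```python
def detect_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.05,
) -> dict[str, float]:
    """Return the metrics whose score dropped by more than `threshold`,
    mapped to the size of the drop."""
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in current:
            continue  # metric not computed in this run
        drop = base_score - current[metric]
        if drop > threshold:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical daily run compared against the stored baseline.
baseline_scores = {"faithfulness": 0.88, "answer_relevance": 0.84,
                   "context_precision": 0.80, "context_recall": 0.76}
current_scores = {"faithfulness": 0.71, "answer_relevance": 0.83,
                  "context_precision": 0.79, "context_recall": 0.75}

alerts = detect_regressions(baseline_scores, current_scores)
for metric, drop in alerts.items():
    print(f"ALERT: {metric} dropped by {drop:.2f}")
```

A fixed threshold is the simplest possible rule. Pairing it with a significance test on the per-question scores reduces false alarms when the sampled query set is small.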

The cost of continuous RAGAS evaluation is approximately $0.009 per 50-query evaluation run. The cost of a month of silent degradation is measured in user churn, support tickets, and the engineering time to diagnose a problem that monitoring would have flagged on day one.
