RAG Evaluation

The Silent Degradation Problem

Six months after shipping their RAG-powered customer support assistant, the engineering team noticed something in their satisfaction surveys: scores had drifted down 12 points over three months. No code changes. No model changes. No infrastructure changes. Just a gradual, invisible erosion.

The investigation took two weeks. The company's product documentation had grown from 200 to 800 pages as new features shipped, and the chunk count in their vector index had grown 4x. But the retrieval setup had been tuned and validated once, on the original 200 pages, and never re-evaluated on the expanded corpus. Retrieval precision had dropped as the corpus grew denser - more topically similar chunks meant more retrieval noise. The LLM was generating answers from increasingly irrelevant context, and the "I don't know" fallback wasn't triggering because the retrieved context always contained something plausible, just not the right thing.

The fix required a proper evaluation pipeline that could detect this drift. Which meant they finally had to build what they should have built before shipping: a systematic way to measure whether the RAG system was working.

This lesson covers how to build that evaluation pipeline - the metrics, the frameworks, the tooling, and the operational practices.

The Three Failure Modes of RAG Systems

RAG can fail at three distinct points, and each requires different metrics:

Retrieval failure: The retriever fetched chunks that don't contain the information needed to answer the question. The LLM gets bad context and can either hallucinate or correctly say it doesn't know.
Faithfulness failure: The retriever found the right chunks, but the LLM ignored them or added information from its parametric memory that contradicts or isn't supported by the context.
Answer relevancy failure: The system retrieved relevant context, the LLM faithfully used it, but the answer doesn't actually address what the user asked.

You need metrics for all three. A system can score well on retrieval but fail on faithfulness. A system can be perfectly faithful but generate answers that miss the point.

The RAG Triad

The RAG Triad (popularized by TruLens/TruEra) captures these three dimensions:

Answer Relevance: Does the answer address the question asked?
Context Relevance: Are the retrieved chunks relevant to the question?
Groundedness (Faithfulness): Is every claim in the answer supported by the retrieved context?

These three metrics, together, give a complete picture of system quality. A high-quality RAG system scores well on all three.

RAGAS: The Standard Evaluation Framework

RAGAS (RAG Assessment) - developed by Shahul ES et al. (2023) - is the most widely adopted framework for automated RAG evaluation. It uses LLMs to evaluate LLM systems (LLM-as-judge), enabling scalable evaluation without human annotation for every query.

Faithfulness

Faithfulness measures whether each claim in the generated answer is supported by the retrieved context. It's computed by:

Breaking the generated answer into individual factual claims using an LLM
For each claim, checking if it can be inferred from the retrieved context
Faithfulness = (claims supported by context) / (total claims in answer)

A faithfulness score of 1.0 means every claim in the answer is grounded in the retrieved context. A score of 0.6 means 40% of claims are potentially hallucinated.
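The ratio in the last step can be sketched in a few lines. This is a minimal illustration, not RAGAS's implementation: the `ClaimVerdict` type and the hard-coded verdicts are assumptions standing in for the LLM's claim-extraction and verification calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimVerdict:
    """One claim extracted from the answer plus the judge's verdict (hypothetical type)."""
    claim: str
    supported: bool  # True if the claim can be inferred from the retrieved context


def faithfulness_score(verdicts: List[ClaimVerdict]) -> float:
    """Supported claims divided by total claims; 0.0 when the answer has no claims."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hand-written verdicts standing in for LLM judgments
verdicts = [
    ClaimVerdict("Unused items can be returned within 30 days.", True),
    ClaimVerdict("Items must be in original packaging.", True),
    ClaimVerdict("Returns are accepted for unused items only.", True),
    ClaimVerdict("Express returns take 2 days.", False),
    ClaimVerdict("Refunds are issued as store credit.", False),
]
print(faithfulness_score(verdicts))  # 0.6
```

Three of the five claims are supported, so the score is 0.6 and the two unsupported claims are the ones to inspect.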

Answer Relevancy

Measures whether the generated answer addresses the question. Computed by:

Generating multiple hypothetical questions that the answer would address
Measuring cosine similarity between the original question and these generated questions
High similarity = the answer addresses the original question

This is an indirect measure: a good answer generates hypothetical questions close to the original.
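The similarity step can be sketched as follows. The three-dimensional vectors are toy stand-ins for real embeddings, and `answer_relevancy_score` is a hypothetical helper name; in RAGAS an LLM generates the questions and an embedding model produces the vectors.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_vec: Sequence[float],
    generated_question_vecs: List[Sequence[float]],
) -> float:
    """Mean similarity between the original question and the questions
    reverse-engineered from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0, 0.0]                        # toy embedding of the user question
on_topic = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]     # generated questions close to it
off_topic = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # generated questions unrelated to it

print(answer_relevancy_score(original, on_topic))
print(answer_relevancy_score(original, off_topic))
```

An answer that drifts off topic yields generated questions whose embeddings point away from the original, which pulls the mean similarity down.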

Context Precision

Measures whether the retrieved context chunks are actually relevant to answering the question. Computed by:

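For each retrieved chunk, judge (via LLM) whether it's useful for answering the question
Compute precision@k at every rank k that holds a relevant chunk
Context Precision = (sum of precision@k over the relevant positions) / (number of relevant chunks retrieved)
Because precision@k is higher when relevant chunks sit near the top, this rank-weighted form rewards relevant chunks appearing at the top of the ranked list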
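To make the rank weighting concrete, here is a minimal sketch that averages precision@k over the positions holding a relevant chunk, which is how the RAGAS documentation describes the metric. The boolean list is an assumption standing in for the per-chunk LLM relevance verdicts.

```python
from typing import List


def context_precision(relevance: List[bool]) -> float:
    """Rank-weighted precision over a ranked list of per-chunk relevance verdicts.

    For every position k that holds a relevant chunk, take precision@k,
    then average those values over the number of relevant chunks.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant position
    return precision_sum / hits if hits else 0.0


# Same two relevant chunks, different ranking
print(context_precision([True, True, False, False]))  # 1.0
print(context_precision([False, False, True, True]))  # (1/3 + 2/4) / 2
```

Both rankings retrieve two relevant chunks out of four, but the second buries them at ranks 3 and 4 and scores well under half of the first.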

Context Recall

Measures whether all the information needed to answer the question is present in the retrieved context. Requires a ground truth answer. Computed by:

Break the ground truth answer into individual claims
For each claim, check if it can be attributed to the retrieved context
Context Recall = (claims attributable to context) / (total claims in ground truth)

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": [
        "What is the refund window for unused items?",
        "How long does express shipping take?",
    ],
    "answer": [
        "Unused items can be returned within 30 days of purchase.",
        "Express shipping delivers in 1-2 business days.",
    ],
    "contexts": [
        # List of retrieved chunks for each question
        [
            "Our refund policy allows returns within 30 days of purchase for unused items.",
            "Items must be in original packaging and unused condition."
        ],
        [
            "Standard shipping takes 3-5 business days.",
            "Express shipping takes 1-2 business days for all domestic orders.",
        ],
    ],
    # Ground truth answers (needed for context_recall)
    "ground_truth": [
        "Customers can return unused items within 30 days.",
        "Express shipping typically delivers within 1-2 business days.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(result)
# Example output (illustrative values; real scores vary with the judge model and run):
# {'faithfulness': 0.95, 'answer_relevancy': 0.89,
#  'context_precision': 0.88, 'context_recall': 0.92}

# Per-question breakdown
result.to_pandas()

Configuring RAGAS with Specific Models

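from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# `dataset` is the Dataset built in the previous example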

# Use a cheaper model for evaluation (gpt-4o-mini works well for RAGAS)
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
eval_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=eval_llm,
    embeddings=eval_embeddings,
)

Building a Full RAG Evaluation Pipeline

import json
from typing import List, Dict, Any
from openai import OpenAI
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

client = OpenAI()

class RAGEvaluator:
    """End-to-end RAG evaluation pipeline."""

    def __init__(self, rag_system, eval_llm_model: str = "gpt-4o-mini"):
        self.rag = rag_system
        self.eval_model = eval_llm_model

    def generate_answer(self, question: str) -> Dict[str, Any]:
        """Run the RAG system and capture intermediate results."""
        # Retrieve
        retrieved_docs = self.rag.retrieve(question, top_k=5)
        context_chunks = [doc["text"] for doc in retrieved_docs]

        # Generate
        context_str = "\n\n".join(context_chunks)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Answer based only on the provided context. "
                        "If context doesn't contain the answer, say so."
                    )
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context_str}\n\nQuestion: {question}"
                }
            ],
            temperature=0,
        )
        answer = response.choices[0].message.content

        return {
            "question": question,
            "answer": answer,
            "contexts": context_chunks,
        }

    def evaluate_on_testset(self, test_cases: List[Dict]) -> Dict:
        """
        test_cases: list of {'question': str, 'ground_truth': str}
        Returns RAGAS metrics.
        """
        results = []
        for tc in test_cases:
            result = self.generate_answer(tc["question"])
            result["ground_truth"] = tc["ground_truth"]
            results.append(result)

        # Build RAGAS dataset
        dataset = Dataset.from_dict({
            "question": [r["question"] for r in results],
            "answer": [r["answer"] for r in results],
            "contexts": [r["contexts"] for r in results],
            "ground_truth": [r["ground_truth"] for r in results],
        })

        ragas_result = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        )

        return {
            "metrics": dict(ragas_result),
            "details": results,
        }

    def diagnostic_report(self, test_cases: List[Dict]) -> str:
        """Generate a human-readable diagnostic report."""
        eval_result = self.evaluate_on_testset(test_cases)
        metrics = eval_result["metrics"]

        report_lines = [
            "=== RAG Evaluation Report ===",
            f"Questions evaluated: {len(test_cases)}",
            "",
            "RAGAS Metrics:",
            f"  Faithfulness:       {metrics.get('faithfulness', 0):.3f}",
            f"  Answer Relevancy:   {metrics.get('answer_relevancy', 0):.3f}",
            f"  Context Precision:  {metrics.get('context_precision', 0):.3f}",
            f"  Context Recall:     {metrics.get('context_recall', 0):.3f}",
            "",
        ]

        # This report uses aggregate scores only. For per-question failures,
        # call .to_pandas() on the RAGAS result and sort by the metric of interest.

        report_lines.append("Diagnosis:")
        if metrics.get("context_recall", 1.0) < 0.7:
            report_lines.append("  - LOW RECALL: Retriever is missing relevant chunks.")
            report_lines.append("    → Check chunking strategy, embedding model, or index quality.")
        if metrics.get("context_precision", 1.0) < 0.7:
            report_lines.append("  - LOW PRECISION: Too many irrelevant chunks retrieved.")
            report_lines.append("    → Reduce top_k, add reranking, or improve metadata filtering.")
        if metrics.get("faithfulness", 1.0) < 0.8:
            report_lines.append("  - LOW FAITHFULNESS: LLM adding content not in context.")
            report_lines.append("    → Strengthen system prompt, reduce LLM temperature, add citation requirement.")
        if metrics.get("answer_relevancy", 1.0) < 0.75:
            report_lines.append("  - LOW RELEVANCY: Answers not addressing questions.")
            report_lines.append("    → Review generation prompt, consider query transformation.")

        return "\n".join(report_lines)

LLM-as-Judge: Building Custom Evaluators

RAGAS metrics are powerful but sometimes you need custom evaluation for domain-specific requirements. LLM-as-judge lets you define arbitrary evaluation criteria:

import json
from typing import Any, Dict

from openai import OpenAI

client = OpenAI()

def judge_faithfulness(
    question: str,
    answer: str,
    context: str,
    model: str = "gpt-4o-mini"
) -> Dict[str, Any]:
    """Custom faithfulness judge."""
    prompt = f"""You are evaluating whether an AI assistant's answer is faithful to provided context.

Context:
{context}

Question: {question}
Answer: {answer}

Evaluate faithfulness: Does the answer contain ONLY information that can be found in or reasonably inferred from the context? No external knowledge should be added.

Respond with JSON only:
{{
  "faithful": true/false,
  "score": 0-10,
  "unfaithful_claims": ["claim1 that isn't in context", ...],
  "reasoning": "brief explanation"
}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )

    return json.loads(response.choices[0].message.content)


def judge_completeness(
    question: str,
    answer: str,
    ground_truth: str,
    model: str = "gpt-4o-mini"
) -> Dict[str, Any]:
    """Judge whether answer covers all key points from ground truth."""
    prompt = f"""Compare an AI answer to the ground truth for completeness.

Question: {question}
Ground Truth: {ground_truth}
AI Answer: {answer}

Does the AI answer cover all key information from the ground truth?

Respond with JSON only:
{{
  "score": 0-10,
  "missing_points": ["point1 from ground truth not in answer", ...],
  "extra_points": ["information in answer not in ground truth"],
  "reasoning": "brief explanation"
}}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Usage
result = judge_faithfulness(
    question="What is the refund window?",
    answer="Customers can return items within 30 days. Express returns take 2 days.",
    context="Unused items can be returned within 30 days of purchase."
)
print(result)
# {"faithful": false, "score": 5,
#  "unfaithful_claims": ["Express returns take 2 days"],
#  "reasoning": "The context doesn't mention express returns"}
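Judge models occasionally return malformed or out-of-range JSON, so it helps to validate the response before counting it as a score. The sketch below is one way to do that for the faithfulness schema above; `parse_judge_output` is a hypothetical helper, and dividing by 10 to get a 0-1 score is an assumption made so custom scores line up with RAGAS-style metrics.

```python
import json
from typing import Any, Dict


def parse_judge_output(raw: str) -> Dict[str, Any]:
    """Parse and sanity-check a faithfulness judge response.

    Raises ValueError on malformed output so a bad judgment is never
    silently counted as a score.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score must be a number, got {score!r}")
    if not 0 <= score <= 10:
        raise ValueError(f"score must be in [0, 10], got {score!r}")

    claims = data.get("unfaithful_claims", [])
    if not isinstance(claims, list):
        raise ValueError("unfaithful_claims must be a list")

    return {
        "faithful": bool(data.get("faithful", False)),
        "score": score / 10,  # normalize to [0, 1] (assumption, see lead-in)
        "unfaithful_claims": [str(c) for c in claims],
        "reasoning": str(data.get("reasoning", "")),
    }


raw = (
    '{"faithful": false, "score": 5, '
    '"unfaithful_claims": ["Express returns take 2 days"], '
    '"reasoning": "The context does not mention express returns"}'
)
parsed = parse_judge_output(raw)
print(parsed["score"])  # 0.5
```

Raising on bad output, rather than defaulting to 0, keeps parse failures out of the averages; they can be counted and reported separately.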

Building a Golden Dataset

A golden dataset is a curated set of (question, ground truth answer, relevant document IDs) triples. It's the foundation of your offline evaluation. Without it, you can't measure improvements or regressions.

How to build it:

Method 1: Human annotation (highest quality, expensive)

Sample 200-500 representative queries from your user logs (or hypothesize realistic queries)
Have domain experts write ground truth answers for each query
Have a second expert mark which document chunks contain the relevant information
Use this as your eval set

Method 2: LLM generation + human review (practical at scale)

def generate_golden_questions(
    documents: List[Dict[str, str]],
    n_per_document: int = 3,
    model: str = "gpt-4o",
) -> List[Dict]:
    """Generate QA pairs from documents using an LLM."""
    golden_pairs = []

    for doc in documents:
        prompt = f"""Generate {n_per_document} realistic questions that this document excerpt answers.
For each question, provide the ground truth answer extractable from the document.

Document:
{doc['text']}

Respond with JSON:
{{
  "pairs": [
    {{"question": "...", "answer": "...", "answer_span": "exact quote from document"}},
    ...
  ]
}}"""

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            response_format={"type": "json_object"},
        )

        result = json.loads(response.choices[0].message.content)
        for pair in result.get("pairs", []):
            golden_pairs.append({
                "question": pair["question"],
                "ground_truth": pair["answer"],
                "source_doc_id": doc["id"],
                "answer_span": pair.get("answer_span", ""),
            })

    return golden_pairs

# After generation, human review: read 20-30 pairs, remove bad ones
# Mark which pairs are actually answerable from the document
# Remove duplicates and trivially easy questions
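Part of that review can be automated before a human reads anything. The sketch below drops pairs whose `answer_span` is not actually a quote from the source document and removes duplicate questions. It assumes the pair fields produced by `generate_golden_questions` above; `filter_golden_pairs` and the `doc_texts` mapping (document ID to raw text) are hypothetical names. It is a pre-filter, not a replacement for human review.

```python
from typing import Dict, List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def filter_golden_pairs(
    pairs: List[Dict[str, str]],
    doc_texts: Dict[str, str],
) -> List[Dict[str, str]]:
    """Keep pairs whose answer_span appears in the source document; drop duplicate questions."""
    seen_questions = set()
    kept = []
    for pair in pairs:
        span = _normalize(pair.get("answer_span", ""))
        source = _normalize(doc_texts.get(pair["source_doc_id"], ""))
        if not span or span not in source:
            continue  # span missing or not a quote from the document
        question_key = _normalize(pair["question"])
        if question_key in seen_questions:
            continue  # duplicate question
        seen_questions.add(question_key)
        kept.append(pair)
    return kept


doc_texts = {"d1": "Our refund policy allows returns within 30 days of purchase for unused items."}
pairs = [
    {"question": "What is the refund window?", "ground_truth": "30 days",
     "source_doc_id": "d1", "answer_span": "returns within 30 days of purchase"},
    {"question": "what is the  refund window?", "ground_truth": "30 days",
     "source_doc_id": "d1", "answer_span": "returns within 30 days of purchase"},
    {"question": "How fast is express shipping?", "ground_truth": "1-2 days",
     "source_doc_id": "d1", "answer_span": "1-2 business days"},
]
kept = filter_golden_pairs(pairs, doc_texts)
print(len(kept))  # 1
```

The second pair is a duplicate after normalization, and the third quotes a span that does not occur in the document, which usually signals a hallucinated ground truth.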

Golden dataset size guidelines:

| System Scale | Minimum Eval Set | Recommended |
| --- | --- | --- |
| MVP/prototype | 50 questions | 100 questions |
| Production v1 | 200 questions | 500 questions |
| Mature system | 500 questions | 1000+ questions |
| Multi-domain | 100+ per domain | 300+ per domain |

Online Evaluation: Production Monitoring

Offline eval tells you if the system works. Online eval tells you if it continues to work.

User feedback signals:

Thumbs up/down on answers
"Was this helpful?" binary feedback
Follow-up clarifying questions (signals the original answer was insufficient)
Answer copy events (user copied the answer to use it - strong positive signal)
Conversation abandonment (user gave up - strong negative signal)

Automated production metrics:

import logging
from datetime import datetime
from typing import Dict, List, Optional

class RAGMonitor:
    """Log RAG pipeline metrics to your observability stack."""

    def __init__(self, logger=None):
        self.logger = logger or logging.getLogger("rag_monitor")

    def log_query(
        self,
        query_id: str,
        question: str,
        retrieved_chunks: List[Dict],
        answer: str,
        latency_ms: float,
        model: str,
        user_id: Optional[str] = None,
    ):
        # Log retrieval metrics; guard against empty retrievals so avg/min don't raise
        scores = [c["score"] for c in retrieved_chunks]
        self.logger.info("rag_query", extra={
            "query_id": query_id,
            "question_length": len(question.split()),
            "num_retrieved": len(retrieved_chunks),
            "avg_retrieval_score": sum(scores) / len(scores) if scores else None,
            "min_retrieval_score": min(scores) if scores else None,
            "answer_length": len(answer.split()),
            "latency_ms": latency_ms,
            "model": model,
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user_id,
        })

    def log_feedback(
        self,
        query_id: str,
        thumbs_up: bool,
        feedback_text: Optional[str] = None,
    ):
        self.logger.info("rag_feedback", extra={
            "query_id": query_id,
            "thumbs_up": thumbs_up,
            "has_feedback_text": bool(feedback_text),
            "timestamp": datetime.utcnow().isoformat(),
        })

    def compute_drift_alert(self, recent_scores: List[float], baseline_mean: float) -> bool:
        """Alert if recent retrieval scores drift below baseline."""
        if len(recent_scores) < 100:
            return False
        recent_mean = sum(recent_scores) / len(recent_scores)
        return recent_mean < baseline_mean * 0.9  # 10% degradation threshold

Key production metrics to track:

| Metric | How to Measure | Alert Threshold |
| --- | --- | --- |
| Average retrieval score | Mean cosine similarity of retrieved chunks | Drop >10% from baseline |
| "No answer" rate | % of responses saying "I don't know" | Rise >2x from baseline |
| Response length trend | Mean word count of answers | Drop >20% may indicate insufficient context |
| Thumbs-down rate | User negative feedback / total queries | Rise >5 percentage points |
| P95 latency | 95th percentile end-to-end latency | Exceed SLA threshold |
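The thresholds in this table translate directly into an alert check. The following is a sketch under stated assumptions: `MetricsSnapshot`, `check_alerts`, and the alert names are hypothetical, and the snapshots would come from whatever aggregation your observability stack provides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricsSnapshot:
    """Aggregated production metrics for one time window (hypothetical type)."""
    avg_retrieval_score: float  # mean similarity of retrieved chunks
    no_answer_rate: float       # fraction of "I don't know" responses
    mean_answer_words: float    # mean word count of answers
    thumbs_down_rate: float     # negative feedback / total queries
    p95_latency_ms: float


def check_alerts(
    baseline: MetricsSnapshot,
    current: MetricsSnapshot,
    latency_sla_ms: float,
) -> List[str]:
    """Return the names of every alert threshold the current window crosses."""
    alerts = []
    if current.avg_retrieval_score < baseline.avg_retrieval_score * 0.90:
        alerts.append("retrieval_score_drop")   # drop >10% from baseline
    if current.no_answer_rate > baseline.no_answer_rate * 2:
        alerts.append("no_answer_rate_rise")    # rise >2x from baseline
    if current.mean_answer_words < baseline.mean_answer_words * 0.80:
        alerts.append("answer_length_drop")     # drop >20%
    if current.thumbs_down_rate - baseline.thumbs_down_rate > 0.05:
        alerts.append("thumbs_down_rise")       # rise >5 percentage points
    if current.p95_latency_ms > latency_sla_ms:
        alerts.append("latency_sla_breach")
    return alerts


baseline = MetricsSnapshot(0.78, 0.04, 60.0, 0.08, 1800.0)
current = MetricsSnapshot(0.69, 0.05, 58.0, 0.15, 1900.0)
print(check_alerts(baseline, current, latency_sla_ms=2500.0))
# ['retrieval_score_drop', 'thumbs_down_rise']
```

Returning every crossed threshold, rather than a single boolean, makes it easier to see when several signals move together.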

TruLens: Alternative Evaluation Framework

TruLens (by TruEra, now Snowflake) is an alternative to RAGAS that provides similar metrics with a different interface and deeper LangChain integration:

import numpy as np
from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.feedback.provider import OpenAI as TruOpenAI

# Written for the trulens_eval package wrapping a LangChain app.
# Class and method names have changed between TruLens releases,
# so check them against your installed version.
tru = Tru()
openai_provider = TruOpenAI()

# Selector for the retrieved context inside the LangChain app
context = TruChain.select_context(your_rag_chain)

# Define feedback functions
f_groundedness = (
    Feedback(openai_provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())         # all retrieved chunks, judged together
    .on_output()                   # evaluates final answer
)

f_answer_relevance = (
    Feedback(openai_provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()                    # uses question
    .on_output()                   # evaluates answer
)

f_context_relevance = (
    Feedback(openai_provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()                    # uses question
    .on(context)                   # evaluates each retrieved chunk
    .aggregate(np.mean)
)

# Wrap your RAG chain with TruLens instrumentation
tru_rag = TruChain(
    your_rag_chain,
    app_id="my-rag-v1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance]
)

# Run queries through the instrumented chain
with tru_rag as recording:
    response = your_rag_chain.invoke("What is the refund policy?")

# View results in TruLens dashboard
tru.run_dashboard()  # http://localhost:8501

NDCG and Retrieval-Specific Metrics

For evaluating the retrieval component alone (separate from generation), classical IR metrics are useful:

Recall@k: Fraction of relevant documents that appear in top-k results. $\text{Recall}@k = \frac{|\text{relevant} \cap \text{top-}k|}{|\text{relevant}|}$

Precision@k: Fraction of top-k results that are relevant. $\text{Precision}@k = \frac{|\text{relevant} \cap \text{top-}k|}{k}$

NDCG@k (Normalized Discounted Cumulative Gain): Rewards relevant results appearing higher in the ranking.
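With binary relevance, the standard definition can be written out as follows, where $rel_i$ is 1 if the document at rank $i$ is relevant and 0 otherwise, and $R$ is the set of relevant documents for the query:

$$\text{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \text{IDCG}@k = \sum_{i=1}^{\min(k, |R|)} \frac{1}{\log_2(i + 1)}, \qquad \text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}$$

IDCG is the DCG of a perfect ranking in which every relevant document appears first, so NDCG@k falls between 0 and 1.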

from typing import Dict, List

import numpy as np

def ndcg_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Compute NDCG@k with binary relevance."""
    relevant = set(relevant_ids)

    # DCG of the actual ranking: each relevant hit is discounted by log2(rank + 1)
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_ids[:k]):
        if doc_id in relevant:
            dcg += 1.0 / np.log2(i + 2)  # i is 0-indexed, so rank = i + 1

    # Ideal DCG: all relevant documents (up to k) ranked first.
    # Base this on the golden relevant set, not on what was retrieved;
    # otherwise relevant documents the retriever missed would not lower the score.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))

    if idcg == 0:
        return 0.0
    return float(dcg / idcg)

def recall_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    relevant = set(relevant_ids)
    retrieved_top_k = set(retrieved_ids[:k])
    return len(relevant & retrieved_top_k) / len(relevant) if relevant else 0.0


# Evaluate retrieval on your golden dataset
def evaluate_retrieval(retriever, golden_dataset: List[Dict]) -> Dict:
    metrics = {"ndcg@5": [], "recall@5": [], "recall@10": []}

    for item in golden_dataset:
        retrieved = retriever.search(item["question"], top_k=10)
        retrieved_ids = [r["id"] for r in retrieved]
        relevant_ids = item["relevant_doc_ids"]  # from golden dataset

        metrics["ndcg@5"].append(ndcg_at_k(retrieved_ids, relevant_ids, k=5))
        metrics["recall@5"].append(recall_at_k(retrieved_ids, relevant_ids, k=5))
        metrics["recall@10"].append(recall_at_k(retrieved_ids, relevant_ids, k=10))

    return {k: np.mean(v) for k, v in metrics.items()}
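Precision@k is defined above but has no code counterpart, so here is a minimal sketch that follows the same conventions as `recall_at_k`. The document IDs are made up for illustration.

```python
from typing import List


def precision_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (the denominator is k)."""
    if k <= 0:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked retriever output
relevant = ["d1", "d3", "d8"]               # golden relevant set
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
```

Two of the five retrieved documents are relevant, so precision@5 is 0.4, while recall@5 for the same query is 2/3 because one relevant document (`d8`) was never retrieved. Reporting both shows whether the retriever is noisy, incomplete, or both.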

Evaluation-Driven Development for RAG

The right development workflow: evaluate first, then optimize.

Production: When to Re-Evaluate

Trigger re-evaluation when:

Adding more than 20% new content to the knowledge base
Switching embedding model, chunk size, or retrieval algorithm
Upgrading the generation LLM
Online metrics show 10%+ degradation in user satisfaction
Monthly scheduled re-evaluation (detect slow drift)

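Cost of RAGAS evaluation: At $0.15/1M input tokens (GPT-4o-mini) with 500 test questions averaging 1500 tokens each (question + context + answer), evaluation costs roughly $0.11 per run (500 × 1,500 = 750,000 input tokens). This counts a single pass over the input; RAGAS makes several judge calls per question and output tokens are billed too, so the real figure is a small multiple of that. Even so, daily evaluation runs are affordable.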
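The arithmetic is easy to parameterize for your own dataset size and model pricing. In this sketch `estimate_eval_cost` and its `calls_per_question` parameter are hypothetical; the price is the figure quoted above and may have changed.

```python
def estimate_eval_cost(
    num_questions: int,
    tokens_per_question: int,
    usd_per_million_input_tokens: float,
    calls_per_question: int = 1,
) -> float:
    """Estimated input-token cost in USD for one evaluation run.

    calls_per_question is an assumed multiplier for judge calls that
    re-send the same question, context, and answer.
    """
    total_tokens = num_questions * tokens_per_question * calls_per_question
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


cost = estimate_eval_cost(500, 1500, 0.15)
print(f"${cost:.4f}")  # $0.1125
```

Setting `calls_per_question` to the number of judge calls you observe per sample gives a more realistic figure than the single-pass estimate.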

Common Mistakes

:::danger Evaluating Only Generation, Not Retrieval

Many teams measure whether LLM answers are good (faithfulness, answer relevancy) without measuring whether retrieval is good (context precision, context recall). You can have perfect generation on top of terrible retrieval - the LLM correctly says "I don't have information about that" because nothing relevant was retrieved. Measure retrieval separately. If context recall is below 0.7, fix retrieval before optimizing generation.

:::

:::danger Using the Same LLM for Generation and Evaluation

Using GPT-4o to generate answers and GPT-4o to evaluate faithfulness risks self-preference bias - a judge model tends to rate its own outputs higher. Use different models, ideally from different providers, for generation and evaluation. If you only have OpenAI access, at minimum use gpt-4o for generation and gpt-4o-mini for evaluation.

:::

:::warning Golden Dataset Contamination

If you generate golden questions from the exact chunks in your index, the evaluation is biased toward your current chunking strategy: every question is answerable from a single chunk by construction. The opposite problem also exists: questions generated from full documents may not be answerable from any individual chunk. Generate golden questions from raw documents before chunking, then verify they're answerable from the chunked index - not just from the original document.

:::

:::warning Not Segmenting Eval by Query Type

A single aggregate faithfulness score of 0.85 hides important variation. Your system may be perfect on simple factual queries (faithfulness 0.98) but terrible on complex multi-hop questions (faithfulness 0.60). Always segment eval results by query type (simple factual, comparison, multi-hop, hypothetical) to understand where your system actually needs improvement.

:::

Interview Questions and Answers

Q: What are the four RAGAS metrics and what does each measure?

A: Faithfulness: whether every claim in the generated answer is supported by the retrieved context. Computed by extracting claims from the answer and checking each against the context. Catches hallucinations. Answer Relevancy: whether the answer addresses the question. Computed by generating hypothetical questions from the answer and measuring similarity to the original question. Catches answers that are technically accurate but don't address what was asked. Context Precision: whether the retrieved chunks are relevant to answering the question. Computed by judging relevance of each retrieved chunk. Catches retrieval noise - irrelevant chunks that dilute the context. Context Recall: whether all the information needed to answer the question is in the retrieved context. Computed by checking if each claim in the ground truth answer is attributable to the retrieved context. Catches retrieval misses - when relevant documents aren't retrieved.

Q: You see that faithfulness is 0.65 on your RAG system. What are the most likely root causes and how do you investigate?

A: Faithfulness of 0.65 means 35% of claims in answers aren't supported by retrieved context. Root causes in order of likelihood: (1) System prompt not strong enough - the LLM is defaulting to its parametric memory when context is ambiguous. Fix: add explicit instructions like "Answer ONLY from the provided context. If the context doesn't contain the answer, say 'I don't have information about this.'" (2) Irrelevant context confuses the model - retrieved chunks contain some relevant text but also unrelated content. The model uses both. Fix: improve retrieval precision, reduce top-k, add reranking. (3) Low context recall - the right information isn't in the retrieved context, so the LLM fills in with parametric memory to appear helpful. Fix: improve retrieval recall (chunking, embedding, index). Investigation: manually inspect 20-30 low-faithfulness examples. Identify which of these patterns is dominant, then prioritize fixes accordingly.

Q: How do you build a golden dataset when you don't have labeled data?

A: Four approaches depending on resources. Best: domain expert annotation - have 2-3 domain experts each write 50-100 questions and answers based on representative documents. Practical: LLM-generated questions with human review - use GPT-4 to generate realistic questions from document passages, then have one person review 50 randomly sampled pairs (takes 2-3 hours). This scales cheaply. Synthetic from production logs: if you have user queries from a previous system, sample 200-300 and have a domain expert write ground truth answers for each. This has highest real-world validity. Automated with answer verification: generate questions, run them through your RAG system, have an LLM judge whether the generated answer matches a reference answer from the raw document. The last approach is lowest quality but fastest. In practice: start with 100 LLM-generated + human-reviewed pairs. Expand to 500+ over time using production queries as they accumulate.

Q: What's the difference between offline evaluation and online evaluation for RAG, and what does each tell you?

A: Offline evaluation runs your system against a fixed golden dataset on a schedule. It tells you: whether a specific system change improved or degraded quality, how your system performs on your representative query distribution, and whether your retrieval and generation components meet your quality bar. It does not tell you: how users actually behave, what edge cases arise in production, or how quality evolves as your knowledge base changes. Online evaluation instruments the live system and captures real-user signals: thumbs up/down, follow-up questions, abandonment rates, answer copy events. It tells you how the system actually performs in the wild and detects drift. The limitation: online signals are lagged (you need volume to detect degradation), sparse (users rarely leave feedback), and biased (users with bad experiences may leave more feedback). The right practice: use offline eval to gate changes before deployment, and online monitoring to detect production drift between deployments.

Q: How do you detect retrieval quality drift in production?

A: Several automated signals: (1) Monitor the distribution of retrieval similarity scores. If average top-1 cosine similarity drops from 0.78 to 0.70, your queries are getting worse matches - maybe new query patterns or corpus drift. Alert at 10% degradation. (2) Track "no information" response rate - when the LLM correctly says it can't answer because context is inadequate. A sudden increase signals retrieval degradation. (3) Monitor context precision using a lightweight LLM judge on a random sample of production queries - sample 1% of queries, evaluate context relevance, flag if it drops below threshold. (4) Run your golden dataset evaluation on a monthly schedule - the golden dataset doesn't change, so you can detect absolute degradation over time. (5) Correlate with corpus growth events - when you add new documents, re-run offline eval immediately to catch any chunking or indexing issues before they affect users.

Eval-Driven Development Workflow

The most effective teams treat evaluation as the driver of engineering decisions, not an afterthought. The workflow:

This cycle runs continuously. Every engineering change - new chunking strategy, different embedding model, added reranking, system prompt update - is gated by its effect on the golden set. Changes that don't improve at least one eval metric without degrading others don't ship.
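The shipping rule in the last sentence can be expressed as a small gate function. This is a sketch under stated assumptions: `should_ship` is a hypothetical name, and the `tolerance` parameter is an assumption added so that run-to-run noise in LLM-judged scores is not mistaken for a real change.

```python
from typing import Dict


def should_ship(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """True if at least one metric improves by more than `tolerance`
    and no metric drops by more than `tolerance`."""
    improved = False
    for metric, base_score in baseline.items():
        delta = candidate[metric] - base_score
        if delta < -tolerance:
            return False  # a regression blocks the change outright
        if delta > tolerance:
            improved = True
    return improved


baseline = {"faithfulness": 0.82, "answer_relevancy": 0.80,
            "context_precision": 0.71, "context_recall": 0.66}
with_reranker = {"faithfulness": 0.82, "answer_relevancy": 0.80,
                 "context_precision": 0.79, "context_recall": 0.66}
smaller_top_k = {"faithfulness": 0.82, "answer_relevancy": 0.80,
                 "context_precision": 0.79, "context_recall": 0.60}

print(should_ship(baseline, with_reranker))  # True
print(should_ship(baseline, smaller_top_k))  # False
```

The second candidate improves precision by the same amount but loses recall, so it is rejected even though its headline gain looks identical.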

Automating Eval in CI/CD

Eval should run automatically, not just when someone remembers:

# .github/workflows/rag-eval.yml equivalent (pseudo-code)
# Real implementation: CI step that calls this script

import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

def run_ci_eval(rag_pipeline, golden_dataset_path: str, threshold: dict) -> bool:
    """
    Run evaluation and fail CI if metrics fall below thresholds.
    Returns True if all metrics pass, False if any fail.
    """
    # Golden dataset: a JSON list of {"question": ..., "ground_truth": ...}
    with open(golden_dataset_path) as f:
        golden = json.load(f)

    # Run every golden question through the pipeline under test
    results = []
    for item in golden:
        answer, contexts = rag_pipeline(item["question"])
        results.append({
            "question": item["question"],
            "answer": answer,
            "contexts": contexts,
            "ground_truth": item["ground_truth"],
        })

    dataset = Dataset.from_list(results)
    scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])

    passed = True
    for metric, score in scores.items():
        threshold_val = threshold.get(metric, 0.7)
        status = "PASS" if score >= threshold_val else "FAIL"
        print(f"{metric}: {score:.3f} (threshold: {threshold_val}) [{status}]")
        if score < threshold_val:
            passed = False

    return passed


# CI thresholds - tighten over time as the system matures
CI_THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.65,
}

# In CI:
# import sys
# passed = run_ci_eval(my_rag_pipeline, "golden_dataset.json", CI_THRESHOLDS)
# sys.exit(0 if passed else 1)

Running eval in CI prevents regressions from shipping. Start with loose thresholds (0.65-0.70) and tighten them as the system improves. Any PR that drops a metric below the threshold must fix the regression before merging.

Segment Your Eval by Query Type

Aggregate scores hide important variation. Always report metrics segmented by query category:

| Query Type | Faithfulness | Context Recall | Answer Relevancy | Notes |
| --- | --- | --- | --- | --- |
| Simple factual | 0.95 | 0.90 | 0.92 | Usually easy for RAG |
| Multi-hop reasoning | 0.71 | 0.63 | 0.78 | Recall often low - needs more chunks |
| Comparison queries | 0.68 | 0.72 | 0.80 | Faithfulness suffers - LLM infers |
| Temporal queries | 0.82 | 0.69 | 0.85 | Recall varies with indexing freshness |
| Hypothetical queries | 0.60 | 0.55 | 0.74 | Hardest - model tends to speculate |

When you see one segment with systematically low faithfulness, investigate that specific segment. Multi-hop queries failing faithfulness usually means the LLM is connecting information across chunks in ways that aren't explicitly stated in any single chunk - consider prompt changes that instruct the model to only state what is directly supported, not inferred. Temporal queries failing recall usually means your index isn't updated frequently enough - the relevant recent documents aren't indexed yet.
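Producing a segmented view only requires a query-type label on each golden question and the per-question scores. The sketch below assumes each row carries a `query_type` field (a hypothetical label you would add when building the golden dataset) alongside its metric scores; `segment_scores` is likewise a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def segment_scores(rows: List[Dict], metric: str) -> Dict[str, float]:
    """Mean of `metric` per query type, rounded to three decimals."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["query_type"]].append(row[metric])
    return {query_type: round(mean(scores), 3) for query_type, scores in by_type.items()}


# Per-question scores, e.g. exported from an evaluation run
rows = [
    {"query_type": "simple_factual", "faithfulness": 1.0},
    {"query_type": "simple_factual", "faithfulness": 0.9},
    {"query_type": "multi_hop", "faithfulness": 0.5},
    {"query_type": "multi_hop", "faithfulness": 0.7},
]
print(segment_scores(rows, "faithfulness"))
# {'simple_factual': 0.95, 'multi_hop': 0.6}
```

The aggregate faithfulness across these four rows is 0.775, which looks acceptable on its own and hides the fact that multi-hop questions sit at 0.6.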

Summary

Evaluation is the only reliable guide to RAG improvement. The RAGAS framework gives you four complementary metrics - faithfulness (is the answer grounded?), answer relevancy (does it address the question?), context precision (is the retrieved context clean?), and context recall (is the right information retrieved?) - that together diagnose whether failures come from retrieval, generation, or both. Build your golden dataset early (even 50 pairs is a start), run eval continuously, segment by query type, and gate every engineering change on measured improvement. Without eval, you're optimizing by intuition. With eval, you're doing engineering.

