Synthetic Data for RAG: Building Your Own Evaluation Dataset
It is a Tuesday morning and you are presenting your RAG system to the Head of Legal. She asks one question: "What is your system's accuracy rate?" You freeze. You spent three months building the retrieval pipeline, chunking strategy, reranker, and prompt templates. You tested it on about thirty questions you made up yourself. You liked what you saw. But you have no actual number. No ground-truth dataset. No systematic evaluation. You say "we're still measuring that" and watch her confidence in the project drop visibly.
This is not a hypothetical. It is the exact situation that most RAG teams find themselves in after deployment. They built a capable system but shipped it without a ground-truth evaluation dataset for their specific document corpus. Manual question construction is expensive - if you have ten thousand company policy documents, building a proper human-labeled QA dataset requires domain experts to read the documents, craft meaningful questions, and write verified answers. That is weeks of work and tens of thousands of dollars. Most teams never do it. They test a handful of examples, like what they see, and ship.
Then production teaches them what they missed. The system is confidently wrong on multi-hop questions that require connecting information from two different documents. It hallucinates answers to questions that fall just outside the corpus. It retrieves the right chunk but generates an answer that subtly contradicts it. Support tickets trickle in. Angry users escalate. The team scrambles to understand the failure pattern, but they have no systematic way to isolate the problem because they have no ground-truth evaluation dataset.
Synthetic data generation changes these economics entirely. You can generate a domain-specific QA evaluation dataset from your corpus automatically, for dollars, in hours. A ten-thousand-document corpus can yield five thousand high-quality evaluation questions overnight, covering factual retrieval, multi-hop reasoning, procedural tasks, adversarial probes, and boundary cases. The quality is not perfect - but it is good enough to reveal real failure modes, and you can selectively improve coverage of high-stakes topics with targeted human review. More importantly, you can compute the accuracy number the Head of Legal asked for.
Why RAG Evaluation Is a Different Problem
Synthetic data for instruction tuning generates instructions and responses from scratch. Synthetic data for RAG is different: you start with existing documents and generate questions those documents can answer. The challenge is not creativity - it is coverage, diversity, and realism. You need to generate questions that reflect what real users actually ask, not what an LLM thinks users would ask.
A second factor complicates RAG evaluation: the system has two components that can fail independently. The retriever can fail (wrong chunks returned). The generator can fail (wrong answer despite good chunks). A comprehensive evaluation dataset must distinguish between these two failure types, because the fixes are completely different.
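To make the retriever/generator split concrete, here is a minimal sketch of how a single evaluation result could be attributed to one component or the other. The `EvalOutcome` record and its field names are hypothetical, introduced only for illustration; they are not part of the pipeline built later in this lesson.

```python
from dataclasses import dataclass


@dataclass
class EvalOutcome:
    """Hypothetical per-question record; field names are illustrative."""
    required_chunk_ids: set[str]
    retrieved_chunk_ids: set[str]
    answer_correct: bool


def attribute_failure(outcome: EvalOutcome) -> str:
    """Label a result as ok, retrieval_failure, or generation_failure."""
    if outcome.answer_correct:
        return "ok"
    # If any required chunk is missing, the generator never saw the evidence.
    if not outcome.required_chunk_ids <= outcome.retrieved_chunk_ids:
        return "retrieval_failure"
    # All evidence was retrieved, yet the answer is still wrong.
    return "generation_failure"
```

Counting these labels over a whole dataset gives a rough idea of whether effort should go into chunking and retrieval or into prompting and generation.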
Question Taxonomy for Comprehensive RAG Coverage
Not all questions test the same capability. A RAG system that scores 90% on factual questions may collapse to 40% on multi-hop questions - and you would never know if your evaluation dataset only contains one type. Build coverage across all five question types.
| Question Type | What It Tests | Typical Failure Mode | Example |
|---|---|---|---|
| Factual | Basic retrieval accuracy and extraction | Retrieves wrong chunk, or generates wrong fact from right chunk | "What is the maximum file upload size for free accounts?" |
| Procedural | Multi-step instruction following and ordering | Correct steps but wrong order, or missing a step | "How do I set up two-factor authentication?" |
| Multi-hop | Chaining information from 2+ chunks | Retriever only finds one chunk, missing the connection | "What is the refund policy for users on the Enterprise plan?" |
| Adversarial | Hallucination resistance and scope awareness | Model invents an answer rather than declining | "What was the company's revenue in Q3 2019?" |
| Comparative | Reasoning across multiple document sections | Mixes up which plan has which feature | "How does the free plan differ from the Pro plan for API rate limits?" |
The adversarial type deserves special attention. These are questions that fall just outside your corpus - similar to questions the system can answer, but not quite answerable from the documents. They test whether your RAG system knows what it does not know. A system that scores 85% on factual questions but answers 70% of adversarial questions with invented facts is dangerous in production.
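The point about aggregate scores hiding per-type weaknesses can be shown with a small sketch. The results below are made up to mirror the 90% / 40% example above, and `accuracy_by_type` is an illustrative helper, not part of the generator that follows.

```python
from collections import defaultdict


def accuracy_by_type(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the fraction of correct answers for each question type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Made-up results: 10 factual (9 correct) and 10 multi-hop (4 correct).
toy_results = (
    [("factual", True)] * 9 + [("factual", False)]
    + [("multi_hop", True)] * 4 + [("multi_hop", False)] * 6
)
per_type = accuracy_by_type(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)
```

Here the overall accuracy is 0.65, a number that looks mediocre but not alarming, while the per-type view shows factual at 0.9 and multi-hop at 0.4.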
Core Implementation: The RAG Dataset Generator
import anthropic
import json
import random
import re
import hashlib
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional
from collections import Counter
client = anthropic.Anthropic()
@dataclass
class DocumentChunk:
    """A single retrievable chunk from the source corpus."""
    chunk_id: str
    document_id: str
    document_title: str
    content: str
    page_number: Optional[int] = None
    section_title: Optional[str] = None
    importance_weight: float = 1.0  # Higher = more questions generated
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Derive a stable ID from the content when none is supplied
        if not self.chunk_id:
            self.chunk_id = hashlib.md5(
                self.content.encode()
            ).hexdigest()[:12]


@dataclass
class QAPair:
    """A question-answer pair for RAG evaluation."""
    question_id: str
    question: str
    ground_truth_answer: str
    source_chunk_ids: list[str]
    question_type: str  # factual, procedural, multi_hop, adversarial, comparative
    difficulty: str  # easy, medium, hard
    requires_synthesis: bool  # True if answer needs combining info from multiple chunks
    metadata: dict = field(default_factory=dict)


@dataclass
class FilterResult:
    """Result of quality filtering for a QA pair."""
    passed: bool
    reason: str
    score: float = 1.0
# ── Question Generation Prompts ──────────────────────────────────────────────
FACTUAL_PROMPT = """You are building an evaluation dataset for a Retrieval-Augmented Generation (RAG) system.
Document chunk:
{chunk_content}
Generate {n_questions} factual questions that:
1. Can be answered DIRECTLY from this chunk alone
2. Require a specific fact present in the chunk (not general knowledge)
3. Have a clear, unambiguous answer derivable from the text
4. A real user would plausibly ask when using this system
5. Vary across different specific facts in the chunk
For each question, also write the ground truth answer from the chunk.
Output ONLY a JSON array:
[{{"question": "...", "answer": "...", "difficulty": "easy|medium|hard"}}]"""
PROCEDURAL_PROMPT = """You are building an evaluation dataset for a RAG system.
Document chunk (contains procedural/how-to content):
{chunk_content}
Generate {n_questions} procedural questions that:
1. Ask "how to" accomplish something described in this chunk
2. Require the full step-by-step procedure to answer correctly
3. Would be asked by someone actively trying to complete the task
4. Have answers that are the specific steps from the chunk
Output ONLY a JSON array:
[{{"question": "...", "answer": "...", "difficulty": "easy|medium|hard"}}]"""
MULTI_HOP_PROMPT = """You are building an evaluation dataset for a RAG system.
Chunk A (from document: {doc_a}):
{chunk_a}
Chunk B (from document: {doc_b}):
{chunk_b}
Generate {n_questions} multi-hop questions that:
1. Require information from BOTH chunks to answer completely
2. CANNOT be answered from either chunk alone
3. Require connecting or synthesizing information across both chunks
4. A user who wants a comprehensive answer would naturally ask
For each question, write the complete ground truth answer that synthesizes both chunks.
Output ONLY a JSON array:
[{{"question": "...", "answer": "...", "requires_chunks": ["A", "B"]}}]"""
ADVERSARIAL_PROMPT = """You are building an adversarial evaluation dataset for a RAG system.
Document chunk:
{chunk_content}
Generate {n_questions} adversarial questions that:
1. Are SIMILAR in style and topic to questions this chunk could answer
2. CANNOT be fully answered from this chunk (information not present)
3. Test whether the RAG system will hallucinate rather than say "I don't know"
4. Are plausible questions a real user might ask
The correct system behavior for each is to decline or express uncertainty.
Output ONLY a JSON array:
[{{"question": "...", "why_adversarial": "what information is missing", "expected_behavior": "should_decline|partial_answer"}}]"""
COMPARATIVE_PROMPT = """You are building an evaluation dataset for a RAG system.
Document section A (topic: {topic_a}):
{chunk_a}
Document section B (topic: {topic_b}):
{chunk_b}
Generate {n_questions} comparative questions that:
1. Ask how A differs from B, or compare specific attributes
2. Require correctly understanding BOTH sections
3. Would be asked by someone trying to choose between options or understand differences
4. Have specific, factual answers derivable from both chunks
Output ONLY a JSON array:
[{{"question": "...", "answer": "...", "difficulty": "easy|medium|hard"}}]"""
Question Generation Functions
def generate_factual_questions(
    chunk: DocumentChunk,
    n_questions: int = 3,
    model: str = "claude-haiku-4-5-20251001",
) -> list[QAPair]:
    """Generate factual QA pairs from a single chunk."""
    prompt = FACTUAL_PROMPT.format(
        chunk_content=chunk.content,
        n_questions=n_questions
    )
    try:
        response = client.messages.create(
            model=model,
            max_tokens=1500,
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        # Extract the JSON array even if the model wraps it in extra text
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if not json_match:
            return []
        qa_list = json.loads(json_match.group())
        pairs = []
        for i, qa in enumerate(qa_list):
            if not qa.get("question") or not qa.get("answer"):
                continue
            pairs.append(QAPair(
                question_id=f"{chunk.chunk_id}_factual_{i}",
                question=qa["question"],
                ground_truth_answer=qa["answer"],
                source_chunk_ids=[chunk.chunk_id],
                question_type="factual",
                difficulty=qa.get("difficulty", "medium"),
                requires_synthesis=False,
                metadata={
                    "document_id": chunk.document_id,
                    "document_title": chunk.document_title,
                    "section_title": chunk.section_title,
                }
            ))
        return pairs
    except Exception:
        # Best-effort generation: API errors and malformed JSON yield no pairs
        return []
def generate_procedural_questions(
    chunk: DocumentChunk,
    n_questions: int = 2,
    model: str = "claude-haiku-4-5-20251001",
) -> list[QAPair]:
    """Generate procedural QA pairs for how-to chunks."""
    prompt = PROCEDURAL_PROMPT.format(
        chunk_content=chunk.content,
        n_questions=n_questions
    )
    try:
        response = client.messages.create(
            model=model,
            max_tokens=1200,
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if not json_match:
            return []
        qa_list = json.loads(json_match.group())
        return [
            QAPair(
                question_id=f"{chunk.chunk_id}_proc_{i}",
                question=qa["question"],
                ground_truth_answer=qa["answer"],
                source_chunk_ids=[chunk.chunk_id],
                question_type="procedural",
                difficulty=qa.get("difficulty", "medium"),
                requires_synthesis=False,
                metadata={"document_title": chunk.document_title}
            )
            for i, qa in enumerate(qa_list)
            if qa.get("question") and qa.get("answer")
        ]
    except Exception:
        return []
def generate_multi_hop_questions(
    chunk_a: DocumentChunk,
    chunk_b: DocumentChunk,
    n_questions: int = 2,
    model: str = "claude-opus-4-6",  # Use stronger model for multi-hop
) -> list[QAPair]:
    """
    Generate multi-hop questions requiring both chunks.
    Use the stronger model - multi-hop generation is hard and errors are costly.
    """
    prompt = MULTI_HOP_PROMPT.format(
        doc_a=chunk_a.document_title,
        chunk_a=chunk_a.content,
        doc_b=chunk_b.document_title,
        chunk_b=chunk_b.content,
        n_questions=n_questions
    )
    try:
        response = client.messages.create(
            model=model,
            max_tokens=1500,
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if not json_match:
            return []
        qa_list = json.loads(json_match.group())
        pairs = []
        for i, qa in enumerate(qa_list):
            if not qa.get("question") or not qa.get("answer"):
                continue
            pairs.append(QAPair(
                question_id=f"{chunk_a.chunk_id}_{chunk_b.chunk_id}_hop_{i}",
                question=qa["question"],
                ground_truth_answer=qa["answer"],
                source_chunk_ids=[chunk_a.chunk_id, chunk_b.chunk_id],
                question_type="multi_hop",
                difficulty="hard",
                requires_synthesis=True,
                metadata={
                    "chunk_a_doc": chunk_a.document_title,
                    "chunk_b_doc": chunk_b.document_title,
                    "chunk_a_section": chunk_a.section_title,
                    "chunk_b_section": chunk_b.section_title,
                }
            ))
        return pairs
    except Exception:
        return []
def generate_adversarial_questions(
    chunk: DocumentChunk,
    n_questions: int = 1,
    model: str = "claude-haiku-4-5-20251001",
) -> list[QAPair]:
    """Generate adversarial questions that test hallucination resistance."""
    prompt = ADVERSARIAL_PROMPT.format(
        chunk_content=chunk.content,
        n_questions=n_questions
    )
    try:
        response = client.messages.create(
            model=model,
            max_tokens=800,
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if not json_match:
            return []
        qa_list = json.loads(json_match.group())
        pairs = []
        for i, qa in enumerate(qa_list):
            if not qa.get("question"):
                continue
            pairs.append(QAPair(
                question_id=f"{chunk.chunk_id}_adv_{i}",
                question=qa["question"],
                # Fixed reference answer: the correct behavior is to decline
                ground_truth_answer="This information is not available in the provided documentation.",
                source_chunk_ids=[chunk.chunk_id],
                question_type="adversarial",
                difficulty="hard",
                requires_synthesis=False,
                metadata={
                    "expected_behavior": qa.get("expected_behavior", "should_decline"),
                    "why_adversarial": qa.get("why_adversarial", ""),
                    "document_title": chunk.document_title,
                }
            ))
        return pairs
    except Exception:
        return []
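All four generators repeat the same parse step: find a bracketed span in the model reply and load it as JSON. As a sketch, that step could be factored into one helper that can be tested without calling the API. `extract_json_array` is a hypothetical name, not something defined elsewhere in this lesson.

```python
import json
import re


def extract_json_array(text: str) -> list[dict]:
    """Pull the first-to-last bracketed span out of a model reply and parse it.

    Returns an empty list when no array is found or the JSON is invalid.
    """
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if not match:
        return []
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return []
    # Keep only dict items so callers can safely use .get().
    return [item for item in parsed if isinstance(item, dict)]
```

Note that the greedy pattern spans from the first `[` to the last `]`, so a reply with stray brackets after the array will fail to parse and yield an empty list. That trade-off favors dropping a batch over keeping malformed pairs.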
Quality Filtering Pipeline
Not every generated QA pair is usable. A filtering pipeline catches the most common failure modes: questions that are too vague, answers not grounded in the source chunk, questions that reference document structure rather than content, and trivially obvious questions.
# ── Structural and Heuristic Filters ──────────────────────────────────────
VAGUE_PATTERNS = [
"what is it", "what does this mean", "what is this",
"tell me about", "explain everything", "describe this",
"what happened", "what is the purpose",
]
META_PATTERNS = [
"this document", "this passage", "this text", "as stated above",
"according to this", "in this section", "the above", "as mentioned",
"this chunk", "this excerpt",
]
QUESTION_WORDS = ["what", "how", "why", "when", "where", "who", "which", "can", "does", "is", "are"]
STOP_WORDS = {
"the", "a", "an", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "do", "does", "did", "will", "would", "could",
"should", "may", "might", "shall", "can", "in", "on", "at", "to",
"for", "of", "and", "or", "but", "it", "its", "this", "that",
"with", "from", "by", "as", "not", "no", "so", "if", "but",
}
def structural_filter(pair: QAPair) -> FilterResult:
    """Check basic structural requirements."""
    q = pair.question.strip()
    a = pair.ground_truth_answer.strip()
    # Question must be non-empty
    if len(q) < 10:
        return FilterResult(False, "question_too_short")
    # Must be a question (ends with ? or starts with question word)
    is_question = q.endswith("?") or any(
        q.lower().startswith(w) for w in QUESTION_WORDS
    )
    if not is_question:
        return FilterResult(False, "not_a_question")
    # For non-adversarial: answer must have substance
    if pair.question_type != "adversarial":
        if len(a.split()) < 5:
            return FilterResult(False, "answer_too_short")
    return FilterResult(True, "passed", 1.0)
def heuristic_filter(pair: QAPair, source_chunk: DocumentChunk) -> FilterResult:
    """Check question quality heuristics."""
    q_lower = pair.question.lower()
    # Question too vague
    if any(p in q_lower for p in VAGUE_PATTERNS):
        return FilterResult(False, "question_too_vague")
    # Question references document structure (should be self-contained)
    if any(p in q_lower for p in META_PATTERNS):
        return FilterResult(False, "references_document_structure")
    # For factual and procedural: verify answer is grounded in source chunk
    if pair.question_type in ("factual", "procedural"):
        # Tokenize on word characters so trailing punctuation ("mb." vs "mb")
        # does not lower the overlap score
        answer_words = set(re.findall(r"\w+", pair.ground_truth_answer.lower())) - STOP_WORDS
        chunk_words = set(re.findall(r"\w+", source_chunk.content.lower()))
        if len(answer_words) > 0:
            overlap = len(answer_words & chunk_words) / len(answer_words)
            if overlap < 0.25:
                return FilterResult(False, "answer_not_grounded_in_chunk", overlap)
    return FilterResult(True, "passed", 1.0)
def answerability_check(
    pair: QAPair,
    source_chunk: DocumentChunk,
    model: str = "claude-haiku-4-5-20251001",
) -> FilterResult:
    """
    Use LLM to verify a question can actually be answered from the chunk.
    More expensive - apply only after heuristic filters pass.
    """
    if pair.question_type == "adversarial":
        # Adversarial questions should NOT be answerable - skip this check
        return FilterResult(True, "adversarial_skipped", 1.0)
    # Multi-hop pairs cannot be answered from one chunk by design, so the
    # caller passes a chunk whose content combines all source chunks
    context = source_chunk.content
    prompt = f"""Given this document excerpt, can the question be answered using ONLY the information here?
Document:
{context}
Question: {pair.question}
Answer with JSON only:
{{"answerable": true/false, "confidence": 0.0-1.0, "reason": "one sentence"}}"""
    try:
        response = client.messages.create(
            model=model,
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        json_match = re.search(r'\{.*\}', text, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group())
            answerable = result.get("answerable", True)
            confidence = float(result.get("confidence", 0.5))
            if not answerable and confidence > 0.7:
                return FilterResult(
                    False,
                    f"not_answerable: {result.get('reason', '')}",
                    confidence
                )
    except Exception:
        # Fail open: on API or parse errors, keep the pair
        pass
    return FilterResult(True, "passed", 1.0)


def validate_qa_pair(
    pair: QAPair,
    chunk_map: dict[str, DocumentChunk],
    use_llm_answerability: bool = False,
) -> FilterResult:
    """Run all filters for a QA pair. Returns first failure or pass."""
    # Get primary source chunk
    primary_chunk_id = pair.source_chunk_ids[0] if pair.source_chunk_ids else None
    source_chunk = chunk_map.get(primary_chunk_id) if primary_chunk_id else None
    if not source_chunk and pair.question_type != "adversarial":
        return FilterResult(False, "source_chunk_not_found")
    # Stage 1: Structural
    result = structural_filter(pair)
    if not result.passed:
        return result
    # Stage 2: Heuristic
    if source_chunk:
        result = heuristic_filter(pair, source_chunk)
        if not result.passed:
            return result
    # Stage 3: LLM answerability (optional, costs money)
    if use_llm_answerability and source_chunk:
        check_chunk = source_chunk
        if pair.question_type == "multi_hop":
            # Checking a multi-hop question against only its first chunk would
            # reject it, so combine the content of every source chunk
            contents = [
                chunk_map[cid].content
                for cid in pair.source_chunk_ids
                if cid in chunk_map
            ]
            check_chunk = DocumentChunk(
                chunk_id=source_chunk.chunk_id,
                document_id=source_chunk.document_id,
                document_title=source_chunk.document_title,
                content="\n\n".join(contents),
            )
        result = answerability_check(pair, check_chunk)
        if not result.passed:
            return result
    return FilterResult(True, "passed", 1.0)
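The grounding heuristic is easier to reason about in isolation. The sketch below computes the same kind of token-overlap ratio on a made-up chunk; the short stop-word list and the example sentences are assumptions for illustration, and the tokenizer here uses a regular expression so punctuation does not affect the ratio.

```python
import re

# Small illustrative stop-word list, not the full STOP_WORDS set.
_STOP = {"the", "a", "an", "is", "are", "of", "to", "in", "for", "and"}


def grounding_overlap(answer: str, chunk_text: str) -> float:
    """Fraction of non-stop-word answer tokens that also appear in the chunk."""
    answer_tokens = set(re.findall(r"\w+", answer.lower())) - _STOP
    chunk_tokens = set(re.findall(r"\w+", chunk_text.lower()))
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & chunk_tokens) / len(answer_tokens)


chunk = "Free accounts can upload files up to 25 MB. Pro accounts can upload up to 2 GB."
grounded = grounding_overlap("The limit for free accounts is 25 MB.", chunk)
ungrounded = grounding_overlap("Refunds are processed within thirty days.", chunk)
```

With these inputs the grounded answer scores 0.8 and the ungrounded one scores 0.0, so a 0.25 threshold separates them comfortably. Paraphrased answers score lower than verbatim ones, which is why the threshold is kept loose.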
Detecting Procedural Chunks
A key optimization: only generate procedural questions from chunks that contain procedures. Running the procedural prompt on a definition-style chunk wastes money and produces low-quality output.
PROCEDURAL_SIGNALS = [
# Numbered steps
"step 1", "step 2", "step 3", "step one", "step two",
# Ordered list signals
"first,", "second,", "third,", "finally,", "then,",
# Explicit instruction language
"to do this", "follow these steps", "instructions:", "procedure:",
"how to", "in order to", "you will need to", "navigate to",
"click on", "select ", "enter your", "type in", "press ",
# Document structure signals
"requirements:", "prerequisites:", "before you begin",
]
COMPARATIVE_SIGNALS = [
"vs", "versus", "compared to", "difference between", "unlike",
"on the other hand", "in contrast", "alternatively", "whereas",
"while ", "however,", "but ", "plan a", "plan b",
"basic plan", "pro plan", "enterprise plan", "free tier", "paid tier",
]
def classify_chunk_content(chunk: DocumentChunk) -> set[str]:
    """
    Classify what question types are appropriate for this chunk.
    Returns a set of applicable types.
    """
    content_lower = chunk.content.lower()
    applicable = {"factual", "adversarial"}  # Always applicable
    # Check for procedural content
    procedural_score = sum(
        1 for signal in PROCEDURAL_SIGNALS
        if signal in content_lower
    )
    if procedural_score >= 2:
        applicable.add("procedural")
    # Check for comparative content
    comparative_score = sum(
        1 for signal in COMPARATIVE_SIGNALS
        if signal in content_lower
    )
    if comparative_score >= 2:
        applicable.add("comparative")
    return applicable
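A small standalone sketch shows why the classifier requires at least two signals rather than one. The shortened signal list and both example texts are made up for illustration.

```python
# Shortened, illustrative signal list (the full list is PROCEDURAL_SIGNALS).
SIGNALS = ["step 1", "step 2", "follow these steps", "click on", "navigate to"]


def signal_score(text: str, signals: list[str]) -> int:
    """Count how many distinct signal phrases occur in the text."""
    lowered = text.lower()
    return sum(1 for s in signals if s in lowered)


how_to = "Follow these steps. Step 1: navigate to Settings. Step 2: click on Security."
definition = "Two-factor authentication adds a second check. Click on the link to learn more."

how_to_score = signal_score(how_to, SIGNALS)
definition_score = signal_score(definition, SIGNALS)
```

The how-to text matches all five signals, while the definition matches only "click on". A threshold of two keeps the definition out of the procedural prompt. Because matching is by substring, short signals such as "vs" or "but " can fire on unrelated text, so the lists are worth tuning against a sample of your own corpus.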
The Complete RAG Dataset Generator
class RAGDatasetGenerator:
    """
    Complete pipeline for generating RAG evaluation datasets from documents.

    Usage:
        chunks = load_your_chunks()  # List[DocumentChunk]
        generator = RAGDatasetGenerator(
            factual_per_chunk=3,
            adversarial_per_chunk=1,
            multi_hop_pairs_per_doc=5,
        )
        dataset = generator.generate_from_corpus(chunks)
    """

    def __init__(
        self,
        factual_per_chunk: int = 3,
        procedural_per_chunk: int = 2,
        adversarial_per_chunk: int = 1,
        multi_hop_pairs_per_doc: int = 5,
        min_chunk_words: int = 50,
        use_llm_answerability: bool = False,  # Expensive; enable for high-stakes
        target_type_distribution: Optional[dict] = None,
    ):
        self.factual_per_chunk = factual_per_chunk
        self.procedural_per_chunk = procedural_per_chunk
        self.adversarial_per_chunk = adversarial_per_chunk
        self.multi_hop_pairs_per_doc = multi_hop_pairs_per_doc
        self.min_chunk_words = min_chunk_words
        self.use_llm_answerability = use_llm_answerability
        self.target_distribution = target_type_distribution or {
            "factual": 0.45,
            "procedural": 0.20,
            "multi_hop": 0.15,
            "adversarial": 0.15,
            "comparative": 0.05,
        }
        self._stats = {
            "chunks_processed": 0,
            "pairs_generated": 0,
            "pairs_filtered": 0,
            "filter_reasons": Counter(),
        }
    def generate_from_corpus(
        self,
        chunks: list[DocumentChunk],
        output_path: str = "rag_eval_dataset.jsonl",
        skip_short_chunks: bool = True,
    ) -> list[QAPair]:
        """
        Generate a complete RAG evaluation dataset from document chunks.

        Args:
            chunks: List of DocumentChunk objects from your corpus
            output_path: Where to save the JSONL dataset
            skip_short_chunks: Skip chunks with < min_chunk_words words

        Returns:
            List of validated QAPair objects
        """
        chunk_map = {c.chunk_id: c for c in chunks}
        all_pairs: list[QAPair] = []
        print(f"Generating RAG evaluation dataset from {len(chunks)} chunks...")
        # Filter out short chunks (they produce poor questions)
        usable_chunks = [
            c for c in chunks
            if len(c.content.split()) >= self.min_chunk_words
        ] if skip_short_chunks else chunks
        print(f"  Usable chunks after length filter: {len(usable_chunks)}")

        # ── Per-chunk generation ─────────────────────────────────────────
        for i, chunk in enumerate(usable_chunks):
            if i % 20 == 0:
                print(f"  Processing chunk {i+1}/{len(usable_chunks)}...")
            applicable_types = classify_chunk_content(chunk)
            chunk_pairs = []
            # Factual questions (always); request at least one even when a
            # low importance weight would round the count down to zero
            n_factual = max(1, round(self.factual_per_chunk * chunk.importance_weight))
            factual = generate_factual_questions(chunk, n_factual)
            chunk_pairs.extend(factual)
            # Procedural (only if chunk has procedural content)
            if "procedural" in applicable_types:
                proc = generate_procedural_questions(chunk, self.procedural_per_chunk)
                chunk_pairs.extend(proc)
            # Adversarial
            adv = generate_adversarial_questions(chunk, self.adversarial_per_chunk)
            chunk_pairs.extend(adv)
            all_pairs.extend(chunk_pairs)
            self._stats["chunks_processed"] += 1
            self._stats["pairs_generated"] += len(chunk_pairs)
            # Rate limiting: be kind to the API
            if i > 0 and i % 50 == 0:
                time.sleep(1.0)

        # ── Multi-hop generation (per document) ─────────────────────────
        doc_chunks: dict[str, list[DocumentChunk]] = {}
        for chunk in usable_chunks:
            doc_chunks.setdefault(chunk.document_id, []).append(chunk)
        total_hop = 0
        for doc_id, doc_chunk_list in doc_chunks.items():
            if len(doc_chunk_list) < 2:
                continue
            # Sample non-adjacent chunk pairs to encourage true multi-hop
            all_pairs_candidate = [
                (a, b)
                for i, a in enumerate(doc_chunk_list)
                for b in doc_chunk_list[i+2:]  # Skip adjacent chunks
            ]
            if not all_pairs_candidate:
                # Fallback: allow adjacent if not enough options
                all_pairs_candidate = [
                    (a, b)
                    for i, a in enumerate(doc_chunk_list)
                    for b in doc_chunk_list[i+1:]
                ]
            n_pairs = min(self.multi_hop_pairs_per_doc, len(all_pairs_candidate))
            selected_pairs = random.sample(all_pairs_candidate, n_pairs)
            for chunk_a, chunk_b in selected_pairs:
                hop_pairs = generate_multi_hop_questions(chunk_a, chunk_b, n_questions=2)
                all_pairs.extend(hop_pairs)
                total_hop += len(hop_pairs)
        print(f"  Generated {total_hop} multi-hop pairs across {len(doc_chunks)} documents")

        # ── Quality filtering ─────────────────────────────────────────────
        validated_pairs = []
        for pair in all_pairs:
            result = validate_qa_pair(
                pair,
                chunk_map,
                use_llm_answerability=self.use_llm_answerability
            )
            if result.passed:
                validated_pairs.append(pair)
            else:
                self._stats["pairs_filtered"] += 1
                self._stats["filter_reasons"][result.reason] += 1
        all_pairs = validated_pairs
        print(f"  Validation: {len(all_pairs)} passed, {self._stats['pairs_filtered']} filtered")

        # ── Save to JSONL ─────────────────────────────────────────────────
        with open(output_path, "w", encoding="utf-8") as f:
            for pair in all_pairs:
                f.write(json.dumps({
                    "question_id": pair.question_id,
                    "question": pair.question,
                    "ground_truth_answer": pair.ground_truth_answer,
                    "source_chunk_ids": pair.source_chunk_ids,
                    "question_type": pair.question_type,
                    "difficulty": pair.difficulty,
                    "requires_synthesis": pair.requires_synthesis,
                    "metadata": pair.metadata,
                }) + "\n")
        self._print_summary(all_pairs)
        return all_pairs
    def _print_summary(self, pairs: list[QAPair]) -> None:
        """Print dataset statistics."""
        if not pairs:
            print("No pairs generated.")
            return
        type_counts = Counter(p.question_type for p in pairs)
        difficulty_counts = Counter(p.difficulty for p in pairs)
        print(f"\nDataset Summary")
        print(f"{'='*40}")
        print(f"Total pairs: {len(pairs)}")
        print(f"\nBy type:")
        for qtype, count in sorted(type_counts.items()):
            pct = count / len(pairs) * 100
            bar = "#" * round(pct / 2)
            print(f"  {qtype:15s} {count:4d} ({pct:4.1f}%) {bar}")
        print(f"\nBy difficulty:")
        for diff, count in sorted(difficulty_counts.items()):
            pct = count / len(pairs) * 100
            print(f"  {diff:8s} {count:4d} ({pct:4.1f}%)")
        print(f"\nFilter reasons:")
        for reason, count in self._stats["filter_reasons"].most_common(5):
            print(f"  {reason:40s} {count:4d}")
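The generator stores a `target_type_distribution`, but `generate_from_corpus` as shown does not apply it. One possible way to use it is to downsample the finished dataset toward the target shares. The sketch below is an assumption about how that could work, not the author's method; `rebalance_by_type` is a hypothetical helper operating on plain dicts such as the rows written to the JSONL file.

```python
import random
from collections import defaultdict


def rebalance_by_type(
    items: list[dict],
    target: dict[str, float],
    seed: int = 0,
) -> list[dict]:
    """Downsample items so question types approximate the target shares.

    Each item is a dict with a "question_type" key. Types missing from
    `target` are dropped. Nothing is upsampled.
    """
    by_type = defaultdict(list)
    for item in items:
        by_type[item["question_type"]].append(item)

    # The largest total size for which every targeted type has enough items.
    feasible_total = min(
        len(by_type[t]) / share for t, share in target.items() if share > 0
    )
    rng = random.Random(seed)
    balanced = []
    for t, share in target.items():
        quota = min(len(by_type[t]), round(feasible_total * share))
        balanced.extend(rng.sample(by_type[t], quota))
    return balanced
```

Because nothing is upsampled, the scarcest type caps the dataset size, and a targeted type with zero examples yields an empty result. In practice that is a signal to generate more of the missing type before rebalancing.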
RAG System Evaluation
Once you have the evaluation dataset, run your RAG system against it. The key insight: evaluate retrieval and generation separately, because they fail for different reasons and require different fixes.
@dataclass
class RAGSystemResult:
    """Outcome of querying RAG system for one question."""
    question_id: str
    question: str
    ground_truth: str
    retrieved_chunk_ids: list[str]
    generated_answer: str
    # Computed metrics
    retrieval_precision: float = 0.0
    retrieval_recall: float = 0.0
    answer_faithfulness: float = 0.0
    answer_relevance: float = 0.0
    answer_correctness: float = 0.0
def evaluate_retrieval(
    result: RAGSystemResult,
    qa_pair: QAPair,
) -> tuple[float, float]:
    """
    Compute retrieval precision and recall.

    Precision: what fraction of retrieved chunks were relevant?
    Recall: what fraction of required chunks were retrieved?

    A system can have high recall (gets the right chunks) but low precision
    (also retrieves many irrelevant chunks), or vice versa.
    """
    required = set(qa_pair.source_chunk_ids)
    retrieved = set(result.retrieved_chunk_ids)
    if not retrieved:
        return 0.0, 0.0
    precision = len(required & retrieved) / len(retrieved)
    recall = len(required & retrieved) / len(required) if required else 0.0
    return precision, recall
def evaluate_answer_faithfulness(
    generated_answer: str,
    retrieved_chunks_content: list[str],
) -> float:
    """
    Measure whether the answer is supported by the retrieved context.
    Unfaithful = hallucinated information not in the context.
    """
    if not retrieved_chunks_content:
        return 0.0
    # Use at most five chunks; the context is truncated to 3000 characters below
    context = "\n\n---\n\n".join(retrieved_chunks_content[:5])
    prompt = f"""Is the following answer fully supported by the provided context?
Context:
{context[:3000]}
Answer: {generated_answer}
Rules:
- Rate 1.0 if every claim in the answer is explicitly supported by the context
- Rate 0.5 if most claims are supported but some are inferred beyond what's written
- Rate 0.0 if the answer contains significant unsupported or contradicted claims
Respond with ONLY a decimal number between 0.0 and 1.0."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        return min(1.0, max(0.0, float(response.content[0].text.strip())))
    except ValueError:
        # Unparseable judge output: fall back to a neutral score
        return 0.5
def evaluate_answer_correctness(
    ground_truth: str,
    generated: str,
    question_type: str,
) -> float:
    """
    Compare generated answer to ground truth answer.
    For adversarial: check if system correctly declined.
    """
    # Special handling for adversarial questions
    if question_type == "adversarial":
        decline_signals = [
            "not available", "don't have", "do not have", "cannot find",
            "not in the documentation", "not mentioned", "unable to find",
            "i don't know", "not sure", "no information", "not provided",
        ]
        correctly_declined = any(
            signal in generated.lower() for signal in decline_signals
        )
        return 1.0 if correctly_declined else 0.0
    prompt = f"""Compare these two answers for factual agreement. Do they convey the same information?
Ground truth: {ground_truth}
Generated answer: {generated}
Score:
- 1.0: Same meaning and key facts
- 0.7: Mostly correct with minor omissions
- 0.5: Partially correct, missing key information
- 0.2: Mostly wrong or misleading
- 0.0: Completely wrong or irrelevant
Respond with ONLY a decimal number."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        return min(1.0, max(0.0, float(response.content[0].text.strip())))
    except ValueError:
        # Unparseable judge output: fall back to a neutral score
        return 0.5
def run_evaluation(
    qa_pairs: list[QAPair],
    rag_query_fn,  # Callable: (question: str) -> (chunk_ids: list[str], chunk_contents: list[str], answer: str)
    sample_size: int = 200,
    stratify_by_type: bool = True,
) -> dict:
    """
    Run full RAG evaluation against the synthetic dataset.

    Args:
        qa_pairs: The evaluation dataset
        rag_query_fn: Your RAG system's query function
        sample_size: How many questions to evaluate (cost control)
        stratify_by_type: Sample proportionally from each question type
    """
    # Stratified sampling: proportional across question types
    if stratify_by_type:
        by_type: dict[str, list[QAPair]] = {}
        for pair in qa_pairs:
            by_type.setdefault(pair.question_type, []).append(pair)
        sample = []
        for qtype, type_pairs in by_type.items():
            n = max(1, round(sample_size * len(type_pairs) / len(qa_pairs)))
            sample.extend(random.sample(type_pairs, min(n, len(type_pairs))))
        # Shuffle before truncating so any rounding overflow is not always
        # cut from whichever question type happens to come last
        random.shuffle(sample)
        sample = sample[:sample_size]
    else:
        sample = random.sample(qa_pairs, min(sample_size, len(qa_pairs)))
    pair_map = {p.question_id: p for p in qa_pairs}
    results = []
    print(f"Evaluating {len(sample)} questions...")
    for i, pair in enumerate(sample):
        if i % 20 == 0:
            print(f"  Question {i+1}/{len(sample)}")
        try:
            chunk_ids, chunk_contents, generated_answer = rag_query_fn(pair.question)
            result = RAGSystemResult(
                question_id=pair.question_id,
                question=pair.question,
                ground_truth=pair.ground_truth_answer,
                retrieved_chunk_ids=chunk_ids,
                generated_answer=generated_answer,
            )
            # Retrieval evaluation
            result.retrieval_precision, result.retrieval_recall = evaluate_retrieval(
                result, pair
            )
            # Generation evaluation
            result.answer_faithfulness = evaluate_answer_faithfulness(
                generated_answer, chunk_contents
            )
            result.answer_correctness = evaluate_answer_correctness(
                pair.ground_truth_answer, generated_answer, pair.question_type
            )
            results.append(result)
        except Exception as e:
            print(f"  Error on {pair.question_id}: {e}")
            continue

    # Aggregate metrics
    def mean(vals: list) -> float:
        return sum(vals) / len(vals) if vals else 0.0

    overall = {
        "n_evaluated": len(results),
        "retrieval_precision": mean([r.retrieval_precision for r in results]),
        "retrieval_recall": mean([r.retrieval_recall for r in results]),
        "answer_faithfulness": mean([r.answer_faithfulness for r in results]),
        "answer_correctness": mean([r.answer_correctness for r in results]),
    }
    # Break down by question type
    by_type_results: dict[str, list[RAGSystemResult]] = {}
    for result in results:
        qtype = pair_map[result.question_id].question_type
        by_type_results.setdefault(qtype, []).append(result)
    by_type_metrics = {}
    for qtype, type_results in by_type_results.items():
        by_type_metrics[qtype] = {
            "n": len(type_results),
            "retrieval_recall": mean([r.retrieval_recall for r in type_results]),
            "answer_correctness": mean([r.answer_correctness for r in type_results]),
            "answer_faithfulness": mean([r.answer_faithfulness for r in type_results]),
        }
    # Find worst-performing question types
    sorted_types = sorted(
        by_type_metrics.items(),
        key=lambda x: x[1]["answer_correctness"]
    )
    overall["by_question_type"] = by_type_metrics
    overall["weakest_types"] = [(t, m["answer_correctness"]) for t, m in sorted_types[:3]]
    _print_evaluation_report(overall)
    return overall
def _print_evaluation_report(metrics: dict) -> None:
    """Print a readable evaluation report."""
    print("\nRAG System Evaluation Report")
    print("=" * 50)
    print(f"Questions evaluated: {metrics['n_evaluated']}")
    print("\nOverall Metrics:")
    print(f"  Retrieval Precision: {metrics['retrieval_precision']:.3f}")
    print(f"  Retrieval Recall:    {metrics['retrieval_recall']:.3f}")
    print(f"  Answer Faithfulness: {metrics['answer_faithfulness']:.3f}")
    print(f"  Answer Correctness:  {metrics['answer_correctness']:.3f}")
    print("\nBy Question Type:")
    for qtype, type_metrics in metrics.get("by_question_type", {}).items():
        print(f"  {qtype:15s} recall={type_metrics['retrieval_recall']:.2f} "
              f"correctness={type_metrics['answer_correctness']:.2f} "
              f"(n={type_metrics['n']})")
    print("\nWeakest Areas:")
    for qtype, score in metrics.get("weakest_types", []):
        print(f"  {qtype}: {score:.3f}")
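The rag_query_fn contract is the only coupling between this harness and your system, so it can help to exercise the harness with a stand-in first. The following is a minimal sketch under stated assumptions: TOY_CORPUS, the keyword-overlap ranking, and the echo-style "generation" are all hypothetical placeholders for your real vector store, retriever, and LLM call.

```python
import re

# Hypothetical toy corpus: chunk_id -> chunk text. Replace with your vector store.
TOY_CORPUS = {
    "policy-001": "Refunds are issued within 14 days of purchase.",
    "policy-002": "Enterprise customers can request invoices by email.",
    "policy-003": "Password resets expire after 30 minutes.",
}


def _tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for embedding similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_rag_query_fn(question: str, top_k: int = 2) -> tuple[list[str], list[str], str]:
    """Return (chunk_ids, chunk_contents, answer), the shape run_evaluation expects."""
    q_tokens = _tokens(question)
    # Rank chunks by word overlap with the question (assumption: no real retriever).
    ranked = sorted(
        TOY_CORPUS.items(),
        key=lambda item: len(q_tokens & _tokens(item[1])),
        reverse=True,
    )[:top_k]
    chunk_ids = [cid for cid, _ in ranked]
    chunk_contents = [text for _, text in ranked]
    # Stub "generation": echo the top chunk. A real system would call an LLM here.
    answer = chunk_contents[0] if chunk_contents else ""
    return chunk_ids, chunk_contents, answer
```

Passing toy_rag_query_fn as rag_query_fn lets you check the sampling, scoring, and reporting path end to end before paying for real retrieval and generation calls.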
Domain-Specific Question Templates
For specific domains, use structured templates to ensure realistic question coverage. LLMs default to generating academic-sounding questions. Domain templates ground the generation in real user intent.
# Customer support / help desk
SUPPORT_TEMPLATES = [
    "How do I {action} my account?",
    "What happens if I {scenario}?",
    "Can I {request} without {consequence}?",
    "What is the {policy_name} policy for {situation}?",
    "How long does {process} take?",
    "Why is {feature} not working?",
    "How do I cancel my {subscription_type}?",
]

# Technical documentation
TECHNICAL_TEMPLATES = [
    "How do I configure {parameter} for {use_case}?",
    "What does the {error_code} error mean?",
    "What is the maximum {resource_type} allowed per {scope}?",
    "What are the requirements to use {feature}?",
    "What is the difference between {option_a} and {option_b}?",
    "How do I migrate from {old_version} to {new_version}?",
    "What permissions are required for {action}?",
]

# Legal and compliance
COMPLIANCE_TEMPLATES = [
    "Is {action} permitted under {policy}?",
    "What are the requirements for {process} under {regulation}?",
    "When is {activity} required?",
    "Who is responsible for {obligation}?",
    "What documentation is needed to {process}?",
    "What are the penalties for {violation}?",
]

# Medical and clinical
CLINICAL_TEMPLATES = [
    "What is the recommended {treatment} for {condition}?",
    "What are the contraindications for {medication}?",
    "What dosage of {drug} is appropriate for {patient_type}?",
    "What are the side effects of {intervention}?",
    "When should {screening} be performed?",
]

DOMAIN_TEMPLATES = {
    "support": SUPPORT_TEMPLATES,
    "technical": TECHNICAL_TEMPLATES,
    "compliance": COMPLIANCE_TEMPLATES,
    "clinical": CLINICAL_TEMPLATES,
}

def generate_template_questions(
    chunk: DocumentChunk,
    domain: str,
    n_questions: int = 3,
) -> list[str]:
    """
    Use domain templates to guide question generation.
    Produces more realistic questions than open-ended generation.
    """
    templates = DOMAIN_TEMPLATES.get(domain, [])
    if not templates:
        return []
    # Sample a few more templates than needed, since some will not apply to this chunk
    sampled = random.sample(templates, min(n_questions + 2, len(templates)))
    template_str = "\n".join(f"- {t}" for t in sampled)
    prompt = f"""Given this document excerpt, fill in the question templates to create specific, realistic questions.
Only create questions that can be answered from the document content.
Document:
{chunk.content}
Templates (fill in the {{placeholders}} based on the document):
{template_str}
Output filled-in questions as a JSON array of strings.
Skip templates that don't apply to this document.
ONLY output the JSON array."""
    try:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        # Extract the JSON array even if the model wraps it in extra text
        json_match = re.search(r'\[.*\]', text, re.DOTALL)
        if json_match:
            questions = json.loads(json_match.group())
            # Validate they look like questions
            return [
                q for q in questions
                if isinstance(q, str) and len(q) > 15
                and (q.endswith("?") or any(q.lower().startswith(w) for w in QUESTION_WORDS))
            ][:n_questions]
    except Exception as e:
        # Best-effort generation: report the failure instead of hiding it,
        # then fall through to the empty result
        print(f"  Template question generation failed: {e}")
    return []
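One failure the length and question-word checks do not catch is a template that comes back with a slot still unfilled, such as "What is the {policy_name} policy for refunds?". A minimal sketch of an extra filter follows. The names drop_unfilled and PLACEHOLDER_RE are hypothetical and not part of the pipeline above, and the pattern assumes placeholders use lowercase letters and underscores, as the templates here do.

```python
import re

# Matches leftover template slots such as {action} or {policy_name}.
PLACEHOLDER_RE = re.compile(r"\{[a-z_]+\}")


def drop_unfilled(questions: list[str]) -> list[str]:
    """Keep only questions with no leftover {placeholder} slots."""
    return [q for q in questions if not PLACEHOLDER_RE.search(q)]
```

Applied to the list returned by generate_template_questions, this removes questions the model only partially instantiated.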
Tracking Improvements Over Time
The real value of a synthetic evaluation dataset is longitudinal tracking. Without it, you do not know if a change helped. With it, you can run a regression suite before every deployment.
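A regression check of this kind can be sketched as a comparison between a stored baseline and the current run. This is an illustrative sketch, not a prescribed workflow: find_regressions is a hypothetical helper, the 0.02 tolerance is an arbitrary assumption you should tune to the noise level of your sample size, and the inputs are assumed to be the scalar metrics only (for example retrieval_recall and answer_correctness).

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Return {metric: (baseline, current)} for metrics that dropped by more than tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None:
            continue  # metric not reported in this run
        if base_value - new_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Example with made-up numbers: recall improved slightly, correctness dropped.
baseline = {"retrieval_recall": 0.82, "answer_correctness": 0.74}
current = {"retrieval_recall": 0.83, "answer_correctness": 0.69}
regressions = find_regressions(baseline, current)
```

A deployment gate could then fail the build whenever the returned dict is non-empty.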
Diagnosing Failure Modes
Different metric patterns point to different root causes:
| Pattern | Likely Root Cause | Fix |
|---|---|---|
| Low retrieval recall, low answer correctness | Chunking too coarse; relevant info split across chunk boundaries | Use semantic chunking, increase chunk overlap |
| High retrieval recall, low answer faithfulness | LLM generates beyond what is in retrieved context | Strengthen "answer only from context" in system prompt; add citation requirement |
| High retrieval recall, low answer correctness | Correct chunks retrieved but wrong info extracted | Check if chunks are too long; LLM loses focus mid-chunk |
| High adversarial failure rate (model answers unanswerable questions) | Model does not know its knowledge boundaries | Add explicit examples of declining with "not available in documentation" (few-shot in the prompt, or as training data) |
| Low multi-hop correctness, high factual correctness | Retriever only returns top-1 chunk; multi-hop needs K≥2 | Increase top-K retrieval; add query decomposition for complex questions |
| Good metrics on synthetic, poor on real users | Distribution mismatch: synthetic questions too easy or unnatural | Add real user questions (anonymized) to evaluation set |
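The first three rows of the table can be turned into a rough automated triage step. The sketch below is illustrative only: diagnose is a hypothetical helper, and the low and high thresholds (0.6 and 0.8) are assumptions to calibrate against your own baselines, not recommended values.

```python
def diagnose(metrics: dict[str, float], low: float = 0.6, high: float = 0.8) -> list[str]:
    """Map overall metrics to likely root causes from the table above."""
    recall = metrics["retrieval_recall"]
    faithfulness = metrics["answer_faithfulness"]
    correctness = metrics["answer_correctness"]
    findings = []
    if recall < low and correctness < low:
        findings.append("retrieval: revisit chunking and overlap")
    if recall >= high and faithfulness < low:
        findings.append("generation: answer goes beyond retrieved context")
    if recall >= high and correctness < low:
        findings.append("generation: right chunks, wrong extraction")
    return findings
```

Running it on each entry of by_question_type, rather than only on the overall numbers, is one way to surface patterns such as weak multi-hop performance that averages can hide.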
Common Pitfalls
:::danger Chunks Too Large for Good Questions
If your chunks are 3000+ tokens, the generated questions become high-level and vague ("What does this section discuss?") instead of specific and testable ("What is the timeout for unauthenticated API requests?"). The question quality directly correlates with chunk granularity. Keep chunks between 300 and 800 tokens for question generation. You can use larger chunks in your actual RAG system - generate questions from smaller sub-chunks and evaluate against the larger retrieval units.
:::
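Generating from sub-chunks while evaluating against larger retrieval units requires each sub-chunk to remember its parent. A minimal sketch, assuming word count as a rough proxy for tokens and a hypothetical "parent_id#index" naming scheme for sub-chunk IDs:

```python
def split_into_subchunks(
    parent_id: str,
    text: str,
    max_words: int = 400,
) -> list[dict]:
    """Split one retrieval chunk into smaller pieces for question generation.

    Word count is used as a rough proxy for tokens (assumption).
    Each sub-chunk keeps parent_id so ground truth points at the retrieval unit.
    """
    words = text.split()
    subchunks = []
    for start in range(0, len(words), max_words):
        subchunks.append({
            "subchunk_id": f"{parent_id}#{start // max_words}",
            "parent_id": parent_id,
            "content": " ".join(words[start:start + max_words]),
        })
    return subchunks
```

Questions are generated from each sub-chunk's content, while the parent_id is what gets recorded as the required chunk for retrieval scoring.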
:::warning Skipping Adversarial Questions
Teams that only generate factual and procedural questions build a falsely optimistic picture of their system. They measure what the system should answer but not what it should decline. A system that gets 85% on factual questions but confidently fabricates answers to 70% of adversarial questions is dangerous in a compliance, medical, or legal context. Target at least 15% adversarial questions in your evaluation dataset.
:::
:::tip Stratify by Document Importance
Not all documents are equally critical. A question about your refund policy for Enterprise customers matters more than a question about your blog post from 2019. Weight your evaluation dataset accordingly: use importance_weight > 1.0 on high-stakes chunks (legal terms, pricing, compliance policies) to generate more questions from them. This ensures your metrics reflect real-world failure risk, not just corpus coverage.
:::
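Importance weighting can be sketched as weighted sampling over chunks before question generation. This is a sketch under assumptions: chunks are represented as plain dicts for self-containment (your chunk objects may expose importance_weight as an attribute instead), and sample_chunks_by_importance is a hypothetical helper.

```python
import random
from typing import Optional


def sample_chunks_by_importance(
    chunks: list[dict],
    n: int,
    seed: Optional[int] = None,
) -> list[dict]:
    """Draw n chunks with replacement, proportional to importance_weight."""
    rng = random.Random(seed)
    # Chunks without an explicit weight count as 1.0 (assumption).
    weights = [c.get("importance_weight", 1.0) for c in chunks]
    return rng.choices(chunks, weights=weights, k=n)
```

Sampling with replacement is deliberate here: a high-weight chunk can be drawn several times, which translates into several questions generated from it.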
:::info When to Use LLM Answerability Checking
The use_llm_answerability=True option doubles your filtering cost (one extra API call per QA pair) but significantly improves dataset quality for high-stakes domains. Enable it if you are building an evaluation dataset for a medical, legal, or financial RAG system where false positives (questions your dataset says are answerable but the chunk cannot actually answer) would cause your evaluation to misstate real performance. For general-purpose RAG evaluation, the heuristic filters are sufficient.
:::
Interview Q&A
Q: Why use synthetic QA pairs for RAG evaluation instead of collecting real user questions?
Real user questions are valuable but insufficient as your only evaluation signal. The problems with relying on user data alone: (1) Cold start - you cannot evaluate a RAG system before deploying it, which is exactly when you most need evaluation to decide if it is ready. (2) Coverage bias - real users only ask questions they know to ask; they cannot reveal gaps in your system's knowledge or blind spots in retrieval. (3) Ground truth is expensive - for every real user question, what is the correct answer? You need domain experts to annotate each one. (4) Adversarial coverage - real users rarely deliberately probe your system's hallucination boundaries. Synthetic QA pairs solve all four: available before deployment, engineered to cover specific document sections, include generated ground truth, and can include adversarial examples by design. In mature production systems, you want both: a synthetic evaluation set for systematic pre-deployment coverage, and a sampled real-user set for actual-distribution monitoring. Start synthetic, evolve toward hybrid.
Q: Explain the difference between retrieval precision, retrieval recall, answer faithfulness, and answer correctness in RAG evaluation. Why does each matter separately?
These four metrics expose four different failure modes that require different fixes. Retrieval precision measures what fraction of retrieved chunks were actually relevant. Low precision means the retriever is noisy - it gets the right chunks but also floods the context with irrelevant ones, which degrades generation. Retrieval recall measures what fraction of the required chunks were retrieved. Low recall means the retriever missed relevant chunks entirely - the generator cannot produce a correct answer from incomplete context. Answer faithfulness measures whether the generated answer is supported by the retrieved context. Low faithfulness means the LLM is hallucinating - generating claims not present in the retrieved chunks, regardless of whether retrieval worked. Answer correctness measures whether the generated answer matches the ground truth. Low correctness with high faithfulness means the generator is accurately summarizing the wrong chunks (retrieval failed). Low correctness with high retrieval recall means the generator is failing to synthesize the right information from the right chunks (generation failed). You need all four metrics to correctly diagnose a failure.
Q: How do you build a multi-hop question that genuinely requires two chunks, rather than one that can be answered from either chunk alone?
The key is selecting chunk pairs that have a natural dependency or connection: one chunk defines a term or concept that the other chunk uses but does not define; one chunk describes a policy that another chunk applies to a specific case; one chunk contains a prerequisite condition and the other contains what happens when that condition is met. The generation prompt must explicitly state that the question cannot be answered from either chunk alone and must require combining information. After generation, validate with a three-step answerability check: ask the LLM to answer the question given only Chunk A (should fail), then given only Chunk B (should fail), and finally given both together (should succeed). Questions that can be answered from one chunk alone should be filtered out. In practice, selecting non-adjacent chunks from the same document (separated by two or more chunks) produces better multi-hop questions because they are more likely to contain related but non-overlapping information.
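The answerability check described in this answer can be sketched with the judge injected as a callable. The can_answer function is a hypothetical stand-in, in practice an LLM call that returns True when the given context alone is enough to answer the question. Nothing here is tied to a specific API.

```python
from typing import Callable


def is_true_multi_hop(
    question: str,
    chunk_a: str,
    chunk_b: str,
    can_answer: Callable[[str, str], bool],
) -> bool:
    """True only if the question needs both chunks.

    can_answer(question, context) is a hypothetical judge, e.g. an LLM call
    that returns True when the context alone is enough to answer.
    """
    if can_answer(question, chunk_a):
        return False  # answerable from A alone
    if can_answer(question, chunk_b):
        return False  # answerable from B alone
    # Must succeed once both chunks are available together.
    return can_answer(question, chunk_a + "\n\n" + chunk_b)
```

Injecting the judge keeps the filter testable with a deterministic fake, and lets you swap judge models without touching the filtering logic.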
Q: How would you use your synthetic RAG evaluation dataset to systematically improve your chunking strategy?
Run the evaluation dataset through the retrieval phase only (skip generation) and measure retrieval recall for each chunking strategy under test. For each QA pair, you know which chunks are required to answer it (stored in source_chunk_ids). Because re-chunking changes chunk boundaries and IDs, first map each required chunk to the candidate strategy's chunks that cover the same source text, then check whether those chunks are retrieved in the top-K results. The chunking strategy that maximizes recall on multi-hop questions (which require retrieving two or more chunks from different sections) while maintaining precision on factual questions is the winner for your corpus. Practically: generate the dataset with your initial chunking strategy, then re-chunk the corpus with candidate strategies (different sizes, overlap amounts, semantic vs. fixed-size boundaries), run the retrieval phase of each strategy against the same question set, compare recall@5 and recall@10 across strategies, and select the one that scores best on your highest-priority question types. This gives you an objective, corpus-specific benchmark for what is often a subjective choice.
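The recall@K comparison can be sketched as follows. The helper names are hypothetical, and the sketch assumes the required chunk IDs for each question are already expressed in the ID space of the chunking strategy being scored.

```python
def recall_at_k(retrieved_ids: list[str], required_ids: list[str], k: int) -> float:
    """Fraction of required chunks that appear in the top-k retrieved results."""
    if not required_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for cid in required_ids if cid in top_k) / len(required_ids)


def mean_recall_at_k(runs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, required_ids) pairs for one strategy."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, req, k) for r, req in runs) / len(runs)
```

Computing mean_recall_at_k at k=5 and k=10 for each candidate strategy over the same question set gives the side-by-side numbers described above.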
Q: What special considerations apply when building a RAG evaluation dataset for sensitive domains like healthcare or legal?
Three main concerns. First, factual accuracy of generated ground truth: in sensitive domains, a wrong "ground truth" answer that your system matches will look like success but is actually a failure - and potentially a dangerous one. For clinical or legal content, generated answers must be validated by domain experts before use. Do not trust the LLM to correctly summarize dosage instructions or legal liability conditions. Second, privacy: your source documents may contain PHI (protected health information) or privileged legal communications. Mask or pseudonymize any PII before running chunks through question generation, and audit generated questions for accidental PII leakage. Third, adversarial coverage is more important in sensitive domains: a legal RAG system that makes up case citations it cannot find in the corpus, or a clinical system that invents drug dosages, can cause direct harm. Target 20-25% adversarial questions (higher than the 15% baseline) and include specific adversarial probes for the highest-risk failure modes in your domain - incorrect drug interactions, superseded legal statutes, outdated clinical guidelines. In high-stakes domains, treat the synthetic dataset as a starting point requiring expert validation, not a production-ready artifact.
Q: How do you handle distribution mismatch between your synthetic evaluation dataset and the real user questions your RAG system receives in production?
Distribution mismatch is real and unavoidable: LLMs tend to generate more formal, complete questions than real users ask, and they bias toward questions that are clearly answerable from the corpus. Three mitigation strategies. First, seed from real user logs: if you have any production query history (even a few hundred examples), cluster them by topic and intent, then use those clusters as seeds for synthetic generation - this grounds your diversity in real user intent rather than LLM imagination. Second, use domain-specific question templates (as shown above) that encode how users in your specific domain actually phrase questions - support templates for customer-facing systems, clinical templates for healthcare. Third, monitor for gaps post-deployment: periodically sample real user queries, cluster them, and compare the cluster distribution against your synthetic dataset clusters. When you see real user clusters underrepresented in your evaluation set, generate more synthetic examples from those clusters and add them to your dataset. Think of the synthetic dataset not as a fixed artifact but as a living document that converges toward your real user distribution over time.
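The cluster comparison in the third strategy can be sketched once each query has a cluster label. The sketch assumes labels already exist (how you cluster is up to you), and both underrepresented_clusters and the min_ratio default of 0.5 are hypothetical choices for illustration.

```python
from collections import Counter


def underrepresented_clusters(
    real_labels: list[str],
    synthetic_labels: list[str],
    min_ratio: float = 0.5,
) -> list[str]:
    """Clusters whose synthetic share is below min_ratio times their real-traffic share."""
    real_counts = Counter(real_labels)
    synth_counts = Counter(synthetic_labels)
    n_real, n_synth = len(real_labels), len(synthetic_labels)
    if n_real == 0:
        return []
    gaps = []
    for cluster, count in real_counts.items():
        real_share = count / n_real
        synth_share = synth_counts.get(cluster, 0) / n_synth if n_synth else 0.0
        if synth_share < min_ratio * real_share:
            gaps.append(cluster)
    return sorted(gaps)
```

The returned clusters are the ones to target with additional synthetic generation in the next refresh of the dataset.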
