Query Transformation and HyDE
The System That Worked for Half Its Users
The customer support RAG system posted a 91% satisfaction score in its first week. The team celebrated. They had built it right - comprehensive documentation ingested, embeddings fresh, vector database properly tuned with HNSW indexes. The system worked beautifully for users who happened to phrase questions the way the documentation was written.
Then the support tickets started. A user typed "My API keeps timing out" and got exactly the right result - the docs used "timeout" as their vocabulary, the embedding aligned, the retrieved chunks were perfect. But the next user typed "I'm getting 504s when the load is high." Same problem, completely different vocabulary. The documentation talked about "gateway timeout errors" and "request timeout thresholds." It never used "504s." The embedding of "504s under load" sat in an entirely different neighborhood of semantic space than "gateway timeout." The system returned chunks about load balancing configuration - plausible, confidently presented, and useless.
It got worse with vague questions. "Why is my integration slow?" could mean connection pool exhaustion, network latency, slow database queries, API rate limiting, or a dozen other causes. A single embedding vector cannot represent this ambiguity - it collapses to a centroid equidistant from all possible answers, which means it is near none of them. The system retrieved a broad cloud of loosely related content. The LLM synthesized a response that mentioned several possible causes in a generic way that satisfied nobody. Users who needed a real diagnosis got a list of things to check.
The root cause was easy to miss during development: naive RAG assumes the user's query and the answer documents share vocabulary. In technical support, this assumption breaks constantly. Users describe symptoms in their own language - "504s," "it's slow," "keeps crashing," "my app is broken." Documents describe systems in technical language - "HTTP 504 Gateway Timeout," "request processing latency," "unhandled exception during runtime," "application process termination." The semantic gap between question and answer is often enormous, and embedding a raw user query does nothing to bridge it.
Query transformation is the engineering discipline built to close this gap. Instead of searching with the raw user query, you transform it first - into a hypothetical answer, into multiple paraphrases that explore different semantic neighborhoods, into a more abstract question that retrieves foundational context, or into a set of decomposed sub-questions for multi-hop problems. The transformation happens before vector search, and its quality directly determines what your RAG system finds.
This lesson covers five transformation techniques in production depth: HyDE, multi-query retrieval with Reciprocal Rank Fusion, step-back prompting, query decomposition, and query routing. For each, you will understand the geometric and linguistic intuition, the failure modes, the cost model, and how to integrate it into a production pipeline that automatically selects the right strategy.
Why Query Transformation Exists
The Vocabulary Mismatch Problem
When you embed a query and search for similar vectors, you are making one fundamental assumption: the semantic content of the question and the semantic content of the relevant answers occupy nearby regions in the embedding space. This assumption holds well for well-formed queries against well-formed documents in matching registers. It breaks down in ways that are predictable and systematically harmful.
User language vs. document language: A user writes "my app crashes when I save." The documentation says "null pointer exception during persistence layer invocation." These describe the same event. But "crashes when I save" embeds near informal user forum posts and casual descriptions of software failures - not near formal documentation about exception handling. Dense retrieval finds things that sound like the question rather than things that answer it.
Domain abbreviation gaps: Technical users communicate in abbreviations: "504s," "OOM," "SIGKILL," "p99 latency," "CKD stage 3." These terms may appear rarely in the embedding model's training corpus or may be tokenized as rare token sequences. The resulting embeddings are low-quality - they do not reliably place the query near related concepts. A query containing "OOM" may not retrieve documents about "out-of-memory errors" or "memory allocation failures" because the abbreviation and the full term sit in different regions of embedding space.
Cross-domain synonymy: "Machine learning model training," "neural network optimization," "model fitting," and "gradient descent learning" all describe the same activity. Different documents written by different authors will use different terms. A single query using one term misses documents using the other terms.
The Specificity Problem
Vague queries are structurally ambiguous - they are simultaneously compatible with many specific questions. "Why is it slow?" is not a single query in any meaningful sense. It is a family of queries with different correct answers depending on context. A single embedding vector cannot represent this family - it collapses to a single point. The result is retrieval that is superficially broad but practically useless: documents about many different causes of slowness, none of which is necessarily the right cause for this user.
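The collapse described above can be illustrated with a toy calculation. The sketch below assumes hand-made 3-dimensional vectors, one axis per interpretation, and models the vague query as their centroid; real embeddings are not this clean, so treat it as an illustration of the geometry rather than a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: each interpretation of "Why is it slow?"
# gets its own axis (an assumption made purely for illustration).
interpretations = {
    "connection pool exhaustion": [1.0, 0.0, 0.0],
    "slow database queries": [0.0, 1.0, 0.0],
    "API rate limiting": [0.0, 0.0, 1.0],
}

# Model the vague query as the centroid of all interpretations.
vectors = list(interpretations.values())
vague_query = [sum(column) / len(vectors) for column in zip(*vectors)]

similarities = {
    name: cosine(vague_query, vec) for name, vec in interpretations.items()
}
for name, sim in similarities.items():
    print(f"{name}: {sim:.3f}")
```

Under these assumptions the centroid is equally similar to every interpretation and strongly similar to none, and the similarity shrinks further as the number of plausible interpretations grows.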
The correct handling of vague queries is not to retrieve broadly and hope the LLM figures it out. It is to decompose the ambiguity into multiple specific questions and retrieve for each - or to route the user to a clarification flow before retrieval begins.
The Multi-Facet Problem
Complex questions require multiple retrieval passes because relevant information is distributed across multiple non-adjacent regions of document space. "What are the tax implications of converting a traditional IRA to a Roth IRA in a year with significant capital gains?" touches IRA conversion tax treatment, ordinary income vs. capital gains interaction, annual income phase-outs, and potentially state-level rules. No single query vector sits near all of these simultaneously. You need separate retrieval passes - one per facet - and synthesis.
The Hypothesis Mismatch
Dense retrieval finds documents semantically similar to the query. But you want documents that answer the query, not documents that resemble the query. A question and its answer have different structure, different vocabulary, and different register. "What causes X?" and "X is caused by Y because of mechanism Z" share semantic meaning but sit in different regions of embedding space. The embedding of a question is near other questions - near the way people ask things, not near the way documentation explains things.
HyDE directly resolves this: instead of embedding the question, generate a document that would answer the question and embed that. A hypothetical answer-shaped text naturally lands near real answer documents in embedding space.
Historical Context
The recognition that query vocabulary and document vocabulary diverge has driven information retrieval research since the early days of the field. Pseudo-relevance feedback (Rocchio, 1971) expanded queries by extracting terms from the top-retrieved documents, adding them back to the original query, and re-retrieving. The weakness was circular: the quality of expansion depended on the quality of the initial retrieval, which was often poor precisely because of the vocabulary mismatch being addressed.
Word-based query expansion using synonym dictionaries and WordNet attempted to bridge vocabulary gaps lexically, with limited success - synonymy at the word level does not capture semantic equivalence at the document level.
Dense retrieval (DPR, Karpukhin et al., 2020) improved vocabulary mismatch handling significantly by embedding both queries and documents into a shared semantic space where related concepts cluster regardless of exact lexical form. But the question-answer geometric gap remained.
HyDE (Hypothetical Document Embeddings) was introduced by Gao et al. in December 2022 in "Precise Zero-Shot Dense Retrieval without Relevance Labels." The core insight was counterintuitive: do not try to improve the query embedding directly. Instead, generate a document that would answer the query - factually unreliable but stylistically correct - and embed that. The generated document shares vocabulary, structure, and register with real answer documents. Its embedding sits geometrically near real answers even when the raw question embedding does not.
Step-back prompting (Zheng et al., Google DeepMind, 2023) observed that LLMs reason better when they first reason about the general principle before the specific instance. Applied to RAG: stepping back to a general question retrieves background context that grounds the specific answer.
Multi-query retrieval emerged as a practical engineering pattern as teams discovered high retrieval variance from single queries. Combined with Reciprocal Rank Fusion (Cormack et al., 2009) for stable result merging, it became a production standard by 2023.
Query decomposition has roots in complex question answering research but became practically deployable only when capable instruction-following LLMs could reliably decompose natural language questions and synthesize sub-answers.
Transformation Technique 1: HyDE
The Geometric Intuition
In a high-dimensional embedding space, documents cluster by semantic content and linguistic register. Technical documentation clusters together. User support forum posts cluster together. Research papers cluster together. Questions cluster with other questions.
When you embed a user's question, the resulting vector lands in the "question cluster" - geometrically near other questions, not near answers. The answer to the question lives in the "answer cluster," which may be far away in embedding space even though it is semantically relevant.
HyDE generates a text that looks like an answer - formal, declarative, documentation-style - and embeds that instead. This generated text lands in the answer cluster, near real answer documents, making the search vector geometrically appropriate for finding answers rather than finding other questions.
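Mathematically: given a query $q$, instead of computing $\mathbf{v}_q = \text{embed}(q)$ and searching with it, compute:

$$\hat{d} = \text{LLM}(q), \qquad \mathbf{v}_{\text{HyDE}} = \text{embed}(\hat{d})$$

and search with $\mathbf{v}_{\text{HyDE}}$, where $\hat{d}$ is the generated hypothetical document.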
The hypothetical document is factually unreliable - the LLM may hallucinate specific details. But factual accuracy is not the goal of hypothetical generation. The goal is geometric placement: producing a text whose embedding sits in the right neighborhood of the vector space.
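A minimal sketch of this placement effect follows. It assumes a toy bag-of-words `embed` function in place of a real embedding model and a hard-coded `hypothetical_doc` in place of an LLM call, so both are illustrative stand-ins rather than part of any library.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "I'm getting 504s when the load is high"

# Assumption: this is what an LLM might write when asked for a
# documentation-style answer. Its facts do not need to be correct.
hypothetical_doc = (
    "HTTP 504 gateway timeout errors occur when the upstream server does not "
    "respond within the configured timeout threshold, typically under high load."
)

# A real document from the corpus, written in documentation register.
real_doc = (
    "Gateway timeout errors (HTTP 504) occur when the upstream server fails to "
    "respond within the configured timeout window."
)

direct_score = cosine(embed(query), embed(real_doc))
hyde_score = cosine(embed(hypothetical_doc), embed(real_doc))
print(f"direct: {direct_score:.3f}  hyde: {hyde_score:.3f}")
```

Even with this crude lexical embedding, the hypothetical document scores much closer to the real document than the raw query does, because it shares the documentation's vocabulary ("gateway timeout", "upstream server") while the query's "504s" matches nothing.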
When HyDE Works
HyDE performs best when:
- Documents use formal, structured language that a well-prompted LLM can replicate: API documentation, research papers, legal contracts, medical references, software specifications.
- The vocabulary gap is a register gap: the user speaks informally, the documents speak formally, and LLM generation bridges the register difference.
- The domain is within the LLM's training distribution: the LLM can generate a plausible hypothetical because it has seen similar documents during training.
When HyDE Fails
HyDE fails in predictable situations:
- Factual lookup over proprietary data: "What was our Q3 2024 revenue?" The LLM generates a plausible number that is almost certainly wrong. The embedding of the wrong hypothetical pulls search toward documents containing similar wrong numbers, potentially missing the document with the correct number.
- Highly proprietary knowledge: Internal procedures, proprietary research, private customer data - the LLM has never seen this content and cannot generate a useful hypothetical. It will generate a generic document that matches common patterns, not your specific corpus.
- Maximally ambiguous queries: "Why is it broken?" is too vague for the LLM to generate a useful specific hypothetical. The LLM picks one interpretation and collapses the ambiguity. Multi-query is better here.
Cost and Latency Model
HyDE adds exactly one LLM call per query. Using claude-haiku-4-5-20251001 for hypothetical generation:
- Token count: typically 100-300 tokens output
- Latency: 100-300ms additional
- Cost: approximately $0.00003-0.00009 per query
This overhead is negligible for most production use cases. The retrieval quality improvement from HyDE typically justifies this cost many times over.
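To put the per-query figure in context, the arithmetic below projects it to a monthly bill. It takes the per-query range quoted above at face value and assumes a hypothetical volume of one million queries per month; re-derive the per-query cost from your provider's current price sheet before relying on the result.

```python
def monthly_overhead_usd(queries_per_month: int, cost_per_query_usd: float) -> float:
    """Extra monthly spend from one additional LLM call per query."""
    return queries_per_month * cost_per_query_usd

QUERIES_PER_MONTH = 1_000_000      # hypothetical traffic volume
LOW_COST, HIGH_COST = 0.00003, 0.00009  # per-query range quoted in the text

low_total = monthly_overhead_usd(QUERIES_PER_MONTH, LOW_COST)
high_total = monthly_overhead_usd(QUERIES_PER_MONTH, HIGH_COST)
print(f"${low_total:.2f} to ${high_total:.2f} per month")
```

At that volume the overhead works out to roughly $30 to $90 per month, which is usually small next to the cost of the final answer-generation call.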
Transformation Technique 2: Multi-Query Retrieval with RRF
The Core Idea
A single query is a single point in semantic space. A single point explores one neighborhood. If relevant documents are distributed across multiple neighborhoods - because the question is broad, because different authors use different vocabulary - a single query misses them.
Multi-query retrieval generates semantically diverse paraphrases of the original query. Each paraphrase explores a different neighborhood. The union of all retrieved results is richer than any single retrieval. Reciprocal Rank Fusion merges the ranked lists into a single coherent ranking that rewards documents consistently relevant across multiple query formulations.
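Reciprocal Rank Fusion
Given $n$ ranked lists $R_1, \dots, R_n$ and a document $d$ appearing at position $\text{rank}_i(d)$ in list $R_i$:

$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(d)}$$

Lists in which $d$ does not appear contribute nothing to the sum.
The constant $k = 60$ was empirically determined by Cormack et al. (2009) to work well across diverse retrieval tasks. It flattens the score differences between top positions, preventing any single highly-ranked result from dominating. Without it ($k = 0$), rank 1 would score 1.0 and rank 2 would score 0.5 - a huge gap. With $k = 60$, rank 1 scores $1/61 \approx 0.0164$ and rank 2 scores $1/62 \approx 0.0161$ - a stable, robust margin.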
Documents appearing near the top of multiple ranked lists accumulate high RRF scores. Documents appearing in only one list, even at rank 1, score lower than documents consistently present across multiple lists. This property makes RRF reward robustness over cherry-picked relevance.
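A small worked example makes this property concrete. The sketch below is a minimal RRF over lists of document IDs; the IDs and the three ranked lists are made up for illustration.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Three query variants. doc-A, doc-D, doc-E each win one list outright;
# doc-B and doc-C never rank first but appear in every list.
lists = [
    ["doc-A", "doc-B", "doc-C"],
    ["doc-D", "doc-B", "doc-C"],
    ["doc-E", "doc-C", "doc-B"],
]

fused = rrf(lists)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Here `doc-B` and `doc-C` finish first and second even though neither ever ranks first in a single list: three mid-list appearances sum to about three times what a lone first place earns.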
When to Use Multi-Query
Multi-query is best for:
- Broad conceptual questions: "How does authentication work?" can legitimately retrieve from OAuth, JWT, sessions, API keys, and more.
- High-recall requirements: Research contexts where missing a relevant document is more costly than including irrelevant ones.
- Diverse terminology in corpus: If your document corpus was written by many authors using varying vocabulary, diverse query paraphrases explore the full vocabulary space.
Multi-query is less useful for:
- Narrow factual queries: "What is the rate limit?" has one answer and one neighborhood. Multiple paraphrases all point to the same document.
- Very low latency budgets: Even though vector searches run in parallel, generating paraphrases adds one LLM call.
Transformation Technique 3: Step-Back Prompting
The Core Idea
Step-back prompting (Zheng et al., Google DeepMind, 2023) generates a more abstract, general version of a specific question. Both the specific and the abstract question are used for retrieval. The abstract question retrieves background context - foundational documents about the general principle. The specific question retrieves targeted documents about the specific instance. Together, they give the LLM the context needed to answer correctly.
The linguistic pattern:
Specific: "What is the max connection pool size for Aurora PostgreSQL 15?"
Step-back: "How does PostgreSQL connection pooling work and what limits it?"
Specific: "Why does my React hook cause infinite re-renders?"
Step-back: "How does React determine when to re-render components?"
Specific: "What is the side effect of ibuprofen on kidneys with CKD stage 3?"
Step-back: "How do NSAIDs interact with kidney function and clearance?"
The step-back question retrieves documents that establish the principle. The specific question retrieves documents about the specific case. The LLM receives both layers of context and can answer correctly even when the specific document alone would be insufficient.
When Step-Back Works
Step-back is most effective for:
- Troubleshooting questions: "Why does X fail in Y situation?" requires understanding the general behavior of X before diagnosing the specific failure.
- Scientific and medical questions: Specific clinical questions almost always require understanding the underlying mechanism.
- System design questions: "Should I use sharding or replication for this use case?" requires understanding both approaches at a general level.
Transformation Technique 4: Query Decomposition
Sequential vs. Parallel Decomposition
Multi-hop questions require answering intermediate questions before the final question can be answered. Decomposition breaks the question into a dependency graph of sub-questions and executes them in order.
Sequential (chain): Each sub-question's answer informs the next.
Complex: "Who was the CEO of the company that acquired the startup that built the vector database used in our RAG system?"
Sub-Q1: "Which vector database does our RAG system use?"
→ Answer: "Pinecone"
Sub-Q2: "Who acquired Pinecone, if anyone?"
→ Answer: "Pinecone is independent as of 2024"
Sub-Q3: "Who is Pinecone's CEO?"
→ Answer: "Edo Liberty"
Final: "Edo Liberty is the CEO of Pinecone, which built the vector database used in the system."
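The chaining mechanics can be sketched in a few lines. In the sketch below, `answer_sub_question` is a hypothetical stand-in for one retrieve-then-generate pass (it returns canned answers taken from the example above), and the `{prev}` placeholder convention is an assumption made for illustration.

```python
def answer_sub_question(question: str) -> str:
    """Hypothetical stand-in for one retrieve-then-generate pass."""
    canned = {
        "Which vector database does our RAG system use?": "Pinecone",
        "Who is the CEO of Pinecone?": "Edo Liberty",
    }
    return canned.get(question, "unknown")

def run_chain(templates: list[str]) -> list[tuple[str, str]]:
    """Run sub-questions in order; '{prev}' is filled with the previous answer."""
    trace = []
    prev = ""
    for template in templates:
        question = template.format(prev=prev)
        prev = answer_sub_question(question)
        trace.append((question, prev))
    return trace

trace = run_chain([
    "Which vector database does our RAG system use?",
    "Who is the CEO of {prev}?",
])
for question, answer in trace:
    print(f"{question} -> {answer}")
```

Because each step waits on the previous answer, latency grows linearly with chain length, and an early wrong answer propagates to every later step.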
Parallel (fan-out): Sub-questions are independent and can execute simultaneously.
Complex: "Compare the pricing and performance of Aurora PostgreSQL vs. AlloyDB for our use case"
Sub-Q1: "What are the pricing tiers for Aurora PostgreSQL?"
Sub-Q2: "What are the pricing tiers for AlloyDB?"
Sub-Q3: "What are the performance benchmarks comparing Aurora PostgreSQL and AlloyDB?"
All three run in parallel → synthesize.
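A fan-out like this maps naturally onto `asyncio.gather`. The sketch below assumes a hypothetical async `answer_sub_question` coroutine that simulates I/O with a short sleep; in a real pipeline it would call an async vector store and LLM client.

```python
import asyncio

async def answer_sub_question(question: str) -> str:
    """Hypothetical stand-in for an async retrieve-then-generate pass."""
    await asyncio.sleep(0.01)  # simulates network and model latency
    return f"[context for: {question}]"

async def fan_out(sub_questions: list[str]) -> dict[str, str]:
    """Run independent sub-questions concurrently; gather preserves input order."""
    answers = await asyncio.gather(
        *(answer_sub_question(q) for q in sub_questions)
    )
    return dict(zip(sub_questions, answers))

sub_questions = [
    "What are the pricing tiers for Aurora PostgreSQL?",
    "What are the pricing tiers for AlloyDB?",
    "What are the performance benchmarks comparing Aurora PostgreSQL and AlloyDB?",
]

results = asyncio.run(fan_out(sub_questions))
for question, context in results.items():
    print(question, "->", context)
```

Since the sub-questions do not depend on each other, total latency is close to that of the slowest single call rather than the sum of all three; the synthesis step then receives `results` as its context.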
When to Use Decomposition
Decomposition adds significant latency (multiple sequential LLM calls and vector searches). Use it only when:
- The question is genuinely multi-hop and cannot be answered from a single retrieval
- The answer quality justifies the latency cost
- You are in an async or batch context (not real-time chat)
Transformation Technique 5: Query Routing
The Meta-Level Transformation
Routing does not transform the query - it transforms the retrieval strategy. A classifier determines what kind of question is being asked, then dispatches to the appropriate retrieval approach. Different question types have fundamentally different optimal retrieval strategies.
| Query Type | Example | Best Strategy | Reason |
|---|---|---|---|
| FACTUAL | "What is the rate limit?" | Direct / BM25 | Exact term match outperforms semantic search for specific facts |
| PROCEDURAL | "How do I set up OAuth?" | HyDE | Register gap between how-to questions and how-to docs |
| CONCEPTUAL | "How does caching work?" | Multi-query | Need broad coverage of a topic |
| COMPARATIVE | "Which is better, A or B?" | Decomposition | Need separate retrieval for each option |
| TROUBLESHOOTING | "Why is X failing?" | Step-back | Need foundational context to diagnose specific problem |
The routing classifier is cheap - a single call to claude-haiku-4-5-20251001 with a few dozen tokens. The cost is trivial. The benefit is that each query gets the retrieval strategy it actually needs rather than a one-size-fits-all approach.
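The dispatch side of routing is just a lookup table. In the sketch below, the keyword-based `classify` function is a deliberately crude stand-in for the LLM classifier call, and the strategy names are hypothetical labels, so only the table-driven structure should be taken from it.

```python
from enum import Enum

class QueryType(Enum):
    FACTUAL = "FACTUAL"
    PROCEDURAL = "PROCEDURAL"
    CONCEPTUAL = "CONCEPTUAL"
    COMPARATIVE = "COMPARATIVE"
    TROUBLESHOOTING = "TROUBLESHOOTING"

# Mirrors the table above; the strategy names are illustrative labels.
STRATEGY_BY_TYPE = {
    QueryType.FACTUAL: "direct_or_bm25",
    QueryType.PROCEDURAL: "hyde",
    QueryType.CONCEPTUAL: "multi_query",
    QueryType.COMPARATIVE: "decomposition",
    QueryType.TROUBLESHOOTING: "step_back",
}

def classify(query: str) -> QueryType:
    """Keyword heuristic standing in for the LLM classifier (an assumption)."""
    q = query.lower()
    if q.startswith("why") or "failing" in q or "error" in q:
        return QueryType.TROUBLESHOOTING
    if q.startswith("how do i") or q.startswith("how to"):
        return QueryType.PROCEDURAL
    if " or " in q or " vs" in q or q.startswith("which"):
        return QueryType.COMPARATIVE
    if q.startswith("how does"):
        return QueryType.CONCEPTUAL
    return QueryType.FACTUAL  # safe default: plain retrieval

def route(query: str) -> str:
    """Return the name of the retrieval strategy to dispatch to."""
    return STRATEGY_BY_TYPE[classify(query)]

for example in ["What is the rate limit?", "Why is X failing?"]:
    print(example, "->", route(example))
```

Keeping the mapping in a dictionary means a new query type or a changed strategy is a one-line edit, and falling back to direct retrieval keeps a misclassification from being worse than the unrouted baseline.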
Production Code
import anthropic
import asyncio
import numpy as np
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
import json
import re
client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()
# ---------------------------------------------
# Core Data Structures
# ---------------------------------------------


@dataclass
class RetrievedChunk:
    """A single retrieved document chunk with metadata."""
    chunk_id: str
    content: str
    score: float
    source: str
    rank: int = 0


@dataclass
class QueryTransformResult:
    """Result from a query transformation pipeline run."""
    original_query: str
    strategy_used: str
    transformed_queries: list[str]
    retrieved_chunks: list[RetrievedChunk]
    final_answer: str
    total_llm_calls: int
    metadata: dict = field(default_factory=dict)


class QueryType(Enum):
    FACTUAL = "FACTUAL"
    PROCEDURAL = "PROCEDURAL"
    CONCEPTUAL = "CONCEPTUAL"
    COMPARATIVE = "COMPARATIVE"
    TROUBLESHOOTING = "TROUBLESHOOTING"


# ---------------------------------------------
# Mock Vector Store (replace with Pinecone, Weaviate, pgvector, etc.)
# ---------------------------------------------


class MockVectorStore:
    """
    Simulated vector store for demonstration.

    In production: swap for your actual vector database client.
    The interface contract: search(query: str, k: int) -> list[RetrievedChunk]
    """

    def __init__(self):
        self._documents = [
            RetrievedChunk(
                chunk_id="doc-001",
                content=(
                    "Gateway timeout errors (HTTP 504) occur when the upstream server fails to respond "
                    "within the configured timeout window. Common causes include high server load, slow "
                    "database queries, and network congestion. The default timeout threshold is 30 seconds. "
                    "Under sustained high load, connection queuing can cause cascading 504s even when the "
                    "underlying service is functional."
                ),
                score=0.0,
                source="api-docs/errors.md",
            ),
            RetrievedChunk(
                chunk_id="doc-002",
                content=(
                    "Request timeout thresholds can be configured in the client SDK using the "
                    "timeout_seconds parameter. For high-load scenarios, consider increasing the timeout "
                    "to 60-120 seconds and implementing exponential backoff with jitter. The recommended "
                    "backoff formula is: wait = min(cap, base * 2^attempt) + random_jitter."
                ),
                score=0.0,
                source="api-docs/configuration.md",
            ),
            RetrievedChunk(
                chunk_id="doc-003",
                content=(
                    "Connection pool exhaustion is a frequent cause of slowdowns and timeouts under load. "
                    "Configure max_connections appropriately for your concurrency level. Monitor the "
                    "active_connections metric to detect pool starvation. When the pool is exhausted, "
                    "new requests queue until a connection is released, causing latency spikes."
                ),
                score=0.0,
                source="api-docs/performance.md",
            ),
            RetrievedChunk(
                chunk_id="doc-004",
                content=(
                    "Rate limiting returns HTTP 429 Too Many Requests. Implement token bucket or "
                    "leaky bucket algorithms client-side to stay within rate limits. The Retry-After "
                    "response header specifies the required backoff duration in seconds. Exceeding rate "
                    "limits does not count against your account but requests are dropped."
                ),
                score=0.0,
                source="api-docs/rate-limiting.md",
            ),
            RetrievedChunk(
                chunk_id="doc-005",
                content=(
                    "Database query optimization: use EXPLAIN ANALYZE to identify slow queries. "
                    "Add composite indexes on frequently filtered column combinations. Consider "
                    "connection pooling with PgBouncer to reduce PostgreSQL connection overhead. "
                    "Slow queries under load often indicate missing indexes on filter predicates."
                ),
                score=0.0,
                source="backend/database.md",
            ),
            RetrievedChunk(
                chunk_id="doc-006",
                content=(
                    "API authentication uses Bearer token scheme with JWT. Tokens expire after 3600 "
                    "seconds. Refresh tokens are valid for 30 days. When a token expires, the API "
                    "returns HTTP 401 Unauthorized. Implement automatic token refresh in your client "
                    "to avoid user-visible authentication errors."
                ),
                score=0.0,
                source="api-docs/authentication.md",
            ),
        ]

    def search(self, query: str, k: int = 5) -> list[RetrievedChunk]:
        """
        Simulated semantic search using keyword overlap scoring.

        In production: replace with real vector similarity search.
        """
        query_words = set(query.lower().split())
        scored = []
        for doc in self._documents:
            doc_words = set(doc.content.lower().split())
            overlap = len(query_words & doc_words)
            # Fraction of query words found in the document.
            score = overlap / max(len(query_words), 1)
            # Copy the chunk so stored documents are never mutated.
            scored.append(RetrievedChunk(
                chunk_id=doc.chunk_id,
                content=doc.content,
                score=round(score, 4),
                source=doc.source,
            ))
        scored.sort(key=lambda x: x.score, reverse=True)
        results = scored[:k]
        for i, chunk in enumerate(results):
            chunk.rank = i + 1  # 1-based rank within this result list
        return results


_vector_store = MockVectorStore()


# ---------------------------------------------
# Technique 1: HyDE Retriever
# ---------------------------------------------


class HyDERetriever:
    """
    Hypothetical Document Embeddings retriever (Gao et al., 2022).

    Core idea: embed a hypothetical answer document instead of the raw query.
    The hypothetical document's embedding sits in the same geometric region
    as real answer documents - not in the "question" region where the raw
    query embedding lands.

    Paper: https://arxiv.org/abs/2212.10496
    """

    _HYPOTHESIS_PROMPT = """You are a technical documentation expert.
Given a user question, write a concise paragraph (4-6 sentences) that would appear
in technical documentation and directly answer this question. Write in a formal,
documentation-style register - declarative sentences, precise terminology.

Important: Write as if you are the documentation. Do NOT write "The answer is..." or "I think...".
If you are uncertain about specific details, use plausible but hedged technical language.
The goal is to produce text in the same style and vocabulary as real answer documents.

Question: {query}

Documentation paragraph answering this question:"""

    def generate_hypothetical_document(self, query: str) -> str:
        """
        Generate a hypothetical answer document using claude-haiku-4-5-20251001.

        Uses the cheap/fast model because hypothetical generation is a formatting
        and style task, not a deep reasoning task. The factual content of the
        hypothesis does not matter - only its stylistic register matters.
        """
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=350,
            messages=[{
                "role": "user",
                "content": self._HYPOTHESIS_PROMPT.format(query=query)
            }]
        )
        return message.content[0].text.strip()

    def retrieve_with_hyde(self, query: str, k: int = 5) -> tuple[list[RetrievedChunk], str]:
        """
        Execute HyDE retrieval.

        Returns:
            tuple of (retrieved_chunks, hypothetical_document)
            Returning the hypothetical_document enables inspection and debugging.
        """
        hypothetical_doc = self.generate_hypothetical_document(query)
        # In production: embed hypothetical_doc with your embedding model,
        # then search using that embedding vector.
        # Here we use the mock store's text-based search.
        chunks = _vector_store.search(hypothetical_doc, k=k)
        return chunks, hypothetical_doc

    def compare_hyde_vs_direct(self, query: str, k: int = 5) -> dict:
        """
        Diagnostic comparison between HyDE and direct retrieval.

        Run this during development to measure whether HyDE helps for your corpus.
        """
        direct_chunks = _vector_store.search(query, k=k)
        hyde_chunks, hypothetical_doc = self.retrieve_with_hyde(query, k=k)

        def format_results(chunks: list[RetrievedChunk]) -> list[dict]:
            return [
                {
                    "rank": c.rank,
                    "source": c.source,
                    "score": c.score,
                    "preview": c.content[:120] + "...",
                }
                for c in chunks
            ]

        return {
            "query": query,
            "hypothetical_document_preview": hypothetical_doc[:300] + "...",
            "direct_retrieval": format_results(direct_chunks),
            "hyde_retrieval": format_results(hyde_chunks),
            "note": (
                "If HyDE results have higher scores on more relevant sources, "
                "HyDE is improving retrieval for this query type."
            ),
        }


# ---------------------------------------------
# Technique 2: Multi-Query Retriever with RRF
# ---------------------------------------------


class MultiQueryRetriever:
    """
    Multi-query retrieval with Reciprocal Rank Fusion (Cormack et al., 2009).

    Generates N semantically diverse paraphrases, retrieves in parallel for each,
    and merges ranked lists using RRF. Rewards documents that appear consistently
    near the top across multiple query formulations.

    RRF score: sum over all ranked lists R of 1/(k + rank_in_list_r)
    where k=60 is the empirically determined RRF constant.
    """

    _PARAPHRASE_PROMPT = """Generate {n} semantically diverse paraphrases of the following query.

Requirements for each paraphrase:
- Captures the exact same information need as the original
- Uses different vocabulary, synonyms, or phrasing from the other paraphrases
- Approaches the question from a slightly different angle
- Is complete and self-contained

Return ONLY a JSON array of {n} strings. No preamble, no explanation.

Query: {query}

JSON array:"""

    RRF_K = 60  # Standard RRF constant (Cormack et al., 2009)

    def generate_query_variants(self, query: str, n: int = 4) -> list[str]:
        """
        Generate n diverse paraphrases. Always includes the original query.

        Uses claude-haiku-4-5-20251001 - this is a pure formatting task.
        """
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=600,
            messages=[{
                "role": "user",
                "content": self._PARAPHRASE_PROMPT.format(query=query, n=n)
            }]
        )
        raw = message.content[0].text.strip()

        # Parse JSON with fallback
        try:
            # Greedy match: spans from the first "[" to the last "]", so a
            # paraphrase that itself contains "]" does not truncate the array.
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            parsed = json.loads(match.group() if match else raw)
            if not isinstance(parsed, list):
                raise ValueError("expected a JSON array of strings")
            variants = [str(v) for v in parsed]
        except (ValueError, AttributeError):
            # json.JSONDecodeError is a ValueError subclass.
            # Fallback: extract line-by-line
            lines = [
                l.strip().strip('"').strip("'").strip("-").strip()
                for l in raw.split('\n')
                if l.strip() and len(l.strip()) > 15
            ]
            variants = lines[:n]

        # Always keep original + up to n variants
        return [query] + variants[:n]

    def rrf_merge(
        self,
        ranked_lists: list[list[RetrievedChunk]],
        k: int = RRF_K,
    ) -> list[RetrievedChunk]:
        """
        Reciprocal Rank Fusion across multiple ranked document lists.

        Each document accumulates 1/(k + rank) from every list in which
        it appears. Consistently high-ranking documents beat single-list
        outliers, even first-place outliers.
        """
        rrf_scores: dict[str, float] = {}
        chunk_registry: dict[str, RetrievedChunk] = {}

        for ranked_list in ranked_lists:
            for chunk in ranked_list:
                doc_id = chunk.chunk_id
                # Unranked chunks (rank == 0) are treated as last in their list.
                rank = chunk.rank if chunk.rank > 0 else len(ranked_list)
                rrf_contribution = 1.0 / (k + rank)
                rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + rrf_contribution
                if doc_id not in chunk_registry:
                    chunk_registry[doc_id] = chunk

        sorted_ids = sorted(rrf_scores, key=rrf_scores.__getitem__, reverse=True)
        merged = []
        for i, doc_id in enumerate(sorted_ids):
            chunk = chunk_registry[doc_id]
            merged.append(RetrievedChunk(
                chunk_id=chunk.chunk_id,
                content=chunk.content,
                score=round(rrf_scores[doc_id], 6),
                source=chunk.source,
                rank=i + 1,
            ))
        return merged

    async def _retrieve_single_async(self, query: str, k: int) -> list[RetrievedChunk]:
        """Async wrapper around synchronous vector store search."""
        # In production: use your async vector store client directly.
        # Most modern vector DBs (Pinecone, Weaviate, Qdrant) have async clients.
        # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker
        # thread so the searches launched by asyncio.gather actually overlap.
        return await asyncio.to_thread(_vector_store.search, query, k)

    async def _retrieve_all_async(
        self,
        queries: list[str],
        k_per_query: int = 10,
    ) -> list[list[RetrievedChunk]]:
        """Execute all variant searches in parallel using asyncio.gather."""
        tasks = [self._retrieve_single_async(q, k_per_query) for q in queries]
        results = await asyncio.gather(*tasks)
        return list(results)

    def retrieve_multi(self, query: str, k_per_variant: int = 10) -> dict:
        """
        Complete multi-query retrieval pipeline.

        1. Generate diverse query variants (1 LLM call with haiku)
        2. Retrieve in parallel for all variants
        3. RRF merge into final ranked list
        """
        query_variants = self.generate_query_variants(query, n=4)
        # Note: asyncio.run raises RuntimeError if an event loop is already
        # running (for example inside a notebook or an async web handler).
        all_ranked_lists = asyncio.run(
            self._retrieve_all_async(query_variants, k_per_query=k_per_variant)
        )

        # Assign per-list ranks
        for ranked_list in all_ranked_lists:
            for i, chunk in enumerate(ranked_list):
                chunk.rank = i + 1

        merged = self.rrf_merge(all_ranked_lists)
        return {
            "original_query": query,
            "query_variants": query_variants,
            "num_variants": len(query_variants),
            "merged_results": merged,
            # Total hits across all lists, duplicates included (despite the key name).
            "unique_docs_before_merge": sum(len(r) for r in all_ranked_lists),
            "unique_docs_after_merge": len(merged),
        }
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# Technique 3: Step-Back Retriever
# ─────────────────────────────────────────────
class StepBackRetriever:
    """
    Step-back prompting for RAG (Zheng et al., Google DeepMind, 2023).
    Generates a more abstract version of the specific question.
    Retrieves on both the specific and abstract questions.
    Merges results with slight preference for specific-question results.
    """

    _STEPBACK_PROMPT = """Given a specific question, generate a more abstract "step-back" question
that captures the general principle, mechanism, or category underlying the specific question.
The step-back question should:
- Be broader and more general than the original
- Ask about the underlying principle or mechanism
- Be answerable from foundational reference material
Examples:
Specific: "What is the max connection pool size for Aurora PostgreSQL 15.3?"
Step-back: "How does PostgreSQL manage connection pooling, and what governs its limits?"
Specific: "Why does my React useState hook cause an infinite render loop?"
Step-back: "How does React decide when to re-render a component?"
Specific: "What is the metformin contraindication in CKD stage 3b?"
Step-back: "How does kidney function affect biguanide diabetes medication dosing and safety?"
Now generate the step-back question for:
Specific question: {query}
Step-back question (return ONLY the question, nothing else):"""

    def generate_stepback_query(self, specific_query: str) -> str:
        """Generate the abstract step-back version of the specific query."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": self._STEPBACK_PROMPT.format(query=specific_query)
            }]
        )
        return message.content[0].text.strip()

    def retrieve_combined(self, query: str, k: int = 5) -> dict:
        """
        Retrieve using both specific and step-back queries, merge with RRF.
        The specific query is weighted 2x by including it twice in RRF -
        ensuring that highly specific results still score above generic ones.
        """
        stepback_query = self.generate_stepback_query(query)
        specific_results = _vector_store.search(query, k=k)
        stepback_results = _vector_store.search(stepback_query, k=k)
        for i, c in enumerate(specific_results):
            c.rank = i + 1
        for i, c in enumerate(stepback_results):
            c.rank = i + 1
        rrf = MultiQueryRetriever()
        # Weight specific 2x by including its list twice
        merged = rrf.rrf_merge([specific_results, specific_results, stepback_results])
        return {
            "original_query": query,
            "stepback_query": stepback_query,
            "specific_result_count": len(specific_results),
            "stepback_result_count": len(stepback_results),
            "merged_results": merged,
        }


# ─────────────────────────────────────────────
# Technique 4: Query Decomposer
# ─────────────────────────────────────────────
@dataclass
class SubQuestion:
    """A single sub-question with its dependency structure and answer."""
    question: str
    depends_on: list[int] = field(default_factory=list)
    answer: Optional[str] = None
    context_chunks: list[RetrievedChunk] = field(default_factory=list)


class QueryDecomposer:
    """
    Decomposes complex multi-hop questions into simpler sub-questions.
    Supports two patterns:
    - Sequential: sub-question N depends on the answer to sub-question N-1
    - Parallel: all sub-questions are independent (depends_on is empty for all)
    This implementation answers sub-questions one at a time in declaration order,
    passing earlier answers forward. Independent sub-questions could be answered
    concurrently as an optimization.
    """

    _DECOMPOSE_PROMPT = """Break the following complex question into 2-4 simpler sub-questions.
Each sub-question should:
- Address exactly one aspect of the complex question
- Be answerable by a targeted document retrieval
- Build toward the complete answer when combined
For each sub-question, specify which previous sub-questions (by 0-based index) must be
answered first. Use an empty list for sub-questions that can be answered independently.
Return ONLY a JSON array. Each element: {{"question": "...", "depends_on": [list of indices]}}
Complex question: {query}
JSON array:"""

    _SUBQ_ANSWER_PROMPT = """Answer this specific question using ONLY the provided context.
Be precise. If the context is insufficient, say "CONTEXT_INSUFFICIENT: [what's missing]".
Previous answers (for context on dependent sub-questions):
{previous_answers}
Retrieved context:
{context}
Question: {question}
Answer:"""

    _SYNTHESIS_PROMPT = """Synthesize a comprehensive final answer from the research below.
Original question: {original_question}
Research findings:
{sub_qa_pairs}
Write a complete, precise answer to the original question.
Cite specific facts from the research findings.
Do not add unsupported information."""
    def decompose(self, complex_query: str) -> list[SubQuestion]:
        """Decompose a complex query into sub-questions with declared dependencies."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=700,
            messages=[{
                "role": "user",
                "content": self._DECOMPOSE_PROMPT.format(query=complex_query)
            }]
        )
        raw = message.content[0].text.strip()
        try:
            # Greedy match: a non-greedy pattern stops at the first "]", which is
            # usually the end of a nested "depends_on" list, and yields invalid JSON.
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            parsed = json.loads(match.group() if match else raw)
        except (json.JSONDecodeError, AttributeError):
            # Fallback: treat as atomic question
            return [SubQuestion(question=complex_query, depends_on=[])]
        if not isinstance(parsed, list):
            # Valid JSON but not an array: also treat as atomic question
            return [SubQuestion(question=complex_query, depends_on=[])]
        return [
            SubQuestion(
                question=item.get("question", ""),
                depends_on=item.get("depends_on", []),
            )
            for item in parsed
            if isinstance(item, dict) and item.get("question")
        ]
    def answer_subquestion(
        self,
        sub_question: SubQuestion,
        answered_before: list[SubQuestion],
        k: int = 5,
    ) -> SubQuestion:
        """Answer a single sub-question using RAG with previous answers as context."""
        chunks = _vector_store.search(sub_question.question, k=k)
        sub_question.context_chunks = chunks
        # Collect previous answers for dependent sub-questions
        prev_context_parts = []
        for idx in sub_question.depends_on:
            # Skip indices that are negative or point at questions not answered yet.
            if 0 <= idx < len(answered_before) and answered_before[idx].answer:
                prev_context_parts.append(
                    f"Sub-question {idx + 1}: {answered_before[idx].question}\n"
                    f"Answer: {answered_before[idx].answer}"
                )
        prev_context = "\n\n".join(prev_context_parts) if prev_context_parts else "None"
        context_text = "\n\n".join([
            f"[Source: {c.source}]\n{c.content}"
            for c in chunks
        ])
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": self._SUBQ_ANSWER_PROMPT.format(
                    previous_answers=prev_context,
                    context=context_text,
                    question=sub_question.question,
                )
            }]
        )
        sub_question.answer = message.content[0].text.strip()
        return sub_question

    def execute_sequential_rag(self, complex_query: str) -> list[SubQuestion]:
        """
        Execute sub-questions in declaration order, passing answers forward.
        Assumes each sub-question is listed after the ones it depends on.
        """
        sub_questions = self.decompose(complex_query)
        answered: list[SubQuestion] = []
        for sq in sub_questions:
            answered_sq = self.answer_subquestion(sq, answered)
            answered.append(answered_sq)
        return answered
    def synthesize_final_answer(
        self,
        original_query: str,
        sub_answers: list[SubQuestion],
    ) -> str:
        """Synthesize a final answer using claude-opus-4-6 from all sub-answers."""
        sub_qa_text = "\n\n".join([
            f"Sub-question {i + 1}: {sq.question}\nAnswer: {sq.answer or 'No answer retrieved.'}"
            for i, sq in enumerate(sub_answers)
        ])
        message = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=900,
            messages=[{
                "role": "user",
                "content": self._SYNTHESIS_PROMPT.format(
                    original_question=original_query,
                    sub_qa_pairs=sub_qa_text,
                )
            }]
        )
        return message.content[0].text.strip()

    def run(self, complex_query: str) -> QueryTransformResult:
        """Full decomposition pipeline: decompose → answer each → synthesize."""
        sub_answers = self.execute_sequential_rag(complex_query)
        final = self.synthesize_final_answer(complex_query, sub_answers)
        all_chunks = [chunk for sq in sub_answers for chunk in sq.context_chunks]
        return QueryTransformResult(
            original_query=complex_query,
            strategy_used="DECOMPOSITION",
            transformed_queries=[sq.question for sq in sub_answers],
            retrieved_chunks=all_chunks,
            final_answer=final,
            total_llm_calls=1 + len(sub_answers) + 1,  # decompose + N answers + synthesis
            metadata={"sub_question_count": len(sub_answers)},
        )


# ─────────────────────────────────────────────
# Technique 5: Query Router
# ─────────────────────────────────────────────
class QueryRouter:
    """
    Classifies queries by type and routes to the optimal retrieval strategy.
    Classification uses claude-haiku-4-5-20251001 with a structured prompt.
    """

    _CLASSIFY_PROMPT = """Classify this query into exactly one category:
FACTUAL - Asks for a specific fact, number, date, name, or definition.
PROCEDURAL - Asks how to perform a task or sequence of steps.
CONCEPTUAL - Asks for explanation of a concept, mechanism, or principle.
COMPARATIVE - Asks to compare or contrast two or more options.
TROUBLESHOOTING - Asks why something is failing or broken.
Return ONLY the category name. No explanation.
Query: {query}
Category:"""

    _ROUTING_TABLE = {
        QueryType.FACTUAL: "DIRECT",
        QueryType.PROCEDURAL: "HYDE",
        QueryType.CONCEPTUAL: "MULTI_QUERY",
        QueryType.COMPARATIVE: "DECOMPOSITION",
        QueryType.TROUBLESHOOTING: "STEP_BACK",
    }

    def classify_query(self, query: str) -> QueryType:
        """Classify query type with claude-haiku-4-5-20251001."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=20,
            messages=[{
                "role": "user",
                "content": self._CLASSIFY_PROMPT.format(query=query)
            }]
        )
        raw = message.content[0].text.strip().upper()
        try:
            return QueryType(raw)
        except ValueError:
            return QueryType.CONCEPTUAL  # Safe default

    def route(self, query: str) -> tuple[str, QueryType]:
        """Return (strategy_name, query_type) for the given query."""
        query_type = self.classify_query(query)
        strategy = self._ROUTING_TABLE.get(query_type, "HYDE")
        return strategy, query_type


# ─────────────────────────────────────────────
# Unified Transformation Pipeline
# ─────────────────────────────────────────────
class TransformationPipeline:
    """
    Production query transformation pipeline.
    Automatically routes queries to the appropriate transformation strategy
    and generates the final answer using claude-opus-4-6.
    Architecture:
    1. Router classifies query type (1 haiku call)
    2. Appropriate transformer preprocesses query (0-1 haiku calls)
    3. Vector store retrieves relevant chunks (1 parallel search)
    4. claude-opus-4-6 generates the final answer (1 opus call)
    """

    _ANSWER_PROMPT = """You are a precise technical assistant. Answer the question using ONLY
the provided retrieved context. Be specific and cite details from the context.
If the context is insufficient to fully answer, say what information is missing.
Retrieved context:
{context}
Question: {question}
Answer:"""

    def __init__(self):
        self.hyde = HyDERetriever()
        self.multi_query = MultiQueryRetriever()
        self.step_back = StepBackRetriever()
        self.decomposer = QueryDecomposer()
        self.router = QueryRouter()

    def generate_answer(self, query: str, chunks: list[RetrievedChunk]) -> str:
        """Generate final answer using claude-opus-4-6 with top retrieved chunks."""
        context_text = "\n\n---\n\n".join([
            f"[Source: {c.source} | Rank: {c.rank} | Score: {c.score:.4f}]\n{c.content}"
            for c in chunks[:5]
        ])
        message = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": self._ANSWER_PROMPT.format(
                    context=context_text,
                    question=query,
                )
            }]
        )
        return message.content[0].text.strip()
    def execute(
        self,
        query: str,
        force_strategy: Optional[str] = None,
    ) -> QueryTransformResult:
        """
        Execute the full transformation pipeline for a query.
        Args:
            query: Raw user query
            force_strategy: Override auto-routing. Options: DIRECT, HYDE,
                MULTI_QUERY, STEP_BACK, DECOMPOSITION
        Returns:
            QueryTransformResult with answer, chunks, and metadata
        """
        if force_strategy:
            strategy = force_strategy
            query_type = QueryType.CONCEPTUAL  # placeholder when forced
            router_calls = 0  # a forced strategy skips the router call
        else:
            strategy, query_type = self.router.route(query)
            router_calls = 1
        if strategy == "HYDE":
            chunks, hypothesis = self.hyde.retrieve_with_hyde(query, k=6)
            transformed = [hypothesis]
            llm_calls = router_calls + 2  # router (if used) + hypothesis + answer
        elif strategy == "MULTI_QUERY":
            result = self.multi_query.retrieve_multi(query, k_per_variant=8)
            chunks = result["merged_results"]
            transformed = result["query_variants"]
            llm_calls = router_calls + 2  # router (if used) + variants + answer
        elif strategy == "STEP_BACK":
            result = self.step_back.retrieve_combined(query, k=6)
            chunks = result["merged_results"]
            transformed = [result["stepback_query"]]
            llm_calls = router_calls + 2  # router (if used) + stepback + answer
        elif strategy == "DECOMPOSITION":
            # The decomposer builds its own result; its call count excludes the router.
            return self.decomposer.run(query)
        else:  # DIRECT
            chunks = _vector_store.search(query, k=6)
            transformed = [query]
            llm_calls = router_calls + 1  # router (if used) + answer
        answer = self.generate_answer(query, chunks)
        return QueryTransformResult(
            original_query=query,
            strategy_used=strategy,
            transformed_queries=transformed,
            retrieved_chunks=chunks,
            final_answer=answer,
            total_llm_calls=llm_calls,
            metadata={
                "query_type": query_type.value,
                "chunks_retrieved": len(chunks),
            },
        )


# ─────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────
def run_demo():
    """Smoke test across query types."""
    pipeline = TransformationPipeline()
    test_cases = [
        # Should route → STEP_BACK (troubleshooting)
        ("I'm getting 504s when the load is high on my API calls", None),
        # Should route → MULTI_QUERY (conceptual)
        ("How does connection pooling work in database-backed services?", None),
        # Should route → FACTUAL → DIRECT
        ("What is the default request timeout in seconds?", None),
        # Force HYDE for demonstration
        ("Why does my API integration keep failing under load?", "HYDE"),
    ]
    for query, force in test_cases:
        print(f"\n{'=' * 65}")
        result = pipeline.execute(query, force_strategy=force)
        print(f"Query: {query[:70]}")
        print(f"Strategy: {result.strategy_used}")
        print(f"Transforms: {[t[:60] for t in result.transformed_queries]}")
        print(f"LLM calls: {result.total_llm_calls}")
        print(f"Top source: {result.retrieved_chunks[0].source if result.retrieved_chunks else 'none'}")
        print(f"Answer: {result.final_answer[:200]}...")


if __name__ == "__main__":
    run_demo()
Architecture: Query Routing Overview
HyDE Mechanism in Detail
Multi-Query RRF Fusion
Production Engineering Notes
Latency Budget by Strategy
| Strategy | LLM Calls (excluding router) | Vector Searches | Estimated Latency |
|---|---|---|---|
| DIRECT | 1 (answer) | 1 sequential | ~700ms |
| HYDE | 2 (hypothesis + answer) | 1 sequential | ~1,100ms |
| MULTI_QUERY | 2 (variants + answer) | N parallel | ~1,300ms |
| STEP_BACK | 2 (stepback + answer) | 2 parallel | ~1,200ms |
| DECOMPOSITION | 1 + N + 1 | N sequential | ~3,000-8,000ms |
Decomposition is not suitable for real-time interactive applications. Use it for asynchronous research workflows, document analysis pipelines, or queries where the user explicitly requests comprehensive research.
Caching for Cost and Latency Reduction
Cache at three levels:
- Route cache: The same query should receive the same classification. Cache router output by query hash with a 24-hour TTL. This saves one haiku call per repeated query.
- Transform cache: A repeated query can reuse its earlier hypothetical document or set of variants. Cache transformations by query hash. HyDE hypotheticals can be cached indefinitely for stable corpora.
- Retrieval cache: If the same query is asked frequently, cache the full retrieval result. TTL depends on document update frequency - use short TTLs (minutes) for frequently updated corpora, longer TTLs (hours) for stable documentation.
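The three cache levels share one mechanism: a lookup keyed by a hash of the query, with an expiry time. The sketch below is a minimal in-memory version under stated assumptions. `TTLCache`, `cached_call`, and `route_cache` are hypothetical names, not part of the pipeline above, and a production system would more likely use a shared store such as Redis.

```python
import hashlib
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache keyed by a hash of the normalized query (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, Any]] = {}

    @staticmethod
    def make_key(query: str) -> str:
        # Normalize case and whitespace so trivially different inputs share an entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[Any]:
        key = self.make_key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def set(self, query: str, value: Any) -> None:
        self._store[self.make_key(query)] = (self._clock() + self.ttl_seconds, value)


def cached_call(cache: TTLCache, query: str, compute: Callable[[str], Any]) -> Any:
    """Return the cached value for the query, computing and storing it on a miss."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    value = compute(query)
    cache.set(query, value)
    return value


# One cache per level, each with its own TTL (24 hours for routing, as suggested above).
route_cache = TTLCache(ttl_seconds=24 * 60 * 60)
```

A call such as `cached_call(route_cache, query, router.route)` would then skip the haiku call for a repeated query. Note that this sketch treats a stored `None` as a miss, which is acceptable for routing and transformation outputs that are never `None`.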
Cost Model
Strategy cost per query (rough estimates at March 2026 pricing):
DIRECT:
  haiku router: ~50 tokens = $0.000006
  opus-4-6 answer: ~500 tokens = $0.0075
  Total: ~$0.00751
HYDE:
  haiku router: ~50 tokens = $0.000006
  haiku hypothesis: ~300 tokens = $0.000038
  opus-4-6 answer: ~500 tokens = $0.0075
  Total: ~$0.00754 (+0.5% vs. DIRECT)
MULTI_QUERY:
  haiku router: ~50 tokens = $0.000006
  haiku variants: ~400 tokens = $0.000050
  opus-4-6 answer: ~500 tokens = $0.0075
  Total: ~$0.00756 (+0.7% vs. DIRECT)
DECOMPOSITION:
  haiku router: ~50 tokens = $0.000006
  haiku decompose: ~500 tokens = $0.000063
  haiku x3 sub-answers: ~900 tokens = $0.000113
  opus-4-6 synthesis: ~800 tokens = $0.0120
  Total: ~$0.0122 (+62% vs. DIRECT)
The additional cost of most transformation strategies is negligible. Only decomposition adds meaningful cost - and only because the synthesis requires a longer opus-4-6 call.
:::tip Default Strategy Recommendation
For new RAG systems with unknown query distributions, default to HyDE for all queries. It adds negligible cost, is broadly effective across query types, and is easy to implement. Add routing later once you have enough production data to know which query types dominate your distribution.
:::
:::warning HyDE Degrades Factual Retrieval Over Proprietary Data
If your corpus contains precise proprietary data - internal financial figures, specific model outputs, private research findings - HyDE can degrade retrieval. The hypothetical document contains plausible but incorrect numbers, pulling the search vector toward wrong facts.
For factual queries over proprietary corpora, disable HyDE and use direct BM25 or exact-match retrieval. Implement a factual query detector using simple heuristics (queries containing numbers, named entities, "what is the [specific thing]" patterns) to bypass HyDE.
:::
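The heuristic detector mentioned in the warning above could be sketched as follows. This is an illustration under stated assumptions, not a tested classifier. The regular expressions and the function names `looks_factual` and `choose_strategy` are hypothetical. The sketch treats a "what is the ..." style opener as sufficient on its own, and otherwise requires both a number and a capitalized mid-sentence term, so that symptom reports like "I'm getting 504s" stay on the HyDE path.

```python
import re

# Hypothetical patterns; tune them against your own query logs.
_FACTUAL_PREFIX = re.compile(
    r"^\s*(what\s+(is|was|are|were)\s+the|how\s+(many|much)|when\s+(did|was|is)|who\s+(is|was))\b",
    re.IGNORECASE,
)
_HAS_NUMBER = re.compile(r"\d")
# A capitalized word or acronym after whitespace is a rough named-entity signal.
_MID_SENTENCE_CAPITAL = re.compile(r"(?<=\s)[A-Z][A-Za-z0-9]+")


def looks_factual(query: str) -> bool:
    """Return True when the query looks like a lookup of one specific fact."""
    if _FACTUAL_PREFIX.search(query):
        return True
    has_number = bool(_HAS_NUMBER.search(query))
    has_entity = bool(_MID_SENTENCE_CAPITAL.search(query))
    return has_number and has_entity


def choose_strategy(query: str) -> str:
    """Bypass HyDE for factual-looking queries; strategy names match force_strategy."""
    return "DIRECT" if looks_factual(query) else "HYDE"
```

The result can be passed as `force_strategy` to the pipeline. Expect false positives (for example, "What is the problem?" matches the opener), so measure the detector on real queries before relying on it.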
:::danger Decomposition Dependency Cycle Detection
Query decomposition can produce circular dependencies: Sub-Q2 depends on Sub-Q3 which depends on Sub-Q2. Before executing, validate the dependency graph:
- Build a directed graph of sub-question dependencies
- Run topological sort - if it fails (cycle detected), fall back to sequential execution in declaration order
- Cap maximum decomposition depth at 4 levels
- Cap maximum sub-questions at 5
Without these guards, malformed decompositions can produce infinite loops or deeply nested sequential retrieval chains.
:::
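The validation steps in the warning above can be sketched with the standard library's `graphlib` module. This is a minimal sketch: `execution_order` and `MAX_SUB_QUESTIONS` are hypothetical names, the input is the list of `depends_on` lists in declaration order, and the depth cap is left out for brevity.

```python
from graphlib import CycleError, TopologicalSorter

MAX_SUB_QUESTIONS = 5  # cap on the number of sub-questions, as listed above


def execution_order(depends_on: list[list[int]]) -> list[int]:
    """
    Return the order in which sub-questions should be answered.

    depends_on[i] lists the 0-based indices that sub-question i depends on.
    Falls back to declaration order when the dependency graph contains a cycle.
    """
    depends_on = depends_on[:MAX_SUB_QUESTIONS]
    n = len(depends_on)
    graph = {
        # Drop out-of-range and self references instead of trusting model output.
        i: {d for d in deps if 0 <= d < n and d != i}
        for i, deps in enumerate(depends_on)
    }
    try:
        # static_order yields every node after all of its dependencies.
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return list(range(n))
```

For example, `execution_order([sq.depends_on for sq in sub_questions])` gives the indices to iterate over before answering. A cyclic input such as `[[1], [0]]` returns `[0, 1]`, the declaration order.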
Interview Q&A
Q1: What is HyDE and what problem does it solve geometrically?
HyDE (Hypothetical Document Embeddings, Gao et al. 2022) solves the geometric mismatch between questions and answers in dense retrieval. When you embed a user's question, the resulting vector sits in the "question region" of the embedding space - near other questions and search queries. Real answer documents sit in a different region - near other declarative statements, formal documentation, and explanatory text. This geometric gap means that naive dense retrieval finds things that look like the question, not things that answer it.
HyDE bridges this gap by generating a hypothetical document that would answer the question - using an LLM with a documentation-style prompt. This generated text, even if factually wrong in its specifics, has the stylistic register and vocabulary of real answer documents. Its embedding lands in the answer region of semantic space, close to real documents that correctly answer the question.
You would not use HyDE for (1) factual queries over proprietary data where the LLM cannot generate an accurate hypothesis and a wrong hypothesis actively misguides retrieval, (2) highly ambiguous queries where collapsing ambiguity into a single hypothesis may pick the wrong direction, and (3) extremely latency-sensitive paths where even one additional haiku call exceeds the budget.
Q2: Explain Reciprocal Rank Fusion. Why is the constant k=60, and what happens at k=0?
RRF (Cormack et al., 2009) merges multiple ranked document lists. For each document $d$, it sums $\frac{1}{k + \text{rank}_i(d)}$ across all ranked lists $i$ in which $d$ appears: $\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$. Documents that appear consistently near the top across many lists accumulate high scores. Documents appearing in only one list score lower.
The constant $k = 60$ was empirically determined to work well across diverse retrieval tasks. Its role is rank damping - controlling how steeply the score drops as rank increases. At $k = 0$, rank 1 scores 1.0 and rank 2 scores 0.5: a 2x difference. This makes the merge hypersensitive to small rank differences. At $k = 60$, rank 1 scores $1/61 \approx 0.0164$ and rank 2 scores $1/62 \approx 0.0161$: a relative difference of about 1.6%. The merge becomes stable and robust - slight rank differences between lists matter very little, but consistent high placement across many lists matters a lot.
For production systems with uniform list sizes and well-tuned retrievers, $k = 60$ works reliably. If you have very high-quality ranked lists where rank 1 genuinely means something special, reducing $k$ slightly (to 30-45) gives more weight to top positions.
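The effect of the constant is easiest to see on a small example. The sketch below scores plain document IDs with the standard RRF sum of 1 / (k + rank). `rrf_scores` and `rrf_ranking` are hypothetical helpers written for this illustration, not the `rrf_merge` method from the implementation above.

```python
from collections import defaultdict


def rrf_scores(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list in which a document ID appears (ranks start at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def rrf_ranking(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Return document IDs sorted by descending RRF score."""
    scores = rrf_scores(ranked_lists, k)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# "B" is ranked third in both lists; "A" and "Y" are ranked first in one list each.
lists = [["A", "X", "B"], ["Y", "Z", "B"]]
top_with_k0 = rrf_ranking(lists, k=0)[0]    # a single rank-1 hit (1.0) beats 1/3 + 1/3
top_with_k60 = rrf_ranking(lists, k=60)[0]  # 2/63 for "B" beats 1/61 for "A"
```

With `k=0` the top result is a document that only one list retrieved. With `k=60` the document that both lists agree on moves to the top, which is the consensus behavior the damping constant is meant to produce.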
Q3: Describe a production scenario where query decomposition significantly outperforms single-shot RAG.
A regulatory compliance system handles the question: "What are the overlapping disclosure obligations between the SEC's 2023 climate rules and the EU's CSRD for companies listed in both markets?" Single-shot RAG retrieves documents about either SEC climate rules or CSRD, but rarely both, and cannot perform the intersection analysis.
Decomposition would: (1) retrieve SEC 2023 climate disclosure requirements, (2) retrieve EU CSRD disclosure requirements, (3) identify the reporting scope differences (US public companies vs. large EU companies), (4) analyze the overlapping content areas where both frameworks require similar disclosures, and (5) synthesize the implications for dual-listed companies. Each step retrieves precisely targeted information. The synthesis has all the raw material needed for a complete, accurate answer.
Single-shot RAG on this question produces a generic overview of both regulatory frameworks that misses the specific intersection analysis entirely.
Q4: How would you measure whether query transformation is actually improving your RAG system in production?
You need both offline evaluation and online measurement. For offline evaluation, build a test set of (query, relevant document IDs, ground truth answer) triples. For each test query, measure recall@K - what fraction of relevant document IDs appear in the top K retrieved results - with and without transformation. Run this across a representative sample of your query distribution, not cherry-picked examples.
For online measurement, implement A/B testing: route a fraction of production queries to the transformed retriever and the rest to the baseline. Measure downstream outcomes: answer satisfaction ratings (thumbs up/down), user follow-up question rate (a proxy for answer completeness), task completion rate (did the user accomplish what they came to do?), and session abandonment rate.
Transformation always adds latency. Report the latency cost alongside the quality gain. The decision to ship a transformation strategy must weigh both. A 5% quality improvement that doubles latency from 500ms to 1,000ms may not be worth it for a real-time customer support interface, but is absolutely worth it for an async research assistant.
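The offline half of this evaluation reduces to a short function. The sketch below assumes each test case is a `(query, relevant_ids)` pair and that `retrieve` is any callable returning a ranked list of document IDs. Both names, and the toy data in the usage note, are illustrative.

```python
from typing import Callable, Iterable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant document IDs that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    test_set: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k of one retriever over a labeled test set."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in test_set]
    if not scores:
        raise ValueError("test_set must not be empty")
    return sum(scores) / len(scores)
```

Run `mean_recall_at_k` once with the baseline retriever and once with the transformed retriever on the same test set, then compare the two numbers alongside the added latency.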
Q5: A user asks "why is everything broken?" How should your pipeline handle this?
This query is maximally vague and maps to QueryType.TROUBLESHOOTING, which routes to STEP_BACK. But step-back on "everything is broken" cannot generate a useful step-back question - the original is too underspecified to abstract from.
The correct production handling depends on the interface:
Interactive chat: Detect underspecification (query length under 6 words, no domain-specific terms) and request clarification before retrieval: "To help diagnose the issue, which component or system is affected - the API, the dashboard, the authentication system, or something else?"
Automated pipeline: Use multi-query with a broad set of variants covering the most common failure categories for your system. Retrieve across all categories, generate an answer that acknowledges ambiguity explicitly: "Without knowing which component is affected, here are the most common causes of system failures in this platform: [list with links]." This is worse than a targeted answer but better than a confidently wrong answer.
Proactive detection: Train a query quality classifier to detect underspecified queries before routing. Queries with zero domain-specific terms, no verbs describing specific behaviors, or under a token length threshold should be flagged for clarification rather than attempted retrieval.
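Before training a classifier, the simple checks described above (a word-count threshold and the absence of domain terms) can serve as a first filter. The sketch below is an assumption-laden illustration: `DOMAIN_TERMS` is a made-up vocabulary and `is_underspecified` is a hypothetical helper. In practice the vocabulary would come from your documentation's glossary or index.

```python
import re

# Hypothetical vocabulary; build the real one from your own documentation.
DOMAIN_TERMS = frozenset(
    {"api", "dashboard", "authentication", "timeout", "webhook", "database"}
)
MIN_WORDS = 6  # queries shorter than this are candidates for clarification


def is_underspecified(query: str, domain_terms: frozenset = DOMAIN_TERMS) -> bool:
    """Flag short queries that mention no known domain term."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    has_domain_term = any(word in domain_terms for word in words)
    return len(words) < MIN_WORDS and not has_domain_term
```

A query such as "why is everything broken?" is flagged, while a short but specific query such as "API timeout?" passes through to routing. Exact-word matching misses plurals and variants, so a real vocabulary check would likely add stemming.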
Q6: What is step-back prompting and how does it differ from query expansion?
Step-back prompting generates a more abstract version of a specific query and retrieves on both. Query expansion rewrites the same query with additional or alternative vocabulary.
The two move in different directions. Query expansion moves sideways: it adds synonyms, related terms, and alternative phrasings at the same abstraction level: "connection pool" → "connection pool OR database connections OR connection limit OR pool size." Step-back moves up the abstraction hierarchy: "Why does Aurora PostgreSQL time out under load?" → "How does PostgreSQL manage connections under high concurrency?"
The practical difference: query expansion helps when the vocabulary gap is lexical (synonyms, abbreviations, spelling variants). Step-back helps when the question requires background context to answer correctly - the answer to the general principle illuminates the specific case.
Both are useful; they address different failure modes. A complete transformation pipeline may apply query expansion first (to handle lexical variants) and step-back retrieval on top (to add foundational context).
Summary
Query transformation is non-optional for production RAG systems serving diverse user populations. The vocabulary mismatch between how users ask questions and how documents explain answers is structural and inevitable. These five techniques address distinct failure modes:
| Technique | Failure Mode Addressed | Cost | Best For |
|---|---|---|---|
| HyDE | Register gap: question embeds far from answer | +1 haiku call | Technical docs, formal corpora |
| Multi-query + RRF | Narrow coverage from single query | +1 haiku call | Broad questions, diverse terminology |
| Step-back | Missing foundational context | +1 haiku call | Troubleshooting, scientific questions |
| Decomposition | Multi-hop: one retrieval can't answer | +N haiku calls | Complex comparative, multi-facet |
| Routing | Wrong strategy for query type | +1 haiku call | Diverse query distributions |
Start with HyDE as your default. It is cheap, broadly effective, and easy to validate. Add routing once you have production query data that reveals which types dominate. Add decomposition for complex use cases where the latency cost is acceptable. Measure everything - query transformation without measurement is engineering by guess.
