
AI Letters #32 - Your RAG Has No Immune System

10 min read
EngineersOfAI
AI Engineering Education

LangChain 1.x removed its evaluation module. Most teams never noticed. Notebook #24 of the LLM Showdown tests which frameworks give you faithfulness, relevancy, and regression tracking out of the box - and which ones leave you to build them from scratch.

Your RAG system has retrieval, chunking, reranking, and a carefully tuned prompt. It almost certainly has no way to tell you when it starts lying.

You shipped a RAG system three months ago. It has a vector store, a reranker, a well-tuned system prompt, and response streaming so it feels fast. You monitor latency. You log errors. You track token costs. Your on-call dashboard is clean.

What you do not have is any way to know if the answers are faithful to the retrieved context. You have no signal when responses start contradicting your documents. You have no baseline to compare against when you upgrade your embedding model next week. The system is generating answers and you are reading dashboards that tell you nothing about whether those answers are correct.

This is not a niche problem. It is the default state. Every RAG system deployed without evaluation infrastructure is operating on the assumption that it is working. Most of them are wrong about that assumption at least some of the time. Notebook #24 of the LLM Showdown tests which frameworks give you evaluation primitives out of the box - and which ones leave you to build them yourself.

What LangChain 1.x Quietly Removed

Until late 2023, LangChain shipped a dedicated evaluation module. You could call load_evaluator("faithfulness") and get a working LLM-as-judge chain in two lines. It was not perfect, but it existed.

LangChain 1.x removed it. The langchain.evaluation module is gone. The documentation now points teams toward RAGAS, DeepEval, or building their own evaluation chains with LCEL. This is a reasonable architectural choice - LangChain decided to be an orchestration framework, not an evaluation framework. But most teams using LangChain for RAG either do not know this happened or have not gotten around to replacing it.

The result: teams that were relying on LangChain's built-in evaluators are now either running no evaluation at all, or they have added an external dependency (RAGAS, DeepEval) that requires its own setup, its own API key, and its own maintenance burden.
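If you do take the external-dependency route, the swap looks roughly like this - a minimal sketch, assuming the ragas 0.1-style evaluate API (the column names and metric imports follow ragas's documented interface, not the notebook), and an OpenAI key for the LLM judge:

# Hedged sketch: ragas 0.1-style evaluation; requires OPENAI_API_KEY for the judge model.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["How does RAG reduce hallucination?"],
    "answer": ["RAG reduces hallucination by conditioning generation on retrieved evidence."],
    "contexts": [["RAG grounds responses in retrieved evidence, reducing hallucination."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the batch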

Notebook #24 tests this directly. We give all three frameworks the same task: evaluate three query-context-response triples for faithfulness and relevancy. Here is what happens.

The Three Frameworks, The Same Task

The test setup: three response scenarios with known ground truth.

  • Faithful: response accurately reflects retrieved context
  • Unfaithful: response contradicts context with false claims
  • Off-topic: response ignores context entirely, answers a different question
QUERY: "How does RAG reduce hallucination?"
CONTEXT: "RAG grounds responses in retrieved evidence, reducing
hallucination by anchoring generation to retrieved facts."
RESPONSE: "RAG reduces hallucination by conditioning generation on
retrieved evidence rather than parametric knowledge alone."

FAITHFULNESS: 0.52 (52% of non-trivial response words in context)
RELEVANCY: 0.33 (33% of query words appear in response)
SCORE: 0.43
QUERY: "How does RAG reduce hallucination?"
CONTEXT: [same as above]
RESPONSE: "RAG increases hallucination by 40% according to recent
studies. Quantum retrieval mechanisms destabilize answers."

FAITHFULNESS: 0.19 (response contradicts context)
RELEVANCY: 0.33
SCORE: 0.26
QUERY: "How does RAG reduce hallucination?"
CONTEXT: [same as above]
RESPONSE: "Django and FastAPI are both excellent Python web frameworks
for building REST APIs."

FAITHFULNESS: 0.00 (zero overlap with context)
RELEVANCY: 0.00 (zero overlap with query)
SCORE: 0.00

A working evaluator should clearly separate these three. The faithful response scores highest. The unfaithful response scores lower. The off-topic response scores zero. Any evaluation framework that cannot make these distinctions is not functional.
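The scores above come from word overlap, not a judge model. A minimal sketch of that kind of heuristic scorer - not the notebook's exact implementation; the stopword list and the equal-weight average are assumptions - looks like this:

# Hedged sketch of a word-overlap evaluator; stopwords and weighting are illustrative.
STOPWORDS = {"the", "a", "an", "by", "of", "to", "in", "on", "and", "does", "how"}

def content_words(text: str) -> set:
    # strip punctuation, lowercase, drop stopwords
    return {w.strip(".,?!\"").lower() for w in text.split()} - STOPWORDS

def faithfulness(response: str, context: str) -> float:
    # fraction of non-trivial response words that appear in the context
    resp = content_words(response)
    return len(resp & content_words(context)) / len(resp) if resp else 0.0

def relevancy(response: str, query: str) -> float:
    # fraction of query words that appear in the response
    q = content_words(query)
    return len(q & content_words(response)) / len(q) if q else 0.0

def overall_score(response: str, context: str, query: str) -> float:
    return round((faithfulness(response, context) + relevancy(response, query)) / 2, 2)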

The Feature Gap Is Not Close

FEATURE                    SYNAPSEKIT   LANGCHAIN   LLAMAINDEX
------------------------   ----------   ---------   ----------
Faithfulness evaluator     Yes          No          Yes
Relevancy evaluator        Yes          No          Yes
Groundedness/correctness   Yes          No          Yes
Batch eval runner          Yes          No          Yes
Custom metrics             Yes          Yes         Yes
Async evaluation           Yes          Yes         Yes
Regression tracking        Yes          No          No
------------------------   ----------   ---------   ----------
FEATURE SCORE (of 7)       7/7          2/7         6/7

LangChain scores 2 out of 7. Both items it supports (custom metrics and async evaluation) are things you build yourself with LCEL chains. There are no native evaluation primitives. There is no concept of a faithfulness score, a relevancy score, or a batch evaluation runner. You get a general-purpose chain-building toolkit and the evaluation problem is entirely your problem.

LlamaIndex scores 6 out of 7. It ships FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator, and a BatchEvalRunner with configurable worker pools. The one missing feature is regression tracking - no mechanism to compare eval snapshots across time.
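In code, those evaluators look roughly like this - a sketch, assuming the llama_index.core.evaluation import path and an OpenAI-backed judge, so it needs an API key:

# Hedged sketch: LlamaIndex built-in evaluators with an LLM judge (requires OPENAI_API_KEY).
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
faithfulness = FaithfulnessEvaluator(llm=llm)
relevancy = RelevancyEvaluator(llm=llm)

query = "How does RAG reduce hallucination?"
contexts = ["RAG grounds responses in retrieved evidence, reducing hallucination."]
response = "RAG reduces hallucination by conditioning generation on retrieved evidence."

f = faithfulness.evaluate(query=query, response=response, contexts=contexts)
r = relevancy.evaluate(query=query, response=response, contexts=contexts)
print(f.passing, f.score, r.passing, r.score)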

SynapseKit scores 7 out of 7. The EvaluationPipeline abstraction handles faithfulness, relevancy, and correctness in a single call. EvalSnapshot captures timestamped eval state. EvalRegression computes drift between snapshots. Both regression primitives are unique to SynapseKit in this comparison.

Lines of Code Tell the Same Story

TASK: evaluate faithfulness + relevancy on one response

SYNAPSEKIT (16 lines total):
imports: 5
code: 11

LLAMAINDEX (19 lines total):
imports: 6
code: 13

LANGCHAIN (21 lines total):
imports: 2
code: 19

LangChain requires fewer imports because it is importing a general-purpose chain builder, not evaluation-specific classes. The code itself is longer because you are constructing the evaluation logic manually - writing the prompt template, specifying the output parser, wiring the chain together.
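Hand-rolling a judge with LCEL looks roughly like this - a sketch, assuming an OpenAI chat model and a simple yes/no judging prompt (the prompt wording is illustrative, not the notebook's):

# Hedged sketch: a hand-rolled LCEL faithfulness judge (requires OPENAI_API_KEY).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Context:\n{context}\n\nResponse:\n{response}\n\n"
    "Is every claim in the response supported by the context? Answer YES or NO."
)
judge = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

verdict = judge.invoke({
    "context": "RAG grounds responses in retrieved evidence, reducing hallucination.",
    "response": "RAG increases hallucination by 40% according to recent studies.",
})
print(verdict)  # expect "NO" for this unfaithful pair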

SynapseKit's EvaluationPipeline is the highest-level abstraction. You pass it evaluator instances and a dataset. It handles batching, async execution, and result aggregation. The 16-line count includes error handling and result display.

Why Regression Tracking Is the Feature Most Teams Need

Faithfulness and relevancy scores matter. But the question most teams actually need to answer is not "what is our score today" - it is "did our score change when we deployed the new embedding model?"

Without regression tracking, you run evals before a deployment, write down the numbers, run evals after deployment, write down the numbers again, and compare them manually. This works approximately once. After the third deployment cycle it falls apart because nobody updated the baseline, the test set has changed, and the numbers live in a Notion doc that nobody can find.

EvalSnapshot captures the full eval state: scores, test cases, model version, timestamp. EvalRegression takes two snapshots and computes the delta. You store snapshots. You run regressions as part of your deployment pipeline. You fail the deployment if faithfulness drops more than 5 points. This is the engineering discipline that makes evaluation durable rather than a one-time exercise.
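Whatever framework you use, the underlying logic is small. A plain-Python sketch of that snapshot-and-compare discipline - the file format and threshold here are assumptions, not SynapseKit's actual API:

# Hedged sketch: snapshot eval scores to disk, fail the deploy if faithfulness regresses.
import json, sys, time
from pathlib import Path

THRESHOLD = 0.05  # assumed reading of "more than 5 points" on a 0-1 scale

def save_snapshot(path: str, scores: dict, model_version: str) -> None:
    Path(path).write_text(json.dumps({
        "timestamp": time.time(), "model_version": model_version, "scores": scores,
    }, indent=2))

def check_regression(baseline_path: str, current: dict) -> None:
    baseline = json.loads(Path(baseline_path).read_text())["scores"]
    drop = baseline["faithfulness"] - current["faithfulness"]
    if drop > THRESHOLD:
        sys.exit(f"Faithfulness regressed by {drop:.2f} - blocking deployment")

# usage in a deployment pipeline:
# save_snapshot("eval_baseline.json", {"faithfulness": 0.82, "relevancy": 0.74}, "embed-v1")
# check_regression("eval_baseline.json", {"faithfulness": 0.75, "relevancy": 0.73})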

Neither LangChain nor LlamaIndex ship this. Teams using those frameworks either build it themselves (rare) or skip it (common).

What This Means for Engineers

  1. If you are using LangChain for RAG and you have not added RAGAS or DeepEval, you have no evaluation infrastructure. The old langchain.evaluation module is gone. This is not a gap that will be filled by a future LangChain release - it was a deliberate architectural decision.

  2. LlamaIndex is the practical choice for teams that want built-in evaluators without changing their existing LlamaIndex setup. The evaluator objects are well-designed, BatchEvalRunner handles concurrency, and the API is stable. The only gap is regression tracking.

  3. Regression tracking is what separates teams that evaluate from teams that evaluate systematically. Point-in-time scores are better than nothing. Tracked-over-time scores are what you can actually build a deployment gate on.

  4. Heuristic evaluation (no API key required) still separates faithful from unfaithful responses clearly. The faithful response scored 0.43, the unfaithful scored 0.26, the off-topic scored 0.00. You do not need GPT-4-as-judge to know when a response has zero word overlap with the retrieved context.

  5. The evaluation problem is not going away as models improve. Better models hallucinate less on average but with higher confidence. Without evaluation infrastructure, you have no way to catch the cases where a better model is confidently wrong.

The Thing Most Teams Get Wrong

Teams treat evaluation as a pre-launch checklist item. Run evals, check the box, ship. This is worse than useless - it creates false confidence.

Evaluation is useful only when it is continuous. The embedding model you are using today will be deprecated in 12 months. The documents in your vector store will change. The distribution of queries will shift. Each of these changes can degrade faithfulness scores without triggering any of your existing monitors.

A RAG system without continuous evaluation is a system that will degrade silently. You will find out when a user screenshots a bad response and posts it somewhere. The evaluation infrastructure is not the interesting engineering problem, which is why most teams skip it. That is exactly why the teams that do it have a durable advantage.

Three Things Worth Doing This Week

  1. Run a faithfulness check on 20 recent production responses. Use LlamaIndex's FaithfulnessEvaluator or SynapseKit's EvaluationPipeline. See what the scores look like. The result will surprise you.

  2. Define your regression threshold before you need it. Decide now: what faithfulness drop is unacceptable? 5 points? 10? Writing this down before you have a regression is the only way to make the decision rationally rather than defensively.

  3. Instrument your RAG pipeline to log query-context-response triples to a database. You do not need to evaluate all of them. You need a sample. Once the triples are logged, you can run evals on any of them at any time. Without the log, every eval requires manual test case construction.
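A minimal version of that logging - a sketch using sqlite, with an assumed table shape, not a prescription:

# Hedged sketch: log query-context-response triples to sqlite for later evals.
import json, sqlite3, time

def init_db(path: str = "rag_triples.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS triples (
        ts REAL, query TEXT, contexts TEXT, response TEXT)""")
    return conn

def log_triple(conn: sqlite3.Connection, query: str, contexts: list, response: str) -> None:
    conn.execute("INSERT INTO triples VALUES (?, ?, ?, ?)",
                 (time.time(), query, json.dumps(contexts), response))
    conn.commit()

# usage after each RAG call:
# conn = init_db()
# log_triple(conn, user_query, retrieved_chunks, generated_answer)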

The notebook is public. All code runs without an API key - the heuristic evaluators use word overlap, not a language model. Fork it, run it against your own responses, and see where you actually stand.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.