Notebook #24 - LLM Evaluation Framework Showdown. 3 test cases, 3 frameworks, heuristic evaluation (no API key required).
Lines of Code Comparison
Faithfulness + relevancy evaluation on 3 test cases
Feature Coverage (of 7)
Faithfulness, relevancy, groundedness, batch runner, custom metrics, async, regression tracking
Score Separation Across Test Cases
A working evaluator must clearly separate the faithful case (high score) from the unfaithful case (medium) and the off-topic case (zero).
Heuristic Evaluation Works
Word overlap alone separates the three cases: 0.43 vs 0.26 vs 0.00. You do not need GPT-4-as-judge to detect when a response has zero overlap with retrieved context. For high-volume production monitoring, heuristic checks are fast and free.
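For concreteness, the overlap score amounts to something like the sketch below. The function name, tokenization, and example strings are my assumptions, not the notebook's exact implementation; it only illustrates why the off-topic case lands at zero.

```python
# Heuristic faithfulness: fraction of the response's unique words that also
# appear in the retrieved context. Minimal sketch; tokenization and names
# are assumptions, not the notebook's exact code.
import re


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(response: str, context: str) -> float:
    """Return the fraction of unique response words found in the context."""
    response_words = _words(response)
    if not response_words:
        return 0.0
    return len(response_words & _words(context)) / len(response_words)


context = "The Eiffel Tower is 330 metres tall and located in Paris."
print(word_overlap("The Eiffel Tower is 330 metres tall.", context))       # 1.0: faithful
print(word_overlap("Bananas are a great source of potassium.", context))   # 0.0: off-topic
```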
LlamaIndex: Best for Existing Users
6/7 features, 19 LoC. BatchEvalRunner with workers=N handles concurrency correctly. Standardized EvaluationResult interface means custom metrics compose cleanly. Only gap: no regression tracking between eval runs.
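The pattern looks roughly like the sketch below: a custom heuristic metric returning EvaluationResult, run concurrently through BatchEvalRunner. It assumes llama_index 0.10-style imports; the OverlapFaithfulnessEvaluator class, its 0.3 passing threshold, and the sample data are illustrative, not the notebook's 19 lines. Like the heuristic path above, it needs no judge LLM and no API key.

```python
# Sketch: a custom heuristic metric plugged into LlamaIndex's batch runner.
# Class name, threshold, and sample data are assumptions, not the notebook's code.
import asyncio
import re

from llama_index.core.evaluation import (
    BaseEvaluator,
    BatchEvalRunner,
    EvaluationResult,
)


class OverlapFaithfulnessEvaluator(BaseEvaluator):
    """Scores faithfulness as word overlap between response and contexts."""

    def _get_prompts(self):
        return {}  # no prompts: this evaluator never calls an LLM

    def _update_prompts(self, prompts):
        pass

    async def aevaluate(self, query=None, response=None, contexts=None, **kwargs):
        resp = set(re.findall(r"\w+", (response or "").lower()))
        ctx = set(re.findall(r"\w+", " ".join(contexts or []).lower()))
        score = len(resp & ctx) / len(resp) if resp else 0.0
        return EvaluationResult(
            query=query,
            response=response,
            contexts=contexts,
            score=score,
            passing=score >= 0.3,  # illustrative threshold
        )


runner = BatchEvalRunner(
    evaluators={"faithfulness": OverlapFaithfulnessEvaluator()},
    workers=4,  # evaluations run concurrently
)

results = asyncio.run(
    runner.aevaluate_response_strs(
        queries=["How tall is the Eiffel Tower?", "Who painted the Mona Lisa?"],
        response_strs=["It is 330 metres tall.", "The capital of Peru is Lima."],
        contexts_list=[
            ["The Eiffel Tower is 330 metres tall."],
            ["The Mona Lisa was painted by Leonardo da Vinci."],
        ],
    )
)

# results is keyed by metric name; each value is a list of EvaluationResult.
for result in results["faithfulness"]:
    print(f"passing={result.passing} score={result.score:.2f}")
```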
LangChain: Evaluation is Your Problem
2/7 features. The langchain.evaluation module was deliberately removed in 1.x. LangChain is an orchestration framework, not an evaluation framework. Teams still relying on it for RAG evaluation either have not noticed the removal or have not replaced the missing functionality.
Regression Tracking Changes Everything
Point-in-time faithfulness scores are useful. Tracked-over-time scores are what you can build a deployment gate on. The EvalSnapshot + EvalRegression pair (SynapseKit only) is what turns evaluation from a pre-launch checklist into continuous quality monitoring.
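The gating pattern itself is framework-agnostic and easy to sketch. Everything below (Snapshot, find_regressions, the 0.05 tolerance, the scores) is illustrative and is not SynapseKit's actual EvalSnapshot/EvalRegression API.

```python
# Framework-agnostic sketch of snapshot + regression gating. Names, thresholds,
# and scores are illustrative assumptions, not SynapseKit's actual API.
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Snapshot:
    """Per-test-case scores from one evaluation run, persisted to disk."""
    run_id: str
    created_at: float
    scores: dict  # test_case_id -> {"faithfulness": float, "relevancy": float}

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)))

    @staticmethod
    def load(path: Path) -> "Snapshot":
        return Snapshot(**json.loads(path.read_text()))


def find_regressions(baseline: Snapshot, candidate: Snapshot, tolerance: float = 0.05):
    """Return (case_id, metric, old, new) for every score that dropped beyond tolerance."""
    regressions = []
    for case_id, old_metrics in baseline.scores.items():
        new_metrics = candidate.scores.get(case_id, {})
        for metric, old in old_metrics.items():
            new = new_metrics.get(metric, 0.0)
            if new < old - tolerance:
                regressions.append((case_id, metric, old, new))
    return regressions


# Deployment gate: compare today's run against the last accepted baseline.
baseline = Snapshot("run-041", time.time() - 86400,
                    {"faithful_case": {"faithfulness": 0.43}})
candidate = Snapshot("run-042", time.time(),
                     {"faithful_case": {"faithfulness": 0.21}})

regressions = find_regressions(baseline, candidate)
if regressions:
    raise SystemExit(f"Blocking deploy, regressions found: {regressions}")
```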