Notebook #24 - LLM Evaluation Framework Showdown. 3 test cases, 3 frameworks, heuristic evaluation (no API key required).
Lines of Code Comparison
Faithfulness + relevancy evaluation on 3 test cases
Feature Coverage (of 7)
Faithfulness, relevancy, groundedness, batch runner, custom metrics, async, regression tracking
Score Separation Across Test Cases
A working evaluator must clearly separate the faithful case (high score) from the unfaithful case (medium) and the off-topic case (zero).
Heuristic Evaluation Works
Word overlap alone separates the three cases: 0.43 vs 0.26 vs 0.00. You do not need GPT-4-as-judge to detect when a response has zero overlap with retrieved context. For high-volume production monitoring, heuristic checks are fast and free.
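For concreteness, the overlap score amounts to something like the sketch below. The function name, tokenization, and example strings are my assumptions, not the notebook's exact implementation; it only illustrates why the off-topic case lands at zero.

```python
# Heuristic faithfulness: fraction of the response's unique words that also
# appear in the retrieved context. Minimal sketch; tokenization and names
# are assumptions, not the notebook's exact code.
import re


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(response: str, context: str) -> float:
    """Return the fraction of unique response words found in the context."""
    response_words = _words(response)
    if not response_words:
        return 0.0
    return len(response_words & _words(context)) / len(response_words)


context = "The Eiffel Tower is 330 metres tall and located in Paris."
print(word_overlap("The Eiffel Tower is 330 metres tall.", context))       # 1.0: faithful
print(word_overlap("Bananas are a great source of potassium.", context))   # 0.0: off-topic
```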
LlamaIndex: Best for Existing Users
6/7 features, 19 LoC. BatchEvalRunner with workers=N handles concurrency correctly. Standardized EvaluationResult interface means custom metrics compose cleanly. Only gap: no regression tracking between eval runs.
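The pattern looks roughly like the sketch below: a custom heuristic metric returning EvaluationResult, run concurrently through BatchEvalRunner. It assumes llama_index 0.10-style imports; the OverlapFaithfulnessEvaluator class, its 0.3 passing threshold, and the sample data are illustrative, not the notebook's 19 lines. Like the heuristic path above, it needs no judge LLM and no API key.

```python
# Sketch: a custom heuristic metric plugged into LlamaIndex's batch runner.
# Class name, threshold, and sample data are assumptions, not the notebook's code.
import asyncio
import re

from llama_index.core.evaluation import (
    BaseEvaluator,
    BatchEvalRunner,
    EvaluationResult,
)


class OverlapFaithfulnessEvaluator(BaseEvaluator):
    """Scores faithfulness as word overlap between response and contexts."""

    def _get_prompts(self):
        return {}  # no prompts: this evaluator never calls an LLM

    def _update_prompts(self, prompts):
        pass

    async def aevaluate(self, query=None, response=None, contexts=None, **kwargs):
        resp = set(re.findall(r"\w+", (response or "").lower()))
        ctx = set(re.findall(r"\w+", " ".join(contexts or []).lower()))
        score = len(resp & ctx) / len(resp) if resp else 0.0
        return EvaluationResult(
            query=query,
            response=response,
            contexts=contexts,
            score=score,
            passing=score >= 0.3,  # illustrative threshold
        )


runner = BatchEvalRunner(
    evaluators={"faithfulness": OverlapFaithfulnessEvaluator()},
    workers=4,  # evaluations run concurrently
)

results = asyncio.run(
    runner.aevaluate_response_strs(
        queries=["How tall is the Eiffel Tower?", "Who painted the Mona Lisa?"],
        response_strs=["It is 330 metres tall.", "The capital of Peru is Lima."],
        contexts_list=[
            ["The Eiffel Tower is 330 metres tall."],
            ["The Mona Lisa was painted by Leonardo da Vinci."],
        ],
    )
)

# results is keyed by metric name; each value is a list of EvaluationResult.
for result in results["faithfulness"]:
    print(f"passing={result.passing} score={result.score:.2f}")
```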
LangChain: Evaluation is Your Problem
2/7 features. The langchain.evaluation module was deliberately removed in 1.x. LangChain is an orchestration framework, not an evaluation framework. Teams still relying on it for RAG evaluation either have not noticed the removal or have not replaced the missing functionality.
Regression Tracking Changes Everything
Point-in-time faithfulness scores are useful. Tracked-over-time scores are what you can build a deployment gate on. The EvalSnapshot + EvalRegression pair (SynapseKit only) is what turns evaluation from a pre-launch checklist into continuous quality monitoring.
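The gating pattern itself is framework-agnostic and easy to sketch. Everything below (Snapshot, find_regressions, the 0.05 tolerance, the scores) is illustrative and is not SynapseKit's actual EvalSnapshot/EvalRegression API.

```python
# Framework-agnostic sketch of snapshot + regression gating. Names, thresholds,
# and scores are illustrative assumptions, not SynapseKit's actual API.
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Snapshot:
    """Per-test-case scores from one evaluation run, persisted to disk."""
    run_id: str
    created_at: float
    scores: dict  # test_case_id -> {"faithfulness": float, "relevancy": float}

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)))

    @staticmethod
    def load(path: Path) -> "Snapshot":
        return Snapshot(**json.loads(path.read_text()))


def find_regressions(baseline: Snapshot, candidate: Snapshot, tolerance: float = 0.05):
    """Return (case_id, metric, old, new) for every score that dropped beyond tolerance."""
    regressions = []
    for case_id, old_metrics in baseline.scores.items():
        new_metrics = candidate.scores.get(case_id, {})
        for metric, old in old_metrics.items():
            new = new_metrics.get(metric, 0.0)
            if new < old - tolerance:
                regressions.append((case_id, metric, old, new))
    return regressions


# Deployment gate: compare today's run against the last accepted baseline.
baseline = Snapshot("run-041", time.time() - 86400,
                    {"faithful_case": {"faithfulness": 0.43}})
candidate = Snapshot("run-042", time.time(),
                     {"faithful_case": {"faithfulness": 0.21}})

regressions = find_regressions(baseline, candidate)
if regressions:
    raise SystemExit(f"Blocking deploy, regressions found: {regressions}")
```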