Benchmark Results

Notebook #24 - LLM Evaluation Framework Showdown. 3 test cases, 3 frameworks, heuristic evaluation (no API key required).

SynapseKit feature score: 7/7
LlamaIndex feature score: 6/7
LangChain feature score: 2/7
Frameworks with regression tracking: 1 (SynapseKit)
Lines of Code Comparison (chart): faithfulness + relevancy evaluation on 3 test cases
Feature Coverage (chart, out of 7): faithfulness, relevancy, groundedness, batch runner, custom metrics, async, regression tracking
Heuristic Evaluation Scores by Test Case
No LLM API required - uses word overlap

Scenario    Expected  Faithfulness                 Relevancy                Mean Score
Faithful    HIGH      0.52 (52% words in context)  0.33 (query coverage)    0.43
Unfaithful  LOW       0.19 (contradicts context)   0.33 (query coverage)    0.26
Off-Topic   ZERO      0.00 (no context overlap)    0.00 (no query overlap)  0.00
Score Separation Across Test Cases (chart): a working evaluator must clearly separate faithful (high) from unfaithful (medium) from off-topic (zero).
Heuristic Evaluation Works
Word overlap alone separates the three cases: 0.43 vs 0.26 vs 0.00. You do not need GPT-4-as-judge to detect when a response has zero overlap with retrieved context. For high-volume production monitoring, heuristic checks are fast and free.
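The word-overlap heuristic can be sketched in a few lines. This is one plausible implementation of the idea described above, not the notebook's exact code; tokenization and normalization choices (stop words, stemming) would shift the absolute scores.

```python
def faithfulness(response: str, context: str) -> float:
    """Fraction of response words that also appear in the retrieved context."""
    resp = set(response.lower().split())
    ctx = set(context.lower().split())
    return len(resp & ctx) / len(resp) if resp else 0.0

def relevancy(response: str, query: str) -> float:
    """Fraction of query words covered by the response."""
    q = set(query.lower().split())
    resp = set(response.lower().split())
    return len(q & resp) / len(q) if q else 0.0

def mean_score(response: str, context: str, query: str) -> float:
    """Mean of the two heuristic scores, as in the table above."""
    return (faithfulness(response, context) + relevancy(response, query)) / 2
```

An off-topic response shares no words with the context or the query, so both scores, and their mean, are exactly 0.00: the separation comes for free, with no model call.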
LlamaIndex: Best for Existing Users
6/7 features, 19 LoC. BatchEvalRunner with workers=N handles concurrency correctly. Standardized EvaluationResult interface means custom metrics compose cleanly. Only gap: no regression tracking between eval runs.
LangChain: Evaluation is Your Problem
2/7 features. The langchain.evaluation module was deliberately removed in 1.x. LangChain is an orchestration framework, not an evaluation framework. Teams relying on it for RAG evaluation either do not know this or have not replaced it.
Regression Tracking Changes Everything
Point-in-time faithfulness scores are useful. Tracked-over-time scores are what you can build a deployment gate on. EvalSnapshot + EvalRegression (SynapseKit only) is the feature that turns evaluation from a pre-launch checklist into continuous quality monitoring.
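The snapshot-and-compare idea behind a deployment gate is simple enough to sketch. This is a minimal file-based illustration of the concept, not SynapseKit's actual EvalSnapshot/EvalRegression API; the snapshot path and threshold are assumptions.

```python
import json
from pathlib import Path

SNAPSHOT = Path("eval_snapshot.json")  # hypothetical snapshot location
THRESHOLD = 0.05                       # max allowed drop per metric (assumed)

def save_snapshot(scores: dict) -> None:
    """Persist the current eval run's mean scores as the new baseline."""
    SNAPSHOT.write_text(json.dumps(scores))

def check_regression(scores: dict) -> list:
    """Return metrics whose score dropped more than THRESHOLD vs. the baseline."""
    if not SNAPSHOT.exists():
        return []  # no baseline yet: nothing to gate on
    baseline = json.loads(SNAPSHOT.read_text())
    return [m for m, s in scores.items()
            if m in baseline and baseline[m] - s > THRESHOLD]
```

A CI gate then becomes one line: fail the deploy if `check_regression(current_scores)` is non-empty, otherwise save the new snapshot.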
www.engineersofai.com - AI Letters #32