Framework Feature Matrix

LLM evaluation primitives - SynapseKit vs LangChain vs LlamaIndex (Notebook #24)

Each row below shows how the three frameworks implement that feature, with code snippets.
Lines of Code: Faithfulness + Relevancy Evaluation
Fewer lines = less boilerplate. Both tasks evaluated on the same 3 test cases.
SynapseKit: 16 lines
LlamaIndex: 19 lines
LangChain: 21 lines
Faithfulness Evaluator
SynapseKit - built in
from synapsekit.eval import (
    FaithfulnessEvaluator,
    EvaluationPipeline,
)

ev = FaithfulnessEvaluator()
pipeline = EvaluationPipeline(
    evaluators=[ev]
)
Heuristic mode works without an LLM API key. LLM mode uses a judge model. Composable with batch pipeline.
LangChain - NOT built in
# langchain.evaluation removed in 1.x
# Use RAGAS instead:
from ragas.metrics import faithfulness

# Or build with LCEL:
from langchain_core.prompts import (
    ChatPromptTemplate
)
# Write your own judge prompt...
The old load_evaluator("faithfulness") is gone. Teams must add RAGAS/DeepEval as an external dependency or write their own judge chain.
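For illustration, a minimal hand-rolled LCEL judge chain for faithfulness might look like the sketch below. It assumes a chat model bound to llm plus query/response/contexts variables; the prompt wording and YES/NO parsing are my own conventions, not a LangChain-provided evaluator.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Hypothetical DIY faithfulness judge (not a LangChain built-in).
faithfulness_prompt = ChatPromptTemplate.from_template(
    "Reply YES if every claim in the answer is supported by the context, "
    "otherwise reply NO.\n"
    "Context: {context}\n"
    "Answer: {response}\n"
    "Verdict:"
)
faithfulness_chain = faithfulness_prompt | llm | StrOutputParser()

verdict = faithfulness_chain.invoke(
    {"context": "\n".join(contexts), "response": response}
)
score = 1.0 if verdict.strip().upper().startswith("YES") else 0.0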
LlamaIndex - built in
from llama_index.core.evaluation import (
    FaithfulnessEvaluator
)

ev = FaithfulnessEvaluator(llm=llm)
result = await ev.aevaluate(
    query=query,
    response=response,
    contexts=contexts,
)
Returns EvaluationResult with score, passing flag, and feedback. Requires an LLM instance to act as judge.
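The fields can be read straight off the returned object (a small sketch using the result from the snippet above):

# EvaluationResult exposes the judge's verdict
print(result.score)     # e.g. 1.0 or 0.0
print(result.passing)   # bool: did the response pass
print(result.feedback)  # judge model's reasoning text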
Relevancy Evaluator
SynapseKit
from synapsekit.eval import (
    RelevancyEvaluator
)

ev = RelevancyEvaluator(
    mode="heuristic"  # mode="llm" for semantic
)
Heuristic: word overlap fraction. LLM: semantic relevancy scoring with a judge model prompt.
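To make the heuristic concrete, a word-overlap fraction can be approximated in a few lines of plain Python; this is an illustrative sketch of the idea, not SynapseKit's actual implementation:

def word_overlap_fraction(query: str, response: str) -> float:
    # Fraction of distinct query words that also appear in the response.
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & response_words) / len(query_words)

word_overlap_fraction(
    "what is vector search",
    "Vector search finds nearest neighbors",
)  # -> 0.5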
LangChain - build it yourself
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Rate how relevant this response "
    "is to the query. 0.0 to 1.0.\n"
    "Query: {query}\n"
    "Response: {response}\n"
    "Score:"
)
chain = prompt | llm | StrOutputParser()
No native relevancy evaluator. Each team builds their own judge prompt, which means inconsistent rubrics across codebases.
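Invoking that chain and coercing the raw string into a score might look like the sketch below; the float parsing is my own convention, and the judge can return text that does not parse cleanly:

raw = await chain.ainvoke({
    "query": "what is vector search",
    "response": "Vector search finds nearest neighbors",
})
try:
    relevancy_score = float(raw.strip())
except ValueError:
    relevancy_score = 0.0  # judge returned non-numeric text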
LlamaIndex
from llama_index.core.evaluation import (
    RelevancyEvaluator
)

ev = RelevancyEvaluator(llm=llm)
result = await ev.aevaluate(
    query=query,
    response=response,
    contexts=contexts,
)
Same interface as FaithfulnessEvaluator. Fully composable with BatchEvalRunner for multi-metric evaluation runs.
Batch Eval Runner
SynapseKit
results = await pipeline.evaluate_batch(
    dataset=test_cases,
    evaluators=[
        FaithfulnessEvaluator(),
        RelevancyEvaluator(),
    ]
)
# Returns aggregate stats
# + per-sample breakdown
Handles async concurrency automatically. Returns EvalResults with aggregate stats and per-sample detail in one call.
LangChain - loop manually
# No BatchEvalRunner
results = []
for case in test_cases:
    r = await chain.ainvoke({
        "query": case.query,
        "response": case.response,
    })
    results.append(r)
# Aggregate results yourself
No batch infrastructure. Teams write their own loops, handle rate limiting, and aggregate results. Leads to inconsistent eval tooling across projects.
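Rate limiting and aggregation then also land on the team. A minimal sketch with asyncio, reusing the judge chain and test cases from above (score_one is a hypothetical helper name):

import asyncio

semaphore = asyncio.Semaphore(4)  # crude rate limiting: at most 4 calls in flight

async def score_one(case):
    async with semaphore:
        raw = await chain.ainvoke(
            {"query": case.query, "response": case.response}
        )
    try:
        return float(raw.strip())
    except ValueError:
        return 0.0

scores = await asyncio.gather(*(score_one(c) for c in test_cases))
mean_score = sum(scores) / len(scores)  # aggregate yourself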
LlamaIndex
from llama_index.core.evaluation import (
    BatchEvalRunner
)

runner = BatchEvalRunner(
    evaluators={
        "faith": FaithfulnessEvaluator(),
        "relev": RelevancyEvaluator(),
    },
    workers=4,  # concurrent
)
eval_results = await runner.aevaluate_queries(
    queries=queries, ...
)
workers=N runs N concurrent evaluations. Returns dict keyed by evaluator name. Most complete batch infrastructure of the three frameworks.
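Aggregation is still manual but straightforward, since every entry is an EvaluationResult; a sketch, assuming the eval_results dict from the snippet above:

# eval_results maps evaluator name -> list of EvaluationResult
for name, results in eval_results.items():
    scores = [r.score for r in results if r.score is not None]
    pass_rate = sum(1 for r in results if r.passing) / len(results)
    print(name, sum(scores) / len(scores), pass_rate)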
Regression Tracking (unique to SynapseKit)
SynapseKit - only framework with this
from synapsekit.eval import (
    EvalSnapshot,
    EvalRegression,
)

snap_v1 = EvalSnapshot.capture(
    results=results_before,
    tag="embedding-v1"
)
snap_v2 = EvalSnapshot.capture(
    results=results_after,
    tag="embedding-v2"
)
reg = EvalRegression.compare(
    snap_v1, snap_v2
)
# reg.faithfulness_delta = -0.08
# reg.passed = False  # gate failed
EvalSnapshot stores timestamped eval state. EvalRegression computes drift between snapshots. Use as a deployment gate: fail if faithfulness drops more than threshold.
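A deployment-gate sketch built only on the attributes shown above; where and how the threshold is configured is an assumption on my part:

import sys

# Fail the CI job if the comparison crossed the regression threshold.
if not reg.passed:
    print(f"Faithfulness regressed by {reg.faithfulness_delta:.2f}; blocking deploy")
    sys.exit(1)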
LangChain - not available
# No regression tracking primitives
# Teams either:
# 1. Track manually in spreadsheets
# 2. Build their own snapshot system
# 3. Use external MLflow/W&B logging
# 4. Skip regression tracking entirely
#    (option 4 is most common)
No regression infrastructure. Most teams skip it, which means evaluation becomes a one-time exercise rather than a continuous quality gate.
LlamaIndex - not available
# No EvalSnapshot equivalent
# BatchEvalRunner returns results
# but does not persist or compare
# them across runs
# Teams must build their own
# snapshot + comparison logic
# using the raw result dicts
LlamaIndex has the best batch evaluation of the three but lacks regression tracking. Teams using LlamaIndex must build their own snapshot persistence layer.
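A minimal home-grown version of that layer might just dump per-metric mean scores to JSON and diff them across runs; a sketch under those assumptions, reusing the eval_results dict from the BatchEvalRunner row:

import json

def save_snapshot(eval_results, path):
    # Persist mean score per evaluator so later runs can be compared.
    summary = {
        name: sum(r.score for r in results) / len(results)
        for name, results in eval_results.items()
    }
    with open(path, "w") as f:
        json.dump(summary, f)

def compare_snapshots(old_path, new_path):
    # Positive delta = improvement, negative = regression.
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    return {name: new[name] - old.get(name, 0.0) for name in new}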
Custom Metrics
SynapseKit
from synapsekit.eval import BaseEvaluator, EvalResult

class ToxicityEvaluator(BaseEvaluator):
    async def aevaluate(
        self, query, response, contexts
    ):
        score = await self._judge(response)
        return EvalResult(score=score)
Extend BaseEvaluator. Async-native by design. Plugs directly into EvaluationPipeline and batch runner.
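Plugging the custom metric in then looks like the pipeline construction shown earlier; a sketch, assuming the SynapseKit API from the snippets above:

pipeline = EvaluationPipeline(
    evaluators=[
        FaithfulnessEvaluator(),
        RelevancyEvaluator(),
        ToxicityEvaluator(),  # custom metric runs alongside built-ins
    ]
)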
LangChain
# Any LCEL chain can be a metric
from langchain_core.runnables import (
    RunnableLambda
)

def score_toxicity(inputs):
    return {"score": detect_toxicity(
        inputs["response"]
    )}

chain = RunnableLambda(score_toxicity)
Maximum flexibility - any LCEL chain becomes a metric. No standardized return format, which makes aggregation harder across multiple metrics.
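One way to tame aggregation is to fan metrics out with RunnableParallel so each run returns a single dict; the metric names, relevancy_chain, and detect_toxicity helpers here are assumptions, not LangChain-provided pieces:

from langchain_core.runnables import RunnableParallel

# Run several ad-hoc metric chains over the same inputs in one call.
metrics = RunnableParallel(
    toxicity=chain,             # the RunnableLambda above
    relevancy=relevancy_chain,  # e.g. the judge chain from the Relevancy row
)
scores = metrics.invoke({"query": query, "response": response})
# -> {"toxicity": {"score": ...}, "relevancy": "..."}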
LlamaIndex
from llama_index.core.evaluation import (
    BaseEvaluator,
    EvaluationResult,
)

class ToxicityEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query=None,
        response=None,
        contexts=None,
        **kwargs
    ) -> EvaluationResult:
        ...
Extend BaseEvaluator. Standardized EvaluationResult return type means custom metrics compose seamlessly with BatchEvalRunner.
Feature Score (of 7)
SynapseKit: 7/7
LangChain: 2/7
LlamaIndex: 6/7
www.engineersofai.com - AI Letters #32