Framework Feature Matrix

LLM evaluation primitives - SynapseKit vs LangChain vs LlamaIndex (Notebook #24)

Each row below shows how the three frameworks implement that feature, with code snippets.
Lines of Code: Faithfulness + Relevancy Evaluation
Fewer lines = less boilerplate. Both tasks evaluated on the same 3 test cases.
SynapseKit: 16 lines
LlamaIndex: 19 lines
LangChain: 21 lines
Faithfulness Evaluator
SynapseKit - built in
from synapsekit.eval import (
    FaithfulnessEvaluator,
    EvaluationPipeline,
)

ev = FaithfulnessEvaluator()
pipeline = EvaluationPipeline(
    evaluators=[ev]
)
Heuristic mode works without an LLM API key. LLM mode uses a judge model. Composable with batch pipeline.
LangChain - NOT built in
# langchain.evaluation removed in 1.x
# Use RAGAS instead:
from ragas.metrics import faithfulness

# Or build with LCEL:
from langchain_core.prompts import (
    ChatPromptTemplate
)
# Write your own judge prompt...
The old load_evaluator("faithfulness") is gone. Teams must add RAGAS/DeepEval as an external dependency or write their own judge chain.
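For illustration, a minimal hand-rolled LCEL judge chain for faithfulness might look like the sketch below. It assumes a chat model bound to llm plus query/response/contexts variables; the prompt wording and YES/NO parsing are my own conventions, not a LangChain-provided evaluator.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Hypothetical DIY faithfulness judge (not a LangChain built-in).
faithfulness_prompt = ChatPromptTemplate.from_template(
    "Reply YES if every claim in the answer is supported by the context, "
    "otherwise reply NO.\n"
    "Context: {context}\n"
    "Answer: {response}\n"
    "Verdict:"
)
faithfulness_chain = faithfulness_prompt | llm | StrOutputParser()

verdict = faithfulness_chain.invoke(
    {"context": "\n".join(contexts), "response": response}
)
score = 1.0 if verdict.strip().upper().startswith("YES") else 0.0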
LlamaIndex - built in
from llama_index.core.evaluation import (
    FaithfulnessEvaluator
)

ev = FaithfulnessEvaluator(llm=llm)
result = await ev.aevaluate(
    query=query,
    response=response,
    contexts=contexts,
)
Returns EvaluationResult with score, passing flag, and feedback. Requires an LLM instance to act as judge.
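The fields can be read straight off the returned object (a small sketch using the result from the snippet above):

# EvaluationResult exposes the judge's verdict
print(result.score)     # e.g. 1.0 or 0.0
print(result.passing)   # bool: did the response pass
print(result.feedback)  # judge model's reasoning text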
Relevancy Evaluator
SynapseKit
from synapsekit.eval import (
    RelevancyEvaluator
)

ev = RelevancyEvaluator(
    mode="heuristic"  # mode="llm" for semantic
)
Heuristic: word overlap fraction. LLM: semantic relevancy scoring with a judge model prompt.
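To make the heuristic concrete, a word-overlap fraction can be approximated in a few lines of plain Python; this is an illustrative sketch of the idea, not SynapseKit's actual implementation:

def word_overlap_fraction(query: str, response: str) -> float:
    # Fraction of distinct query words that also appear in the response.
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & response_words) / len(query_words)

word_overlap_fraction(
    "what is vector search",
    "Vector search finds nearest neighbors",
)  # -> 0.5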
LangChain - build it yourself
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Rate how relevant this response "
    "is to the query. 0.0 to 1.0.\n"
    "Query: {query}\n"
    "Response: {response}\n"
    "Score:"
)
chain = prompt | llm | StrOutputParser()
No native relevancy evaluator. Each team builds their own judge prompt, which means inconsistent rubrics across codebases.
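Invoking that chain and coercing the raw string into a score might look like the sketch below; the float parsing is my own convention, and the judge can return text that does not parse cleanly:

raw = await chain.ainvoke({
    "query": "what is vector search",
    "response": "Vector search finds nearest neighbors",
})
try:
    relevancy_score = float(raw.strip())
except ValueError:
    relevancy_score = 0.0  # judge returned non-numeric text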
LlamaIndex
from llama_index.core.evaluation import (
    RelevancyEvaluator
)

ev = RelevancyEvaluator(llm=llm)
result = await ev.aevaluate(
    query=query,
    response=response,
    contexts=contexts,
)
Same interface as FaithfulnessEvaluator. Fully composable with BatchEvalRunner for multi-metric evaluation runs.
Batch Eval Runner
SynapseKit
results = await pipeline.evaluate_batch(
    dataset=test_cases,
    evaluators=[
        FaithfulnessEvaluator(),
        RelevancyEvaluator(),
    ]
)
# Returns aggregate stats
# + per-sample breakdown
Handles async concurrency automatically. Returns EvalResults with aggregate stats and per-sample detail in one call.
LangChain - loop manually
# No BatchEvalRunner
results = []
for case in test_cases:
    r = await chain.ainvoke({
        "query": case.query,
        "response": case.response,
    })
    results.append(r)
# Aggregate results yourself
No batch infrastructure. Teams write their own loops, handle rate limiting, and aggregate results. Leads to inconsistent eval tooling across projects.
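Rate limiting and aggregation then also land on the team. A minimal sketch with asyncio, reusing the judge chain and test cases from above (score_one is a hypothetical helper name):

import asyncio

semaphore = asyncio.Semaphore(4)  # crude rate limiting: at most 4 calls in flight

async def score_one(case):
    async with semaphore:
        raw = await chain.ainvoke(
            {"query": case.query, "response": case.response}
        )
    try:
        return float(raw.strip())
    except ValueError:
        return 0.0

scores = await asyncio.gather(*(score_one(c) for c in test_cases))
mean_score = sum(scores) / len(scores)  # aggregate yourself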
LlamaIndex
from llama_index.core.evaluation import (
    BatchEvalRunner
)

runner = BatchEvalRunner(
    evaluators={
        "faith": FaithfulnessEvaluator(),
        "relev": RelevancyEvaluator(),
    },
    workers=4,  # concurrent
)
eval_results = await runner.aevaluate_queries(
    queries=queries, ...
)
workers=N runs N concurrent evaluations. Returns dict keyed by evaluator name. Most complete batch infrastructure of the three frameworks.
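Aggregation is still manual but straightforward, since every entry is an EvaluationResult; a sketch, assuming the eval_results dict from the snippet above:

# eval_results maps evaluator name -> list of EvaluationResult
for name, results in eval_results.items():
    scores = [r.score for r in results if r.score is not None]
    pass_rate = sum(1 for r in results if r.passing) / len(results)
    print(name, sum(scores) / len(scores), pass_rate)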
Regression Tracking (unique to SynapseKit)
SynapseKit - only framework with this
from synapsekit.eval import (
    EvalSnapshot,
    EvalRegression,
)

snap_v1 = EvalSnapshot.capture(
    results=results_before,
    tag="embedding-v1"
)
snap_v2 = EvalSnapshot.capture(
    results=results_after,
    tag="embedding-v2"
)
reg = EvalRegression.compare(
    snap_v1, snap_v2
)
# reg.faithfulness_delta = -0.08
# reg.passed = False  # gate failed
EvalSnapshot stores timestamped eval state. EvalRegression computes drift between snapshots. Use as a deployment gate: fail if faithfulness drops more than threshold.
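A deployment-gate sketch built only on the attributes shown above; where and how the threshold is configured is an assumption on my part:

import sys

# Fail the CI job if the comparison crossed the regression threshold.
if not reg.passed:
    print(f"Faithfulness regressed by {reg.faithfulness_delta:.2f}; blocking deploy")
    sys.exit(1)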
LangChain - not available
# No regression tracking primitives
# Teams either:
# 1. Track manually in spreadsheets
# 2. Build their own snapshot system
# 3. Use external MLflow/W&B logging
# 4. Skip regression tracking entirely
#    (option 4 is most common)
No regression infrastructure. Most teams skip it, which means evaluation becomes a one-time exercise rather than a continuous quality gate.
LlamaIndex - not available
# No EvalSnapshot equivalent
# BatchEvalRunner returns results
# but does not persist or compare
# them across runs
# Teams must build their own
# snapshot + comparison logic
# using the raw result dicts
LlamaIndex has the best batch evaluation of the three but lacks regression tracking. Teams using LlamaIndex must build their own snapshot persistence layer.
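A minimal home-grown version of that layer might just dump per-metric mean scores to JSON and diff them across runs; a sketch under those assumptions, reusing the eval_results dict from the BatchEvalRunner row:

import json

def save_snapshot(eval_results, path):
    # Persist mean score per evaluator so later runs can be compared.
    summary = {
        name: sum(r.score for r in results) / len(results)
        for name, results in eval_results.items()
    }
    with open(path, "w") as f:
        json.dump(summary, f)

def compare_snapshots(old_path, new_path):
    # Positive delta = improvement, negative = regression.
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    return {name: new[name] - old.get(name, 0.0) for name in new}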
Custom Metrics
SynapseKit
from synapsekit.eval import BaseEvaluator, EvalResult

class ToxicityEvaluator(BaseEvaluator):
    async def aevaluate(
        self, query, response, contexts
    ):
        score = await self._judge(response)
        return EvalResult(score=score)
Extend BaseEvaluator. Async-native by design. Plugs directly into EvaluationPipeline and batch runner.
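Plugging the custom metric in then looks like the pipeline construction shown earlier; a sketch, assuming the SynapseKit API from the snippets above:

pipeline = EvaluationPipeline(
    evaluators=[
        FaithfulnessEvaluator(),
        RelevancyEvaluator(),
        ToxicityEvaluator(),  # custom metric runs alongside built-ins
    ]
)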
LangChain
# Any LCEL chain can be a metric
from langchain_core.runnables import (
    RunnableLambda
)

def score_toxicity(inputs):
    return {"score": detect_toxicity(
        inputs["response"]
    )}

chain = RunnableLambda(score_toxicity)
Maximum flexibility - any LCEL chain becomes a metric. No standardized return format, which makes aggregation harder across multiple metrics.
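One way to tame aggregation is to fan metrics out with RunnableParallel so each run returns a single dict; the metric names, relevancy_chain, and detect_toxicity helpers here are assumptions, not LangChain-provided pieces:

from langchain_core.runnables import RunnableParallel

# Run several ad-hoc metric chains over the same inputs in one call.
metrics = RunnableParallel(
    toxicity=chain,             # the RunnableLambda above
    relevancy=relevancy_chain,  # e.g. the judge chain from the Relevancy row
)
scores = metrics.invoke({"query": query, "response": response})
# -> {"toxicity": {"score": ...}, "relevancy": "..."}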
LlamaIndex
from llama_index.core.evaluation import (
    BaseEvaluator,
    EvaluationResult,
)

class ToxicityEvaluator(BaseEvaluator):
    async def aevaluate(
        self,
        query=None,
        response=None,
        contexts=None,
        **kwargs
    ) -> EvaluationResult:
        ...
Extend BaseEvaluator. Standardized EvaluationResult return type means custom metrics compose seamlessly with BatchEvalRunner.
Feature Score (of 7)
SynapseKit: 7/7
LangChain: 2/7
LlamaIndex: 6/7
www.engineersofai.com - AI Letters #32