History of LLM Evaluation

From BLEU (2002) to LLM-as-judge (2023+) - how the field evolved from word-overlap heuristics to model-graded faithfulness

[Chart: Evaluation Method Capabilities Over Time - higher = more aligned with human judgment of correctness]
Era 1: Reference-Based Metrics (2002 - 2015)
2002
BLEU Score - Machine Translation
Papineni et al. introduce n-gram precision as a proxy for translation quality. Fast, cheap, and model-free, but it requires human reference translations and is blind to meaning. "The cat sat on the mat" and "A feline rested upon the rug" score zero despite perfect semantic alignment.
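To make that blindness concrete, here is a toy sentence-level BLEU - geometric mean of modified n-gram precisions, with smoothing and the brevity penalty omitted. An illustration, not the reference implementation:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def toy_bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions, n = 1..4.
    Smoothing and the brevity penalty are omitted for clarity."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    score = 1.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        precision = overlap / max(sum(cand_counts.values()), 1)
        if precision == 0:
            return 0.0  # any empty n-gram level zeroes the geometric mean
        score *= precision ** (1 / max_n)
    return score

# A perfect paraphrase gets zero credit: no shared bigram, trigram, or 4-gram.
print(toy_bleu("A feline rested upon the rug", "The cat sat on the mat"))  # 0.0
```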
2004
ROUGE - Summarization Evaluation
Lin introduces a recall-oriented BLEU variant for summarization. It becomes the standard for measuring abstractive summarization quality. Same fundamental flaw: word overlap without understanding.
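The recall flip is the whole difference. A toy ROUGE-1 recall, with the same caveats as the BLEU sketch above:

```python
from collections import Counter

def toy_rouge1_recall(candidate, reference):
    """What fraction of the reference's unigrams does the candidate
    recover? Clipped counts, no stemming or stopword handling."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)

# A scrambled copy of the reference scores perfectly - overlap is not meaning.
print(toy_rouge1_recall("the cat the mat sat on", "the cat sat on the mat"))  # 1.0
```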
2005
METEOR - A BERTScore Predecessor
Banerjee and Lavie add synonym matching (and, in later versions, paraphrase tables) to address BLEU's surface-form blindness. Still requires human-written reference answers. Fails on open-ended generation where many correct answers exist.
Era 2: Embedding-Based Metrics (2018 - 2022)
2019
BERTScore Embedding Similarity
Zhang et al. compare candidate and reference token-by-token using contextual BERT embeddings and cosine similarity. Finally captures semantic similarity - "cat" and "feline" score high. Still requires reference answers. Correlation with human judgment improves significantly over BLEU.
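A minimal usage sketch with the bert-score package (the model downloads on first call; the exact default checkpoint varies by version):

```python
# pip install bert-score
from bert_score import score

candidates = ["A feline rested upon the rug"]
references = ["The cat sat on the mat"]

# Token embeddings are greedily matched by cosine similarity,
# then aggregated into precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")  # high, unlike the 0.0 BLEU above
```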
2020
RAG Arrives - Evaluation Gets Harder
Lewis et al. (Facebook) publish RAG. The evaluation problem splits in two: (1) is the answer faithful to retrieved context? (2) did we retrieve the right context? BERTScore answers neither question. New metrics needed.
2022
Faithfulness as a Metric Concept
Teams begin measuring whether generated answers are entailed by retrieved documents. Early approaches use NLI (natural language inference) models - slow, require fine-tuning, and show mediocre performance on real RAG outputs.
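A sketch of the NLI approach with a stock MNLI checkpoint from Hugging Face (the model name and label order below are specific to roberta-large-mnli; other checkpoints differ):

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis): the retrieved context is the
    premise, the generated answer is the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # roberta-large-mnli label order: contradiction, neutral, entailment
    return logits.softmax(dim=-1)[0, 2].item()

context = "The warranty covers manufacturing defects for 24 months."
print(entailment_prob(context, "The warranty lasts two years."))
```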
Era 3: LLM-as-Judge (2023 - present)
2023 Q1
GPT-4 as Evaluator Breakthrough
Teams discover that GPT-4 grading responses correlates better with human judgment than BERTScore or BLEU on most tasks. OpenAI publishes Evals framework. The LLM-as-judge paradigm is established.
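The core pattern fits in a dozen lines. A minimal judge against the current OpenAI Python client - the rubric and PASS/FAIL protocol here are our own illustration, not OpenAI's:

```python
# pip install openai   (expects OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer for faithfulness to a source.
Source: {context}
Answer: {answer}
Reply with exactly one word: PASS if every claim in the answer is supported
by the source, FAIL otherwise."""

def judge_faithfulness(context: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the grader as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, answer=answer),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```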
2023 Q3
RAGAS Framework Released
Shahul Es et al. release RAGAS - the first dedicated framework for RAG evaluation. Metrics: faithfulness, answer relevance, context precision, context recall. Uses an LLM to grade LLM outputs. Becomes the most cited RAG eval framework.
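A usage sketch following the classic RAGAS API - imports, metric names, and dataset column names have been reorganized in newer versions, so treat this as illustrative:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

data = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "answer": ["Manufacturing defects, for 24 months."],
    "contexts": [["The warranty covers manufacturing defects for 24 months."]],
    "ground_truth": ["Manufacturing defects for 24 months."],
})

# Every metric is itself scored by an LLM judge under the hood.
result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)
```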
2023 Q4
LangChain Removes Evaluation Module - Breaking Change
LangChain 1.x ships without the langchain.evaluation module. Teams relying on load_evaluator() discover the API is gone. LangChain points teams to RAGAS or DeepEval as external dependencies.
2024
Framework Divergence
LlamaIndex ships built-in FaithfulnessEvaluator, RelevancyEvaluator, BatchEvalRunner. SynapseKit adds EvalSnapshot and EvalRegression for regression tracking. LangChain remains evaluation-free by design.
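A sketch of the LlamaIndex side, assuming the llama_index.core import layout (paths have shifted across versions, so verify against your installed release):

```python
# pip install llama-index
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.llms.openai import OpenAI

evaluator = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))
result = evaluator.evaluate(
    query="What does the warranty cover?",
    response="Manufacturing defects, for 24 months.",
    contexts=["The warranty covers manufacturing defects for 24 months."],
)
print(result.passing, result.score)  # EvaluationResult: pass/fail plus a score
```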
2025 - 2026
Continuous Evaluation as Standard Practice
Teams run nightly eval pipelines, wire regression gates into CI/CD, and automate drift detection. Heuristic evaluators (word overlap) handle low-latency checks; LLM-as-judge handles high-stakes validation. The evaluation stack matures.
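As a concrete example of a regression gate, a sketch that fails the CI job when any metric drops more than a tolerance below a stored baseline - the file names and the 0.02 tolerance are illustrative choices, not a standard:

```python
# regression_gate.py - exit nonzero (blocking the merge) on eval regressions.
import json
import sys

TOLERANCE = 0.02  # allow a little judge noise before failing the build

def main() -> int:
    with open("baseline.json") as f:   # e.g. {"faithfulness": 0.91, ...}
        baseline = json.load(f)
    with open("current.json") as f:    # scores from tonight's eval run
        current = json.load(f)
    failed = False
    for metric, base in baseline.items():
        now = current.get(metric, 0.0)
        if now < base - TOLERANCE:
            print(f"REGRESSION {metric}: {now:.3f} < baseline {base:.3f}")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())
```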
www.engineersofai.com - AI Letters #32