History of LLM Evaluation

From BLEU (2002) to LLM-as-judge (2023+) - how the field evolved from word-overlap heuristics to model-graded faithfulness

[Chart: Evaluation Method Capabilities Over Time - higher = more aligned with human judgment of correctness]
Era 1: Reference-Based Metrics (2002 - 2015)
2002
BLEU Score - Machine Translation
Papineni et al. introduce n-gram precision as a proxy for translation quality. Fast, cheap, and model-free, but it requires human reference translations and is blind to meaning. "The cat sat on the mat" and "A feline rested upon the rug" score zero despite perfect semantic alignment.
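To make that blindness concrete, here is a toy sentence-level BLEU - geometric mean of modified n-gram precisions, with smoothing and the brevity penalty omitted. An illustration, not the reference implementation:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def toy_bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions, n = 1..4.
    Smoothing and the brevity penalty are omitted for clarity."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    score = 1.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        precision = overlap / max(sum(cand_counts.values()), 1)
        if precision == 0:
            return 0.0  # any empty n-gram level zeroes the geometric mean
        score *= precision ** (1 / max_n)
    return score

# A perfect paraphrase gets zero credit: no shared bigram, trigram, or 4-gram.
print(toy_bleu("A feline rested upon the rug", "The cat sat on the mat"))  # 0.0
```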
2004
ROUGE - Summarization Evaluation
Lin introduces a recall-oriented BLEU variant for summarization. It becomes the standard for measuring abstractive summarization quality. Same fundamental flaw: word overlap without understanding.
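The recall flip is the whole difference. A toy ROUGE-1 recall, with the same caveats as the BLEU sketch above:

```python
from collections import Counter

def toy_rouge1_recall(candidate, reference):
    """What fraction of the reference's unigrams does the candidate
    recover? Clipped counts, no stemming or stopword handling."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)

# A scrambled copy of the reference scores perfectly - overlap is not meaning.
print(toy_rouge1_recall("the cat the mat sat on", "the cat sat on the mat"))  # 1.0
```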
2005
METEOR - A BERTScore Predecessor
Banerjee and Lavie add synonym matching (and, in later versions, paraphrase tables) to address BLEU's surface-form blindness. Still requires human-written reference answers. Fails on open-ended generation where many correct answers exist.
Era 2: Embedding-Based Metrics (2018 - 2022)
2019
BERTScore Embedding Similarity
Zhang et al. compare candidate and reference token-by-token using contextual BERT embeddings and cosine similarity. Finally captures semantic similarity - "cat" and "feline" score high. Still requires reference answers. Correlation with human judgment improves significantly over BLEU.
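A minimal usage sketch with the bert-score package (the model downloads on first call; the exact default checkpoint varies by version):

```python
# pip install bert-score
from bert_score import score

candidates = ["A feline rested upon the rug"]
references = ["The cat sat on the mat"]

# Token embeddings are greedily matched by cosine similarity,
# then aggregated into precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")  # high, unlike the 0.0 BLEU above
```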
2020
RAG Arrives - Evaluation Gets Harder
Lewis et al. (Facebook) publish RAG. The evaluation problem splits in two: (1) is the answer faithful to retrieved context? (2) did we retrieve the right context? BERTScore answers neither question. New metrics needed.
2022
Faithfulness as a Metric Concept
Teams begin measuring whether generated answers are entailed by retrieved documents. Early approaches use NLI (natural language inference) models - slow, require fine-tuning, and show mediocre performance on real RAG outputs.
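A sketch of the NLI approach with a stock MNLI checkpoint from Hugging Face (the model name and label order below are specific to roberta-large-mnli; other checkpoints differ):

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis): the retrieved context is the
    premise, the generated answer is the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # roberta-large-mnli label order: contradiction, neutral, entailment
    return logits.softmax(dim=-1)[0, 2].item()

context = "The warranty covers manufacturing defects for 24 months."
print(entailment_prob(context, "The warranty lasts two years."))
```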
Era 3: LLM-as-Judge (2023 - present)
2023 Q1
GPT-4 as Evaluator Breakthrough
Teams discover that GPT-4 grading responses correlates better with human judgment than BERTScore or BLEU on most tasks. OpenAI publishes Evals framework. The LLM-as-judge paradigm is established.
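The core pattern fits in a dozen lines. A minimal judge against the current OpenAI Python client - the rubric and PASS/FAIL protocol here are our own illustration, not OpenAI's:

```python
# pip install openai   (expects OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer for faithfulness to a source.
Source: {context}
Answer: {answer}
Reply with exactly one word: PASS if every claim in the answer is supported
by the source, FAIL otherwise."""

def judge_faithfulness(context: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the grader as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(context=context, answer=answer),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```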
2023 Q3
RAGAS Framework Released
Shahul Es et al. release RAGAS - the first dedicated framework for RAG evaluation. Metrics: faithfulness, answer relevance, context precision, context recall. Uses an LLM to grade LLM outputs. Becomes the most cited RAG eval framework.
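A usage sketch following the classic RAGAS API - imports, metric names, and dataset column names have been reorganized in newer versions, so treat this as illustrative:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

data = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "answer": ["Manufacturing defects, for 24 months."],
    "contexts": [["The warranty covers manufacturing defects for 24 months."]],
    "ground_truth": ["Manufacturing defects for 24 months."],
})

# Every metric is itself scored by an LLM judge under the hood.
result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)
```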
2023 Q4
LangChain Removes Evaluation Module - Breaking Change
LangChain 1.x ships without the langchain.evaluation module. Teams relying on load_evaluator() discover the API is gone. LangChain points teams to RAGAS or DeepEval as external dependencies.
2024
Framework Divergence
LlamaIndex ships built-in FaithfulnessEvaluator, RelevancyEvaluator, BatchEvalRunner. SynapseKit adds EvalSnapshot and EvalRegression for regression tracking. LangChain remains evaluation-free by design.
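A sketch of the LlamaIndex side, assuming the llama_index.core import layout (paths have shifted across versions, so verify against your installed release):

```python
# pip install llama-index
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.llms.openai import OpenAI

evaluator = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))
result = evaluator.evaluate(
    query="What does the warranty cover?",
    response="Manufacturing defects, for 24 months.",
    contexts=["The warranty covers manufacturing defects for 24 months."],
)
print(result.passing, result.score)  # EvaluationResult: pass/fail plus a score
```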
2025 - 2026
Continuous Evaluation as Standard Practice
Teams run nightly eval pipelines, wire regression gates into CI/CD, and automate drift detection. Heuristic evaluators (word overlap) handle low-latency checks; LLM-as-judge handles high-stakes validation. The evaluation stack matures.
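As a concrete example of a regression gate, a sketch that fails the CI job when any metric drops more than a tolerance below a stored baseline - the file names and the 0.02 tolerance are illustrative choices, not a standard:

```python
# regression_gate.py - exit nonzero (blocking the merge) on eval regressions.
import json
import sys

TOLERANCE = 0.02  # allow a little judge noise before failing the build

def main() -> int:
    with open("baseline.json") as f:   # e.g. {"faithfulness": 0.91, ...}
        baseline = json.load(f)
    with open("current.json") as f:    # scores from tonight's eval run
        current = json.load(f)
    failed = False
    for metric, base in baseline.items():
        now = current.get(metric, 0.0)
        if now < base - TOLERANCE:
            print(f"REGRESSION {metric}: {now:.3f} < baseline {base:.3f}")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())
```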
www.engineersofai.com - AI Letters #32