Master reference-based generation metrics - BLEU, ROUGE, BERTScore, BLEURT - and know exactly when each one lies to you.

BLEU, ROUGE, and Generation Metrics

The Summarization Disaster

Your team deploys a text summarization model. The ROUGE-L score on the validation set is 0.42 - solid, well above the 0.35 baseline. The model ships. A week later, your head of product walks over with a printout. The model is producing summaries that miss the most important facts, include irrelevant details, and occasionally contradict the source text. But every single one of them has high lexical overlap with the reference summaries in your evaluation set.

The model had learned to produce summaries that looked like the reference summaries linguistically - same word choices, same sentence length patterns - without actually understanding what was important to include. It had hacked the metric.

BLEU did the same thing to the machine translation field for two decades. Models optimized for BLEU learned to produce translations that scored well against reference translations without actually producing natural, accurate target-language text. It took the field years to realize that the metric had become a target rather than a measure.

This lesson is about why reference-based metrics exist, what they actually measure, what they miss, and when you should trust them. Understanding these metrics deeply - including their failure modes - is essential for anyone building systems that generate text.

Why This Exists - The Reference-Based Evaluation Problem

When you train a classifier, evaluation is straightforward: predicted class vs. ground truth class. Either correct or not.

Text generation is different. There is no single correct output. The question "What is the capital of France?" has one answer, but "Summarize this article" has thousands of equally valid summaries. "Translate this to French" has multiple correct translations with different stylistic choices, word orders, and formality levels.

Before BLEU, machine translation researchers evaluated translations by paying bilingual human experts. This was expensive, slow, inconsistent, and did not scale to the thousands of comparisons needed during model development. The field needed an automatic metric that correlated with human judgment.

BLEU (Bilingual Evaluation Understudy) was the answer. It was never meant to be a perfect measure of translation quality - it was meant to be a fast, cheap, automatic proxy that correlated well enough with human judgment that you could use it for development. That distinction - proxy metric vs ground truth - got lost over time, and the metric became conflated with the thing it was supposed to approximate.

Historical Context

| Year | Milestone |
| --- | --- |
| 2002 | BLEU paper: Papineni et al., IBM (ACL 2002) - the original machine translation metric |
| 2004 | ROUGE: Chin-Yew Lin, USC/ISI - BLEU's recall-oriented cousin for summarization |
| 2005 | METEOR: Banerjee and Lavie - adds stemming, synonyms, better correlation with humans |
| 2015 | Word Mover's Distance: Kusner et al. - embedding-based distance |
| 2019 | BERTScore: Zhang et al. - BERT embeddings for semantic similarity |
| 2020 | BLEURT: Sellam et al. - learned regression model on top of BERT |
| 2020 | COMET: Rei et al. - learned metric specifically for MT, trained on human judgments |
| 2022 | UniEval: multi-dimensional evaluation using question-answering formulation |

BLEU: Bilingual Evaluation Understudy

The Core Idea

BLEU measures n-gram precision: what fraction of n-grams in the model output appear in the reference translation? The intuition is that a good translation should share many word sequences with a human reference.

For a candidate translation $C$ and reference translation(s) $R$ :

Unigram precision (BLEU-1):

$P_1 = \frac{\text{count of unigrams in } C \text{ that appear in } R}{\text{total unigrams in } C}$

n-gram precision (BLEU-N):

$P_n = \frac{\sum_{\text{n-gram} \in C} \text{Count}_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in C} \text{Count}(\text{n-gram})}$

where $\text{Count}_{clip}$ caps the count at the maximum number of times that n-gram appears in any single reference.
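A minimal sketch of the clipped precision defined above may make the clipping step concrete. It assumes whitespace-tokenized input; the helper names `ngrams` and `clipped_precision` are illustrative, not part of any library.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> List[tuple]:
    """All contiguous n-grams of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Modified n-gram precision P_n for one candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate has fewer than n tokens

    # For each n-gram, the largest count observed in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Each candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Degenerate candidate that repeats one frequent word.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p1_clipped = clipped_precision(candidate, references, 1)  # 2 clipped matches out of 7
```

Without clipping, every "the" in the candidate would count as a match and unigram precision would be 7/7; clipping caps the credit at the two occurrences in the first reference.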

The Brevity Penalty

A model can game n-gram precision by outputting a single word that is almost certain to appear in the reference, such as "the": one unigram, one match, precision 1.0. To prevent this, BLEU applies a brevity penalty (BP):

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}$

where $c$ is candidate length and $r$ is reference length (or the closest reference length in a multi-reference setting).
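The penalty is easy to check numerically. The sketch below follows the formula above; returning 0.0 for an empty candidate and breaking ties toward the shorter reference are assumptions made here for illustration, and both function names are hypothetical.

```python
import math
from typing import List


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0  # assumption: an empty candidate gets no credit
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def closest_reference_length(candidate_len: int, reference_lens: List[int]) -> int:
    """Reference length closest to the candidate; ties go to the shorter reference (assumption)."""
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))


half_length_bp = brevity_penalty(5, 10)   # exp(1 - 10/5) = exp(-1)
longer_bp = brevity_penalty(12, 10)       # no penalty when the candidate is longer
```

A candidate half as long as the reference keeps only about 37% of its precision-based score, while a candidate longer than the reference is not penalized at all by this term.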

Full BLEU Score

$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$

In practice, BLEU-4 uses $N=4$ with uniform weights $w_n = 1/4$ :

$BLEU\text{-}4 = BP \cdot \exp\left(\frac{1}{4}\sum_{n=1}^{4} \log P_n\right)$
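As a quick arithmetic check of the formula, the sketch below combines precomputed precisions and a brevity penalty into a BLEU score. The input values are made up for illustration, and `bleu_from_parts` is a hypothetical helper name.

```python
import math
from typing import Sequence


def bleu_from_parts(precisions: Sequence[float], bp: float) -> float:
    """BP * exp(sum_n w_n * log P_n) with uniform weights w_n = 1/N."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to zero
    n = len(precisions)
    log_mean = sum(math.log(p) for p in precisions) / n
    return bp * math.exp(log_mean)


# Illustrative (made-up) precisions P_1..P_4 and brevity penalty.
score = bleu_from_parts([0.8, 0.6, 0.4, 0.3], bp=0.9)
zeroed = bleu_from_parts([0.8, 0.6, 0.0, 0.0], bp=1.0)
```

Because the precisions are combined as a geometric mean, a single zero precision drives the whole score to zero. This is why sentence-level BLEU on short texts usually needs smoothing.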

BLEU in Practice

Typical BLEU-4 score ranges, as rough rules of thumb on the 0–100 scale (SacreBLEU reports 0–100; NLTK and HuggingFace evaluate report the same quantity on 0–1):

less than 10: poor translation, only basic phrases match
10–19: hard to follow translation
20–29: understandable but significant errors
30–40: good quality translation
40–50: high quality, approaching professional human translation
50+: very high quality (rare, may indicate test set issues)

:::note Multi-Reference BLEU
BLEU works better with multiple reference translations. Each candidate n-gram is clipped against the maximum count across all references. Always use all available references when evaluating.
:::

ROUGE: Recall-Oriented Understudy for Gisting Evaluation

Why ROUGE, Not BLEU, for Summarization

BLEU measures precision: how much of the output is relevant? For translation, this makes sense - you want every output word to be correct.

For summarization, recall matters more: does the output cover the important information captured in the reference summary? A short summary that covers only part of that content can still have high n-gram precision while missing key content.

ROUGE flips the ratio: instead of "how much of the output appears in the reference," it asks "how much of the reference appears in the output."

ROUGE-N

$ROUGE\text{-}N = \frac{\sum_{\text{n-gram} \in \text{Reference}} \text{Count}_{match}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{Reference}} \text{Count}(\text{n-gram})}$

ROUGE-1 (unigram recall) and ROUGE-2 (bigram recall) are the most commonly reported variants.
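A small sketch of the recall form of ROUGE-N, assuming whitespace tokenization and a single reference. The function name `rouge_n_recall` is illustrative and not taken from a library.

```python
from collections import Counter
from typing import List


def rouge_n_recall(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of reference n-grams that also occur in the candidate (with count clipping)."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # reference has fewer than n tokens
    matched = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return matched / total


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
rouge1 = rouge_n_recall(candidate, reference, 1)  # 5 of 6 reference unigrams matched
rouge2 = rouge_n_recall(candidate, reference, 2)  # 3 of 5 reference bigrams matched
```

The denominator counts reference n-grams, so omitting reference content lowers the score, while adding extra words to the candidate does not.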

ROUGE-L

ROUGE-L uses the Longest Common Subsequence (LCS) between the candidate and reference. LCS does not require consecutive matches - it finds the longest sequence of words that appear in the same order in both texts, even with gaps.

$ROUGE\text{-}L = \frac{LCS(C, R)}{|R|}$

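ROUGE-L is often preferred because it captures sentence-level structure and is more flexible than strict n-gram matching. The form above is the recall variant; most implementations, including the rouge_score package used later in this lesson, also report an LCS-based precision and an F-measure.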
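The LCS itself is a standard dynamic-programming computation. The sketch below implements the recall form of ROUGE-L under whitespace tokenization; the function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)          # extend the subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry the best so far
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: List[str], reference: List[str]) -> float:
    """LCS(C, R) / |R|."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


reference = "police killed the gunman".split()
in_order = rouge_l_recall("police kill the gunman".split(), reference)   # LCS = 3
scrambled = rouge_l_recall("the gunman kill police".split(), reference)  # LCS = 2
```

Both candidates share the same three unigrams with the reference, but only the first preserves their order, which is the distinction ROUGE-L is meant to capture.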

Practical ROUGE Values (Summarization)

For CNN/DailyMail news summarization:

Extractive baseline (lead-3 sentences): ROUGE-1 ~40, ROUGE-2 ~17, ROUGE-L ~36
BART fine-tuned: ROUGE-1 ~44, ROUGE-2 ~21, ROUGE-L ~40
T5-large fine-tuned: ROUGE-1 ~43, ROUGE-2 ~21, ROUGE-L ~40

These numbers vary by tokenization and normalization - always use the same evaluation script for fair comparison.

METEOR: The More Principled Alternative

METEOR (Metric for Evaluation of Translation with Explicit ORdering) addresses several BLEU limitations:

Stemming: "running" matches "run" in METEOR, not in BLEU
Synonym matching: uses WordNet to match synonyms
Paraphrase matching: extends to paraphrase tables
F-score: recall-weighted harmonic mean of unigram precision and recall (not just precision)

$METEOR = F_{mean} \cdot (1 - Penalty)$

where $Penalty = 0.5 \cdot \left(\frac{\text{chunks}}{\text{unigram matches}}\right)^3$

Chunks are contiguous matched unigrams. The penalty increases if matches are scattered (not in order).
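To see how the fragmentation penalty behaves, the sketch below scores a given unigram alignment. It assumes the recall-weighted mean from the original 2005 formulation, $F_{mean} = 10PR / (R + 9P)$ (later METEOR versions tune these weights), and it takes the alignment as input rather than computing it; both function names are hypothetical.

```python
from typing import List, Tuple


def count_chunks(alignment: List[Tuple[int, int]]) -> int:
    """Number of maximal runs of matches adjacent in both candidate and reference."""
    chunks = 0
    prev = None
    for cand_i, ref_i in sorted(alignment):
        if prev is None or cand_i != prev[0] + 1 or ref_i != prev[1] + 1:
            chunks += 1  # a new chunk starts whenever adjacency breaks
        prev = (cand_i, ref_i)
    return chunks


def meteor_from_alignment(alignment: List[Tuple[int, int]], cand_len: int, ref_len: int) -> float:
    """F_mean * (1 - Penalty) for a precomputed (candidate index, reference index) alignment."""
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # assumed 2005 weighting
    penalty = 0.5 * (count_chunks(alignment) / matches) ** 3
    return f_mean * (1 - penalty)


# candidate: "on the mat sat the cat"   reference: "the cat sat on the mat"
alignment = [(0, 3), (1, 4), (2, 5), (3, 2), (4, 0), (5, 1)]
chunks = count_chunks(alignment)                     # three chunks
score = meteor_from_alignment(alignment, 6, 6)       # 1.0 * (1 - 0.5 * (3/6)**3)
```

Every word matches, so precision and recall are both 1.0, yet the reordering splits the matches into three chunks and the penalty lowers the score.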

METEOR correlates better with human judgment than BLEU in most studies, especially for languages with rich morphology. However, it is slower and requires external resources (WordNet, paraphrase tables).

BERTScore: Semantic Similarity with Embeddings

The Semantic Gap

All n-gram-based metrics share a fundamental flaw: they measure lexical overlap, not semantic similarity. "The vehicle stopped" shares only the word "The" with "The car halted" - yet the two sentences mean the same thing. A perfect translation might use different but equivalent vocabulary and score badly on BLEU.

BERTScore (Zhang et al., 2019) solves this by using contextual embeddings from BERT to measure semantic similarity between candidate and reference tokens.

How BERTScore Works

Given candidate tokens $\hat{x}_1, ..., \hat{x}_k$ and reference tokens $x_1, ..., x_m$ , pass both through BERT to get contextual embeddings. Then:

Recall (reference coverage):

$R_{BERT} = \frac{1}{|x|} \sum_{x_j \in x} \max_{\hat{x}_i \in \hat{x}} \mathbf{x}_j^T \hat{\mathbf{x}}_i$

Precision (candidate coverage):

$P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_i \in \hat{x}} \max_{x_j \in x} \hat{\mathbf{x}}_i^T \mathbf{x}_j$

F1:

$F_{BERT} = 2 \cdot \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}$

Each candidate token is matched to the most similar reference token (by cosine similarity), and vice versa. This allows flexible matching even when word choice differs.
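The greedy matching can be sketched without loading a model. In the example below, the 2-dimensional vectors are made-up stand-ins for contextual embeddings, and `greedy_bertscore` is an illustrative name rather than the bert_score API.

```python
import math
from typing import List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def greedy_bertscore(cand_vecs: List[Vector], ref_vecs: List[Vector]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from greedy best-match cosine similarity."""
    cand = [normalize(v) for v in cand_vecs]
    ref = [normalize(v) for v in ref_vecs]
    # sim[i][j] = cosine similarity between candidate token i and reference token j
    sim = [[sum(a * b for a, b in zip(c, r)) for r in ref] for c in cand]
    precision = sum(max(row) for row in sim) / len(cand)
    recall = sum(max(sim[i][j] for i in range(len(cand))) for j in range(len(ref))) / len(ref)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy embeddings (made up): two candidate tokens, three reference tokens.
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f1 = greedy_bertscore(cand_vecs, ref_vecs)
```

Here every candidate token has an exact counterpart in the reference, so precision is 1.0, while the middle reference token is only partially covered, which pulls recall below 1.0.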

BERTScore in Practice

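Raw BERTScore values occupy a narrow band, roughly [0.8, 1.0] for reasonable text, because contextual embeddings of even unrelated tokens tend to have high cosine similarity. The bert_score package can rescale scores against a precomputed baseline (rescale_with_baseline=True) to spread them over a more readable range. Always compare models on the same dataset, the same underlying model, and the same layer - do not compare absolute BERTScore values across different papers without verifying methodology.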

BERTScore correlates better with human judgment than BLEU or ROUGE in most benchmarks. For summarization on CNN/DailyMail, BERTScore-F1 has Pearson correlation ~0.45 with human ratings, vs ROUGE-1's ~0.35.

BLEURT: Learned Evaluation

BLEURT (Bilingual Evaluation Understudy with Representations from Transformers, Sellam et al. 2020) takes the learned metric approach: train a regression model on top of BERT, using human ratings as supervision.

Training procedure:

Pre-train on synthetic perturbations (swapping, dropping, inserting words)
Fine-tune on WMT human evaluation data (human-rated translation quality scores)

BLEURT achieves higher correlation with human judgment than BERTScore and BLEU, but has several practical limitations: it requires training data in the target language, it can fail to generalize to new domains, and the trained model is a black box.

The Metric Decision Table

| Task | Primary Metric | Secondary Metric | When to Use Human Eval |
| --- | --- | --- | --- |
| Machine Translation | BLEU-4 (+ COMET) | METEOR, chrF | New language pair, high stakes |
| News Summarization | ROUGE-1/2/L | BERTScore | Safety-critical summaries |
| Open-ended Generation | BERTScore | Human rating | Always for production |
| Code Generation | Execution accuracy | BLEU (secondary) | Complex tasks |
| Question Answering | Exact match / F1 | BERTScore | When paraphrase is valid |
| Dialogue | None reliable | BERTScore | Always for production |
| RAG Responses | Faithfulness | BERTScore | Always for production |

The Correlation Problem

The fundamental question: how well do these automatic metrics actually correlate with human judgment?

Meta-evaluation studies (evaluating the evaluators) show:

BLEU-4 Pearson correlation with human MT quality: ~0.40–0.60 depending on language pair
ROUGE-1 with human summarization quality: ~0.30–0.45
BERTScore with human MT quality: ~0.50–0.65
BLEURT with human MT quality: ~0.60–0.70
COMET (learned, MT-specific): ~0.75+

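Even COMET - the best automatic MT metric in this list - leaves a large share of variance unexplained: a correlation of 0.75 corresponds to r² ≈ 0.56, so roughly 44% of the variance in human scores is not captured. For open-ended generation, correlations drop further. For safety-critical applications, none of these metrics are sufficient.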
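A meta-evaluation of this kind reduces to correlating metric scores with human ratings over the same outputs. The sketch below computes Pearson's r in plain Python; the score lists are made-up illustration data, not results from any study.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: automatic metric scores and human ratings for five outputs.
metric_scores = [0.21, 0.35, 0.40, 0.52, 0.60]
human_ratings = [2.0, 3.5, 3.0, 4.5, 4.0]

r = pearson(metric_scores, human_ratings)
variance_explained = r ** 2           # share of variance captured by a linear fit
variance_unexplained = 1 - r ** 2     # what the metric does not account for
```

Reading correlations through r² is a useful habit: the share of variance explained is the square of the correlation, so it is always smaller than the correlation itself suggests.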

Code: Computing All Metrics

# Install dependencies:
# pip install evaluate bert-score rouge-score nltk

from typing import List

import numpy as np
from bert_score import score as bert_score
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu
from rouge_score import rouge_scorer

class GenerationEvaluator:
    """
    Comprehensive generation quality evaluator.
    Computes BLEU, ROUGE, and BERTScore for candidate/reference pairs.
    """

    def __init__(self, bert_model: str = "microsoft/deberta-xlarge-mnli"):
        self.rouge = rouge_scorer.RougeScorer(
            ["rouge1", "rouge2", "rougeL"],
            use_stemmer=True
        )
        self.bert_model = bert_model

    def compute_bleu(
        self,
        candidates: List[str],
        references: List[List[str]],
    ) -> dict:
        """
        Compute corpus-level BLEU scores (BLEU-1 through BLEU-4).

        Args:
            candidates: List of generated texts
            references: List of reference lists (each candidate can have multiple references)

        Returns:
            Dictionary with BLEU-1 through BLEU-4 scores
        """
        # Tokenize
        tokenized_candidates = [c.split() for c in candidates]
        tokenized_references = [[r.split() for r in refs] for refs in references]

        smoothing = SmoothingFunction().method1  # Handles zero n-gram counts

        # Corpus BLEU with different n-gram weights
        bleu_scores = {}
        for n in range(1, 5):
            weights = [1/n] * n + [0] * (4 - n)
            bleu_scores[f"bleu_{n}"] = corpus_bleu(
                tokenized_references,
                tokenized_candidates,
                weights=weights,
                smoothing_function=smoothing,
            )

        return bleu_scores

    def compute_rouge(
        self,
        candidates: List[str],
        references: List[str],
    ) -> dict:
        """
        Compute ROUGE F1 per example and average over the corpus.
        """
        rouge1_scores = []
        rouge2_scores = []
        rougeL_scores = []

        for cand, ref in zip(candidates, references):
            scores = self.rouge.score(ref, cand)
            rouge1_scores.append(scores["rouge1"].fmeasure)
            rouge2_scores.append(scores["rouge2"].fmeasure)
            rougeL_scores.append(scores["rougeL"].fmeasure)

        return {
            "rouge1": np.mean(rouge1_scores),
            "rouge2": np.mean(rouge2_scores),
            "rougeL": np.mean(rougeL_scores),
            "rouge1_std": np.std(rouge1_scores),
        }

    def compute_bertscore(
        self,
        candidates: List[str],
        references: List[str],
        lang: str = "en",
    ) -> dict:
        """
        Compute BERTScore precision, recall, and F1.
        """
        P, R, F1 = bert_score(
            candidates,
            references,
            lang=lang,
            model_type=self.bert_model,
            verbose=False,
        )

        return {
            "bertscore_precision": P.mean().item(),
            "bertscore_recall": R.mean().item(),
            "bertscore_f1": F1.mean().item(),
        }

    def evaluate_all(
        self,
        candidates: List[str],
        references: List[str],
    ) -> dict:
        """
        Run all metrics and return a combined report.
        """
        ref_lists = [[r] for r in references]  # Single reference per candidate

        bleu = self.compute_bleu(candidates, ref_lists)
        rouge = self.compute_rouge(candidates, references)
        bert = self.compute_bertscore(candidates, references)

        return {**bleu, **rouge, **bert}


# Example: evaluate a summarization model
def run_summarization_evaluation():
    candidates = [
        "The president signed the climate bill into law on Monday.",
        "Scientists discovered a new species of deep sea fish.",
        "Tech stocks fell sharply after the Federal Reserve announcement.",
    ]

    references = [
        "President Biden signed the landmark climate legislation Monday afternoon.",
        "Marine biologists have identified a previously unknown deep-sea fish species.",
        "Technology stocks dropped significantly following the Fed's interest rate decision.",
    ]

    evaluator = GenerationEvaluator()

    print("Running evaluation...")
    results = evaluator.evaluate_all(candidates, references)

    print("\n=== Summarization Evaluation Results ===")
    print(f"BLEU-1:           {results['bleu_1']:.4f}")
    print(f"BLEU-2:           {results['bleu_2']:.4f}")
    print(f"BLEU-4:           {results['bleu_4']:.4f}")
    print(f"ROUGE-1 F1:       {results['rouge1']:.4f}")
    print(f"ROUGE-2 F1:       {results['rouge2']:.4f}")
    print(f"ROUGE-L F1:       {results['rougeL']:.4f}")
    print(f"BERTScore P:      {results['bertscore_precision']:.4f}")
    print(f"BERTScore R:      {results['bertscore_recall']:.4f}")
    print(f"BERTScore F1:     {results['bertscore_f1']:.4f}")

    return results


# Compare two models
def compare_models(
    model_a_outputs: List[str],
    model_b_outputs: List[str],
    references: List[str],
) -> None:
    """
    Side-by-side metric comparison for two models.
    """
    evaluator = GenerationEvaluator()

    results_a = evaluator.evaluate_all(model_a_outputs, references)
    results_b = evaluator.evaluate_all(model_b_outputs, references)

    print(f"\n{'Metric':<25} {'Model A':>10} {'Model B':>10} {'Delta':>10}")
    print("-" * 55)

    for key in ["bleu_4", "rouge1", "rouge2", "rougeL", "bertscore_f1"]:
        a_val = results_a[key]
        b_val = results_b[key]
        delta = b_val - a_val
        # The "+" flag prints an explicit sign while keeping the column right-aligned
        print(f"{key:<25} {a_val:>10.4f} {b_val:>10.4f} {delta:>+10.4f}")

Using HuggingFace Evaluate

import evaluate

def quick_evaluate(candidates: list, references: list):
    """
    Quick evaluation using HuggingFace evaluate library.
    Handles tokenization and edge cases automatically.
    """
    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")
    meteor = evaluate.load("meteor")

    # BLEU expects list of lists for references
    bleu_result = bleu.compute(
        predictions=candidates,
        references=[[r] for r in references],
    )

    rouge_result = rouge.compute(
        predictions=candidates,
        references=references,
        use_stemmer=True,
    )

    bertscore_result = bertscore.compute(
        predictions=candidates,
        references=references,
        lang="en",
    )

    meteor_result = meteor.compute(
        predictions=candidates,
        references=references,
    )

    return {
        "bleu": bleu_result["bleu"],
        "rouge1": rouge_result["rouge1"],
        "rouge2": rouge_result["rouge2"],
        "rougeL": rouge_result["rougeL"],
        "bertscore_f1": sum(bertscore_result["f1"]) / len(bertscore_result["f1"]),
        "meteor": meteor_result["meteor"],
    }

Production Engineering Notes

Reference Quality Is Everything

All reference-based metrics are only as good as your references. Low-quality references - written by non-native speakers, truncated, or stylistically inconsistent - will give you meaningless metric values.

Invest in reference quality:

Have multiple annotators write references independently
Measure inter-annotator agreement among references
Include diverse references covering different valid paraphrases
For domain-specific tasks, use domain experts

Multiple References Dramatically Improve BLEU

Standard BLEU with a single reference underestimates translation quality because any valid translation that differs in word choice gets penalized. Use multiple references wherever possible:

import evaluate

bleu = evaluate.load("bleu")

# Single reference - penalizes valid paraphrases
single_ref_score = bleu.compute(
    predictions=["The car stopped at the traffic light"],
    references=[["The vehicle halted at the signal"]],
)

# Multiple references - more fair evaluation
multi_ref_score = bleu.compute(
    predictions=["The car stopped at the traffic light"],
    references=[[
        "The vehicle halted at the signal",
        "The car stopped at the red light",
        "The automobile came to a halt at the traffic light",
    ]],
)
# multi_ref_score["bleu"] will be significantly higher: the extra references
# cover more of the candidate's valid n-grams

BLEU Normalization Matters

Different implementations of BLEU can give significantly different numbers because of tokenization and normalization choices. Always use the same implementation script when comparing across models or papers. The standard is SacreBLEU:

# pip install sacrebleu
import sacrebleu

def compute_sacrebleu(hypotheses: list, references: list) -> float:
    """
    SacreBLEU: standardized, reproducible BLEU computation.
    Handles tokenization consistently - use this for fair comparison.
    """
    # References must be a list of lists (one list per reference set)
    refs = [references]  # One reference per hypothesis

    result = sacrebleu.corpus_bleu(hypotheses, refs)

    print(f"BLEU:  {result.score:.2f}")
    print(f"BP:    {result.bp:.4f}")
    print(f"Ratio: {result.sys_len}/{result.ref_len}")
    print(f"N-grams: {result.counts}")

    return result.score

:::danger BLEU Is Not for Dialogue or Open-Ended Generation
BLEU was designed for machine translation with clear reference translations. Using it to evaluate chatbot responses, creative writing, or open-ended QA is methodologically wrong. The correlation with human judgment drops to near-zero for these tasks. Use BERTScore or human evaluation instead.
:::

:::warning The Short Hypothesis Trap
BLEU's brevity penalty is not aggressive enough to prevent short-hypothesis gaming. A system that outputs one-sentence responses when five-sentence responses are needed will have high BLEU precision but poor recall. Always report BLEU alongside output length statistics.
:::
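One lightweight way to report length statistics alongside BLEU is sketched below. It assumes whitespace tokenization and one reference per candidate; `length_stats` is a hypothetical helper, not a library function.

```python
from typing import Dict, List


def length_stats(candidates: List[str], references: List[str]) -> Dict[str, float]:
    """Mean token lengths and the corpus-level candidate/reference length ratio."""
    cand_lens = [len(c.split()) for c in candidates]
    ref_lens = [len(r.split()) for r in references]
    return {
        "mean_candidate_len": sum(cand_lens) / len(cand_lens),
        "mean_reference_len": sum(ref_lens) / len(ref_lens),
        # A ratio well below 1.0 signals systematically short outputs.
        "length_ratio": sum(cand_lens) / sum(ref_lens),
    }


stats = length_stats(
    candidates=["the cat sat", "a dog"],
    references=["the cat sat on the mat", "a dog barked loudly"],
)
```

A length ratio far from 1.0 is a cue to inspect outputs directly before trusting a precision-oriented score.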

Common Mistakes

:::danger Comparing BLEU Scores Across Papers
BLEU scores are highly sensitive to tokenization, case normalization, and evaluation script. Two papers reporting BLEU-4 on the same dataset can give different numbers simply due to implementation differences. Always recompute using the same script (SacreBLEU is the standard).
:::

:::warning Using ROUGE as the Only Summarization Metric
ROUGE measures word overlap, not factual accuracy. A summary can have high ROUGE and be factually wrong. Always complement ROUGE with a faithfulness metric (see Module 07 on RAG evaluation) or human evaluation.
:::

:::danger Treating Metric Improvements as Real User Improvements
A 0.5 ROUGE point improvement on CNN/DailyMail does not necessarily mean users will notice the difference. Below ~2 ROUGE points, improvements are often imperceptible to humans. Always back up metric improvements with human evaluation before making product claims.
:::

Interview Q&A

Q1: Why did BLEU become the dominant machine translation metric despite its flaws?

BLEU, introduced in 2002, became dominant because it was one of the first automatic metrics shown to correlate well with human judgment across multiple language pairs - something the field had lacked until then. It was also fast, deterministic, and required no additional resources beyond the reference translations. The timing mattered: BLEU arrived just as statistical MT was taking off, and the research community needed a scalable evaluation signal for rapid model iteration. By the time the community understood its limitations (poor correlation with human judgment for non-MT tasks, gaming susceptibility, no semantic understanding), it had become the standard reporting convention. Changing standards in ML research is extremely difficult even when the community knows the standard is imperfect.

Q2: Explain the difference between BLEU and ROUGE. When would you use each?

BLEU measures precision: what fraction of n-grams in the generated text appear in the reference. ROUGE measures recall: what fraction of n-grams in the reference appear in the generated text. This difference matters because precision and recall optimize for different errors. For machine translation, you want everything in the output to be correct (precision focus) - a few wrong words are acceptable but outputting garbage is not. For summarization, you want to cover all the key information from the source (recall focus) - missing important facts is the primary failure mode. Use BLEU for MT and code generation where output correctness is paramount; use ROUGE for summarization and extraction tasks where coverage is paramount.

Q3: What is BERTScore and why is it better than BLEU for open-ended generation?

BERTScore computes semantic similarity between generated and reference text using contextual embeddings from BERT. Each token in the candidate is matched to the most semantically similar token in the reference (using cosine similarity of BERT embeddings), and precision, recall, and F1 are computed over these best-match similarities. This is better than BLEU for open-ended generation because: (1) it captures semantic equivalence that lexical overlap misses - "car" and "vehicle" are semantically similar but count as zero BLEU overlap; (2) BERT's contextual embeddings capture meaning in context, not just individual word meanings; (3) it correlates significantly better with human judgment across diverse generation tasks. The tradeoff is cost: BERTScore requires a forward pass through BERT for every evaluation, making it 10–100x slower than BLEU.

Q4: A new summarization model improves ROUGE-1 by 0.3 points but a human evaluator says it is worse. How do you investigate?

Start by checking what changed. Ask the human evaluator to explain why they think it is worse - common reasons include: (1) the model changed writing style to match reference style but now misses key facts (factual recall issue); (2) it uses more formal/complex language that overlaps better with formal references but is harder to read; (3) it is longer, which increases recall mechanically without adding value. Then design a targeted evaluation: (1) check factual accuracy - use a QA-based metric or manually verify key claims; (2) check coherence - does the summary flow logically; (3) check compression - is the summary concise or verbose. The 0.3 point ROUGE improvement is likely coming from one of these hacks. The fix is to add additional metrics (faithfulness, coherence, compression ratio) alongside ROUGE.

Q5: How do learned metrics like BLEURT differ from BERTScore?

BERTScore is unsupervised - it uses pre-trained BERT embeddings without any fine-tuning on translation quality data. It measures geometric similarity in embedding space, which correlates with semantic similarity but was not optimized to predict human translation quality scores.

BLEURT is supervised - it trains a regression model on top of BERT, using human translation quality ratings from WMT shared tasks as labels. This gives it a fundamentally different failure mode: while BERTScore can fail when BERT embeddings don't capture relevant semantic dimensions, BLEURT can fail by overfitting to the biases of human raters in the WMT training data. BLEURT typically achieves higher correlation with human judgment on standard MT benchmarks, but it may generalize poorly to new domains or languages not represented in its training data.

Q6: How would you choose evaluation metrics for a new product feature - a model that helps users rephrase their emails?

For email rephrasing, standard metrics are unreliable because: (1) there is no single correct rephrasing - many valid paraphrases exist; (2) the metric needs to capture meaning preservation AND stylistic transformation simultaneously. My approach: (1) Meaning preservation - use BERTScore between input and output as a rough proxy for semantic similarity; flag cases where it drops below a threshold. (2) Style compliance - if the user specifies "formal" or "concise," use a separate classifier trained to detect formality/conciseness. (3) Grammar and fluency - use a language model perplexity check or a grammar model. (4) Human evaluation - for the initial launch, run A/B testing with user feedback (thumbs up/down on output quality). (5) Long-term - collect implicit feedback signals (did the user use the suggestion as-is, edit it heavily, or discard it) and use those to build a dataset for a learned metric.
