Module 06: LLM Evaluation

Evaluation is the discipline that separates research demos from production systems. A model that scores 90% on MMLU might still hallucinate on every third user query. A model with low perplexity on Wikipedia might still refuse to answer legitimate questions. Evaluation is not a checkbox; it is an ongoing engineering practice.

This module covers the full evaluation stack: fundamental metrics such as perplexity, generation quality metrics, human evaluation methodology, automated LLM judging, benchmark ecosystems, safety evaluation, RAG-specific metrics, and production monitoring. Each layer builds on the last.

Why Evaluation Is Hard

The core challenge is that language is high-dimensional and context-dependent. There is no single number that captures whether a model is "good." A model might excel at factual recall but fail at reasoning. It might be accurate but unsafe. It might perform well on benchmarks that leaked into its training data. Good evaluation requires a portfolio of complementary techniques.

Module Map

Lessons in This Module

| #  | Lesson | Core Concept | Key Tools |
| --- | --- | --- | --- |
| 01 | Perplexity and Language Model Metrics | How surprised is the model? | HuggingFace Transformers |
| 02 | BLEU, ROUGE, and Generation Metrics | Reference-based quality | evaluate, BERTScore |
| 03 | Human Evaluation | Ground truth from humans | Annotation platforms, Elo |
| 04 | LLM-as-Judge | Automated evaluation with LLMs | OpenAI, Anthropic, Prometheus |
| 05 | Benchmarks: MMLU, HumanEval, HELM | Standardized capability measurement | lm-eval-harness |
| 06 | Safety and Bias Evaluation | Harmful outputs, stereotypes, jailbreaks | ToxiGen, HarmBench |
| 07 | RAG Evaluation Metrics | Faithfulness, relevance, groundedness | RAGAS, TruLens |
| 08 | Production Monitoring for LLMs | Latency, drift, cost, safety | LangSmith, Langfuse |

Prerequisites

  • Familiarity with LLM architecture and training (Modules 01–05)
  • Basic statistics: probability distributions, expected value, variance
  • Python proficiency: working with HuggingFace and API clients

Key Concepts Glossary

Perplexity - The exponentiated average negative log-likelihood per token on a test set (equivalently, the inverse of the geometric mean of the token probabilities); lower is better.

BLEU - Bilingual Evaluation Understudy; n-gram precision metric originally designed for machine translation.

ROUGE - Recall-Oriented Understudy for Gisting Evaluation; recall-based metric for summarization.

BERTScore - Semantic similarity metric that matches candidate and reference tokens by the cosine similarity of their contextual embeddings from a BERT-style model.

LLM-as-Judge - Using a capable LLM (e.g., GPT-4) to score outputs of other models.

MT-Bench - Multi-turn benchmark for evaluating LLM conversation quality using GPT-4 as judge.

MMLU - Massive Multitask Language Understanding; 57-subject multiple-choice benchmark.

HumanEval - 164 Python programming problems evaluated by running unit tests.

HELM - Holistic Evaluation of Language Models; multi-metric evaluation framework from Stanford.

RAGAS - RAG Assessment framework for evaluating faithfulness, relevance, and context metrics.

RAG Triad - Answer Relevance + Context Relevance + Groundedness (TruLens framework).

Elo Rating - Chess-derived ranking system (named after Arpad Elo, so not an acronym) applied to pairwise human preference comparisons between models.

Inter-Annotator Agreement (IAA) - Statistical measure of how consistently human annotators label the same data, corrected for chance agreement (e.g., Cohen's kappa for two annotators).

Benchmark Contamination - When test data appears in training data, inflating benchmark scores.

Red Teaming - Adversarial evaluation where humans or automated systems attempt to elicit harmful outputs.

TTFT - Time to First Token; key latency metric for streaming LLM responses.
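
Three of the glossary entries above (Perplexity, Inter-Annotator Agreement, and Elo Rating) reduce to short formulas, and seeing them as code may make the definitions more concrete. The following is a minimal sketch in plain Python, not the implementation used by any library named in this module. The function names are illustrative, and the Elo constants (a K-factor of 32 and a 400-point scale) are conventional chess defaults assumed here rather than values taken from the lessons.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Exponentiated mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both annotators always use the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return both ratings after one comparison.

    score_a is 1 if A was preferred, 0 if B was preferred, 0.5 for a tie.
    k and the 400-point scale are assumed defaults, not module-specified.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

As a sanity check, a model that assigns probability 0.25 to every observed token has a perplexity of about 4, which matches the intuition of choosing uniformly among four options at each step. Two annotators who agree on 3 of 4 items with a chance agreement of 0.5 get a kappa of 0.5, and a win between two equally rated models moves each rating by half of k.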

How to Use This Module

Each lesson can be read on its own, though later lessons assume the ideas introduced in earlier ones. If you are building a RAG system today, jump to Lesson 07. If you are deploying a model to production, Lesson 08 is critical. If you are hiring ML engineers or interviewing for such a role, the interview sections in every lesson cover the questions that actually get asked.

The code examples use real libraries: evaluate, ragas, langsmith, langchain, and the anthropic SDK. All examples are designed to run with minimal setup.

© 2026 EngineersOfAI. All rights reserved.