01. Module 06: LLM Evaluation
    A complete guide to evaluating large language models - from perplexity to production monitoring.
02. Perplexity and Language Model Metrics
    Understand perplexity, cross-entropy, bits per byte, and when intrinsic metrics mislead you about model quality.
03. BLEU, ROUGE, and Generation Metrics
    Master reference-based generation metrics - BLEU, ROUGE, BERTScore, BLEURT - and know exactly when each one lies to you.
04. Human Evaluation
    Design rigorous human evaluation studies for LLMs - from annotation protocols to inter-annotator agreement to Chatbot Arena methodology.
05. LLM-as-Judge
    Use powerful LLMs to automatically evaluate other models - with position bias mitigation, CoT judging, and cost analysis.
06. Benchmarks: MMLU, HumanEval, and HELM
    Navigate the LLM benchmark ecosystem - what each benchmark actually measures, saturation, contamination, and how to build benchmarks that can't be gamed.
07. Safety and Bias Evaluation
    Evaluate LLMs for harmful outputs, social bias, hallucination, and jailbreak vulnerability - including red teaming methodology and production monitoring.
08. RAG Evaluation Metrics
    Evaluate RAG systems with precision - the RAG triad, RAGAS framework, golden datasets, and retrieval metrics for production pipelines.
09. Production Monitoring for LLMs
    Build a comprehensive production monitoring stack for LLMs - latency, cost, quality drift, safety, and observability platforms compared.