01 Module 08: Agent Evaluation
Evaluation is the most underrated problem in agentic AI. Without it, you cannot improve, catch regressions, or build trust. This module covers trajectory scoring, benchmarks, LLM-as-judge, human evaluation, and production monitoring.

02 Challenges of Evaluating Agents
Why evaluating agentic systems is fundamentally harder than evaluating static models: the multi-path problem, compound errors, latent failures, and how to build an evaluation mindset.

03 Trajectory Evaluation
Evaluating the full action sequence, not just the final output: trajectory metrics, automatic scoring, and comparing agent versions.

04 GAIA Benchmark
GAIA tests general-purpose agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. Learn the task structure, scoring, SOTA analysis, and how to build GAIA-style evaluations.

05 SWE-bench Verified
SWE-bench Verified is the gold standard for evaluating coding agents on real GitHub issues. Learn the evaluation methodology, Docker harness, failure mode taxonomy, and how to interpret benchmark scores.

06 LLM as Agent Judge
Using LLMs to evaluate other agents' trajectories and outputs at scale: rubric design, pairwise comparison, bias mitigation, calibration, and escalation logic.

07 Human Evaluation for Agents
When and how to run human evaluation for agentic systems: annotator selection, rubric design, inter-annotator agreement, crowdsourcing quality control, and closing the feedback loop.

08 Production Agent Monitoring
Monitoring agents in production: task completion metrics, distributed tracing, anomaly detection, alerting, and the production improvement flywheel.