Module 16 - Alignment and Safety

Training a language model to be capable is, by comparison, a well-understood problem. Training it to be safe, helpful, and aligned with human values remains one of the hardest open problems in AI today.

This module covers everything you need to know about making AI systems behave the way we actually want - from the foundational alignment problem, through the major training techniques (RLHF, Constitutional AI, DPO), to adversarial testing and the emerging global regulatory landscape.

Lessons in This Module

#  | Lesson | Core Concept
01 | The Alignment Problem | Why specifying what we want is harder than training what we specify
02 | RLHF Deep Dive | Three-phase pipeline: SFT → Reward Model → PPO fine-tuning
03 | Constitutional AI | Replacing human feedback with AI feedback guided by principles
04 | DPO and Modern Alignment | Direct preference optimization - simpler, often better than RLHF
05 | Red Teaming LLMs | Systematic adversarial evaluation before deployment
06 | Jailbreaks and Adversarial Prompts | How safety training gets bypassed and how to defend against it
07 | AI Safety Evals | Benchmarks and evaluation frameworks for safety properties
08 | EU AI Act and Regulation | The global regulatory landscape and its practical implications

Key Concepts

  • Alignment problem: The gap between what we specify in a reward and what we actually want
  • Goodhart's Law: When a measure becomes a target, it ceases to be a good measure
  • RLHF: Reinforcement Learning from Human Feedback - the dominant alignment technique in 2022–2023
  • Constitutional AI: Anthropic's technique for using AI to supervise AI using explicit principles
  • DPO: Direct Preference Optimization - eliminates the need for a separate reward model
  • Red teaming: Systematic adversarial testing to find failure modes before users do
  • Jailbreak: A prompt designed to bypass safety training
  • EU AI Act: The world's first comprehensive AI regulation; it entered into force in August 2024, with obligations phasing in over the following years
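
To make the DPO entry above concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It is an illustration rather than a reference implementation: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions made for this example. The inputs are the summed log-probabilities that the policy being trained and a frozen reference model assign to the preferred ("chosen") and dispreferred ("rejected") responses. Note that no separate reward model appears anywhere: the preference signal comes directly from the log-probability ratios.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair; all names here are illustrative."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # beta scales how strongly deviations from the reference model count.
    z = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: z = 0, so the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response: the loss falls below log(2).
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)

print(baseline, improved)
```

In a real training loop the log-probabilities would come from a forward pass over the response tokens and the loss would be averaged over a batch, but the per-pair computation has the same shape.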

Prerequisites

  • Module 7 - Transformers (attention, residual connections)
  • Module 12 - Training Dynamics (loss landscapes, gradient flow)
  • Module 14 - Instruction Tuning (SFT basics)
  • Module 15 - RLHF Fundamentals (a helpful foundation, but not required: this module is self-contained on RLHF)
© 2026 EngineersOfAI. All rights reserved.