Module 16 - Alignment and Safety
Training a language model to be capable is now largely a matter of scale and engineering. Training it to be safe, helpful, and aligned with human values remains one of the hardest open problems in AI.
This module covers everything you need to know about making AI systems behave the way we actually want - from the foundational alignment problem, through the major training techniques (RLHF, Constitutional AI, DPO), to adversarial testing and the emerging global regulatory landscape.
Lessons in This Module
| # | Lesson | Core Concept |
|---|---|---|
| 01 | The Alignment Problem | Why specifying what we want is harder than training what we specify |
| 02 | RLHF Deep Dive | Three-phase pipeline: SFT → Reward Model → PPO fine-tuning |
| 03 | Constitutional AI | Replacing human feedback with AI feedback guided by principles |
| 04 | DPO and Modern Alignment | Direct preference optimization - simpler than RLHF and often competitive with it |
| 05 | Red Teaming LLMs | Systematic adversarial evaluation before deployment |
| 06 | Jailbreaks and Adversarial Prompts | How safety training gets bypassed and how to defend against it |
| 07 | AI Safety Evals | Benchmarks and evaluation frameworks for safety properties |
| 08 | EU AI Act and Regulation | The global regulatory landscape and its practical implications |
Key Concepts
- Alignment problem: The gap between what we specify in a reward and what we actually want
- Goodhart's Law: When a measure becomes a target, it ceases to be a good measure
- RLHF: Reinforcement Learning from Human Feedback - the dominant alignment technique of 2022–2023
- Constitutional AI: Anthropic's technique for having AI supervise AI according to explicit principles
- DPO: Direct Preference Optimization - eliminates the need for a separate reward model
- Red teaming: Systematic adversarial testing to find failure modes before users do
- Jailbreak: A prompt designed to bypass safety training
- EU AI Act: The world's first comprehensive AI regulation, in force since August 2024 with obligations phasing in over the following years
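
To make the DPO entry above concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of a chosen and a rejected response under both the policy being trained and a frozen reference model. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift from the reference
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    # Implicit "rewards" are log-probability ratios against the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return _softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy favors the chosen response more than the reference does: the loss is lower.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

No reward model appears anywhere in the sketch: the preference signal comes directly from the log-probability ratios, which is why DPO can skip the separate reward-modeling phase of RLHF.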
Prerequisites
- Module 7 - Transformers (attention, residual connections)
- Module 12 - Training Dynamics (loss landscapes, gradient flow)
- Module 14 - Instruction Tuning (SFT basics)
- Module 15 - RLHF Fundamentals (a helpful foundation, but this module is self-contained)
© 2026 EngineersOfAI. All rights reserved.
