Module 16 - Alignment and Safety

Training a language model to be capable is, by now, largely a matter of scale and engineering. Training it to be safe, helpful, and aligned with human values remains one of the hardest open problems in AI.

This module covers everything you need to know about making AI systems behave the way we actually want - from the foundational alignment problem, through the major training techniques (RLHF, Constitutional AI, DPO), to adversarial testing and the emerging global regulatory landscape.

Module Map

Lessons in This Module

#	Lesson	Core Concept
01	The Alignment Problem	Why specifying what we want is harder than training what we specify
02	RLHF Deep Dive	Three-phase pipeline: SFT → Reward Model → PPO fine-tuning
03	Constitutional AI	Replacing much of the human feedback with AI feedback guided by explicit principles
04	DPO and Modern Alignment	Direct preference optimization - simpler than RLHF and often competitive with it
05	Red Teaming LLMs	Systematic adversarial evaluation before deployment
06	Jailbreaks and Adversarial Prompts	How safety training gets bypassed and how to defend against it
07	AI Safety Evals	Benchmarks and evaluation frameworks for safety properties
08	EU AI Act and Regulation	The global regulatory landscape and its practical implications

Key Concepts

Alignment problem: The gap between what we specify in a reward and what we actually want
Goodhart's Law: When a measure becomes a target, it ceases to be a good measure
RLHF: Reinforcement Learning from Human Feedback - the dominant alignment technique of 2022–2023
Constitutional AI: Anthropic's technique for having AI supervise AI according to an explicit set of written principles
DPO: Direct Preference Optimization - eliminates the need for a separate reward model
Red teaming: Systematic adversarial testing to find failure modes before users do
Jailbreak: A prompt designed to bypass safety training
EU AI Act: The world's first comprehensive AI regulation, in force since August 2024 with obligations phasing in over the following years
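
To make the DPO entry above concrete, here is a minimal sketch of the loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function names, argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is a sequence log-probability (hypothetical inputs,
    computed elsewhere). beta scales how strongly the policy is pushed
    away from the reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Implicit reward difference; no separate reward model is trained.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: loss drops.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The point of the sketch is that the preference signal enters the loss directly through log-probability ratios, which is why no reward model appears anywhere in the code. In practice this would be computed over batches with an autodiff framework rather than on plain floats.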

Prerequisites

Module 7 - Transformers (attention, residual connections)
Module 12 - Training Dynamics (loss landscapes, gradient flow)
Module 14 - Instruction Tuning (SFT basics)
Module 15 - RLHF Fundamentals (a helpful foundation, but this module is self-contained)
