13 docs tagged with "alignment"

AI Safety Evaluations

Safety benchmarks, capability evaluations, LLM judges, uplift assessments, and how labs like Anthropic use evaluation-gated deployment through Responsible Scaling Policies.

Constitutional AI

How Anthropic replaced human harmlessness labels with AI feedback guided by explicit principles - the Constitutional AI technique, RLAIF, and how it enables scalable alignment.

DPO and Modern Alignment Techniques

Direct Preference Optimization and its successors - how DPO eliminates the need for a separate reward model and RL training, plus IPO, KTO, SimPO, and ORPO.

DPO: Direct Preference Optimization

Master DPO - the elegant insight that you can optimize LLMs for human preferences without training a reward model or running RL, derived directly from the optimal RLHF policy.

EU AI Act and Global AI Regulation

The EU AI Act, US executive orders, UK AI policy, China AI regulations, and practical compliance implications for AI engineers building and deploying language models.

Jailbreaks and Adversarial Prompts

How safety training gets bypassed - jailbreak taxonomy, GCG attacks, many-shot jailbreaking, prompt injection, defenses, and why the arms race is hard to win.

Survey the post-RLHF alignment landscape - RLAIF, Constitutional AI, rejection sampling fine-tuning, iterative DPO, process reward models, and the open questions shaping the next generation of aligned models.

Module 16 - Alignment and Safety

A complete guide to AI alignment, RLHF, Constitutional AI, DPO, red teaming, jailbreaks, safety evaluations, and the global regulatory landscape.

Red Teaming LLMs

Systematic adversarial evaluation of language models - manual red teaming, automated red teaming with LLMs, failure taxonomies, and building a production red team process.

Research Roadmap: RLHF & Alignment

From InstructGPT to DPO to ORPO. Read the 7 most important alignment papers in order, and understand how LLMs are made to follow human intent.

RLHF Deep Dive

A complete technical walkthrough of Reinforcement Learning from Human Feedback - the three-phase pipeline, reward models, PPO, KL penalty, and the limitations that led to newer approaches.

RLHF: Reinforcement Learning from Human Feedback

Understand how RLHF aligns LLMs with human preferences through three phases - SFT, reward model training, and PPO - and how it produced InstructGPT's surprising result that human raters can prefer a smaller aligned model over a much larger unaligned one.

The Alignment Problem

Why making AI systems do what we actually want is harder than it looks - the specification problem, Goodhart's Law, reward hacking, and outer vs inner alignment.