AI Safety Evaluations
Safety benchmarks, capability evaluations, LLM judges, uplift assessments, and how labs like Anthropic use evaluation-gated deployment through Responsible Scaling Policies.
Safety benchmarks, capability evaluations, LLM judges, uplift assessments, and how labs like Anthropic use evaluation-gated deployment through Responsible Scaling Policies.
How Anthropic replaced human feedback with AI feedback guided by explicit principles - the Constitutional AI technique, RLAIF, and how it enables scalable alignment.
Direct Preference Optimization and its successors - how DPO eliminates the need for a separate reward model and RL training, plus IPO, KTO, SimPO, and ORPO.
Master DPO - the elegant insight that you can optimize LLMs for human preferences without training a reward model or running RL, derived directly from the optimal RLHF policy.
The EU AI Act, US executive orders, UK AI policy, China AI regulations, and practical compliance implications for AI engineers building and deploying language models.
How safety training gets bypassed - jailbreak taxonomy, GCG attacks, many-shot jailbreaking, prompt injection, defenses, and why the arms race is hard to win.
Survey the post-RLHF alignment landscape - RLAIF, Constitutional AI, rejection sampling fine-tuning, iterative DPO, process reward models, and the open questions shaping the next generation of aligned models.
A complete guide to AI alignment, RLHF, Constitutional AI, DPO, red teaming, jailbreaks, safety evaluations, and the global regulatory landscape.
Systematic adversarial evaluation of language models - manual red teaming, automated red teaming with LLMs, failure taxonomies, and building a production red team process.
From InstructGPT to DPO to ORPO. Read the 7 most important alignment papers in order — understanding how LLMs are made to follow human intent.
A complete technical walkthrough of Reinforcement Learning from Human Feedback - the three-phase pipeline, reward models, PPO, KL penalty, and the limitations that led to newer approaches.
Understand how RLHF aligns LLMs with human preferences through three phases - SFT, reward model training, and PPO - and why it produced InstructGPT's surprising result that smaller aligned models beat larger unaligned ones.
Why making AI systems do what we actually want is harder than it looks - the specification problem, Goodhart's Law, reward hacking, and outer vs inner alignment.