11 docs tagged with "safety"

AI Safety Evaluations

Safety benchmarks, capability evaluations, LLM judges, uplift assessments, and how labs like Anthropic use evaluation-gated deployment through Responsible Scaling Policies.

Constitutional AI

How Anthropic replaced human harmlessness labels with AI feedback guided by explicit principles - the Constitutional AI technique, RLAIF, and how it enables scalable alignment.

DPO and Modern Alignment Techniques

Direct Preference Optimization and its successors - how DPO eliminates the need for a separate reward model and RL training, plus IPO, KTO, SimPO, and ORPO.

EU AI Act and Global AI Regulation

The EU AI Act, US executive orders, UK AI policy, China AI regulations, and practical compliance implications for AI engineers building and deploying language models.

Hallucination Risk in Legal AI

Why LLM hallucination can amount to malpractice in legal contexts, grounding strategies, citation verification pipelines, and architecture patterns for trustworthy legal AI.

Jailbreaks and Adversarial Prompts

How safety training gets bypassed - jailbreak taxonomy, GCG attacks, many-shot jailbreaking, prompt injection, defenses, and why the arms race is hard to win.

Module 16 - Alignment and Safety

A complete guide to AI alignment, RLHF, Constitutional AI, DPO, red teaming, jailbreaks, safety evaluations, and the global regulatory landscape.

Red Teaming LLMs

Systematic adversarial evaluation of language models - manual red teaming, automated red teaming with LLMs, failure taxonomies, and building a production red team process.

RLHF Deep Dive

A complete technical walkthrough of Reinforcement Learning from Human Feedback - the three-phase pipeline, reward models, PPO, KL penalty, and the limitations that led to newer approaches.

Evaluating open-source models for safety and bias before production deployment - red-teaming, toxicity measurement, demographic bias benchmarks, jailbreak robustness, and building end-to-end safety evaluation pipelines.

The Alignment Problem

Why making AI systems do what we actually want is harder than it looks - the specification problem, Goodhart's Law, reward hacking, and outer vs inner alignment.