
Constitutional AI

Reading time: 25 min | Relevance: AI Engineer, Research Engineer, ML Engineer


The Bottleneck Problem in RLHF

Six months after deploying your RLHF-trained assistant, you want to expand it to medical use cases. Your RLHF pipeline needs preference data - but medical preference data requires labelers who are doctors or supervised by doctors. Recruiting clinicians to label LLM outputs at scale is rarely feasible. Your annotation workforce can rank creative writing and basic coding help. They cannot reliably judge whether a medical response is correct, appropriately cautious, or dangerously overconfident.

This is the human bottleneck problem in RLHF. As you try to deploy aligned models in specialized domains - law, medicine, chemistry, financial advice - the labeler requirement becomes a hard constraint. You need domain experts to generate preferences, and domain experts are expensive, scarce, and unwilling to do annotation work at scale. Standard RLHF does not scale to these domains easily.

Anthropic's 2022 paper "Constitutional AI: Harmlessness from AI Feedback" (Bai et al.) proposed a solution. Instead of asking human labelers to evaluate responses, use the AI itself to evaluate its responses against a set of explicit principles - a "constitution." The principles are written by humans. The evaluation is done by AI. This dramatically reduces the dependency on human feedback while maintaining (and in some cases improving) alignment quality.

Constitutional AI enables a new paradigm: instead of encoding human values in a reward model trained from human preferences, you encode them in a written document of principles that the AI can reason about and apply. This is more transparent (the principles are readable), more scalable (the AI can apply them to any domain), and more controllable (you can update the constitution without retraining the reward model from scratch).


Historical Context

December 2022 - Yuntao Bai, Saurav Kadavath, et al. at Anthropic publish "Constitutional AI: Harmlessness from AI Feedback." This is the paper that introduces Constitutional AI (CAI).

Context at Anthropic: Anthropic was founded in 2021 by former OpenAI researchers with a specific focus on AI safety. Their research program includes both capability development and safety research. Constitutional AI emerged from their work on the HHH (Helpful, Harmless, Honest) framework - making AI systems that balance all three properties without requiring unlimited human annotation.

Scalable oversight: Constitutional AI is an instance of the broader "scalable oversight" research agenda. The core question: as AI systems become more capable, how do we supervise them using limited human oversight? If the AI is doing tasks humans cannot fully evaluate, we need AI assistance in the evaluation. Constitutional AI is one answer: use AI to evaluate AI, but constrain the AI evaluator with human-written principles.

RLAIF: The term "RLAIF" (Reinforcement Learning from AI Feedback) was popularized by the CAI paper. It refers to any RLHF-like technique where the preference labels come from an AI model rather than human labelers. Constitutional AI is the most developed example of RLAIF. Google followed up with their own RLAIF paper in 2023, showing RLAIF preference labels match human labels in quality on many tasks.


What Is a Constitution?

In Constitutional AI, a constitution is a set of explicit, written principles that define what a good AI response looks like. The principles cover harmlessness, helpfulness, and honesty. They are written in natural language, which means the AI can read and reason about them directly.

A sample of principles from Anthropic's published constitution:

  • "Choose the response that is least likely to contain harmful or unethical content, according to widely-shared moral intuitions."
  • "Choose the response that is most helpful, accurate, and in the best interest of the user and society."
  • "Which response is less likely to contain racist, sexist, or otherwise discriminatory content?"
  • "Choose the response that is less harmful for vulnerable groups in society."
  • "Choose the response that a reasonable, senior Anthropic employee would consider optimal given the goals of the user."

These are not vague slogans. They are specific enough that an AI model with strong language understanding can apply them to a concrete pair of responses and make a defensible judgment.

The original paper used a small set of principles (16 for the RL stage), which its authors describe as chosen in a fairly ad hoc way for research purposes. The constitution Anthropic later published for Claude draws on multiple sources, including the UN Universal Declaration of Human Rights, real-world content policies, principles proposed by other AI labs, and Anthropic's own research. The key insight is that writing these principles is much cheaper and faster than collecting thousands of human preference labels - and the principles can generalize better because they express underlying values rather than specific examples.


The Two-Phase CAI Pipeline

Constitutional AI consists of two phases, each using the constitution differently.

Phase 1: SL-CAI (Supervised Learning from Constitutional AI)

The goal of Phase 1 is to reduce obvious harmfulness using a self-critique-and-revision loop. No human labels are required - the AI critiques and revises its own responses.

Step 1 - Elicit responses: Start with adversarial "red team" prompts - prompts designed to elicit harmful, unethical, or dangerous behavior. Feed these to a helpful-only model (a model fine-tuned to be helpful without harmlessness training). This model will produce responses that attempt to fulfill harmful requests, which gives us something concrete to critique.

Step 2 - Critique: Feed the response back to the model with a critique request. Randomly select a principle from the constitution and ask the model whether the response violates it:

[Original response generated by helpful-only model]

Critique Request: Identify specific ways in which the
assistant's last response is harmful, unethical, racist,
sexist, toxic, dangerous, or illegal.

Critique: [Model generates a critique here]

Step 3 - Revise: Ask the model to revise its response to address the critique:

Revision Request: Please rewrite the assistant response
to remove all harmful, unethical, racist, sexist, toxic,
dangerous, or illegal content.

Revision: [Model generates revised response here]

Step 4 - Repeat: Repeat the critique-revision cycle multiple times (the paper uses 1–4 revisions). Each iteration further refines the response.

Step 5 - Fine-tune: Collect the (prompt, final_revision) pairs and fine-tune the original model on them using standard SFT. This is the SL-CAI model - a model that generates less harmful responses, trained on AI-generated data rather than human demonstrations.

The key finding in the paper: even a single critique-revision cycle substantially reduces harmful content. Multiple cycles produce diminishing returns but continued improvement.
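Step 5 can be sketched as a small data-preparation routine. This is a minimal example, assuming the critique-revision chains have already been generated; the `build_sft_records` helper and the `prompt` / `completion` field names are illustrative choices, not the paper's format.

```python
import json


def build_sft_records(revision_chains: dict[str, list[str]]) -> list[dict]:
    """Pair each prompt with the last revision in its chain.

    revision_chains maps a prompt to [initial_response, revision_1, ...].
    Chains with no revision (length < 2) are skipped, because there is
    no revised response to train on.
    """
    records = []
    for prompt, chain in revision_chains.items():
        if len(chain) < 2:
            continue
        records.append({"prompt": prompt, "completion": chain[-1]})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Placeholder strings stand in for real model outputs.
chains = {
    "[red-team prompt 1]": ["[initial response]", "[revision 1]", "[revision 2]"],
    "[red-team prompt 2]": ["[initial response]"],  # never revised, so skipped
}
sft_records = build_sft_records(chains)
print(to_jsonl(sft_records))
```

Only the final revision is kept as the training target; intermediate critiques and revisions are discarded once they have done their job.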

Phase 2: RL-CAI (Reinforcement Learning from Constitutional AI Feedback)

Phase 2 replaces the human preference labeling step in RLHF with AI-generated labels, guided by the constitution. This is RLAIF.

Step 1 - Generate response pairs: Sample pairs of responses to the same prompt from the SL-CAI model (or a helpful-only model).

Step 2 - AI preference labeling: Show the response pair to an AI "feedback model" (a capable LLM) along with a randomly selected constitution principle. Ask it to choose the better response:

Consider the following conversation and a pair of responses:

[Prompt]

Response A: [response 1]
Response B: [response 2]

Which response is less harmful, according to the following
principle: "Choose the response that is least likely to
produce harmful, dangerous, or misleading content."

(A) Response A
(B) Response B

The answer is: (

The model completes this with "A)" or "B)". This is the preference label.
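The label does not have to be a hard A/B choice. The CAI paper describes using the normalized probabilities of the two options as reward-model targets. The sketch below assumes your inference API can return log-probabilities for the "A" and "B" completions; `soft_preference` and `hard_label` are illustrative helper names.

```python
import math


def soft_preference(logprob_a: float, logprob_b: float) -> tuple[float, float]:
    """Normalize the log-probabilities of the 'A' and 'B' completions.

    Returns (p_a, p_b) with p_a + p_b == 1, computed as a two-way softmax.
    Subtracting the max first keeps exp() numerically stable.
    """
    m = max(logprob_a, logprob_b)
    ea = math.exp(logprob_a - m)
    eb = math.exp(logprob_b - m)
    total = ea + eb
    return ea / total, eb / total


def hard_label(p_a: float) -> str:
    """Collapse a soft label to 'A' or 'B' (ties go to 'A')."""
    return "A" if p_a >= 0.5 else "B"


# Example: the model puts 0.6 on "A" and 0.2 on "B" (the rest on other tokens).
p_a, p_b = soft_preference(math.log(0.6), math.log(0.2))
print(round(p_a, 3), round(p_b, 3), hard_label(p_a))
```

Soft labels preserve how strongly the feedback model prefers one response, which a hard label throws away.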

Step 3 - Train reward model: Train a reward model on these AI-generated preference labels, exactly as in standard RLHF. The difference is that the labels came from an AI guided by explicit principles rather than human labelers.
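A common choice for this step is a pairwise (Bradley-Terry style) loss over the chosen and rejected responses. The text does not name a specific loss, so treat this as an assumption about what "standard RLHF" means here; `pairwise_loss` is an illustrative name, and the rewards are plain floats rather than model outputs.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Implemented as softplus(-margin), split by sign so that exp() never
    receives a large positive argument.
    """
    margin = reward_chosen - reward_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the chosen response scores higher, large otherwise.
print(pairwise_loss(2.0, 0.0))
print(pairwise_loss(0.0, 2.0))
```

Minimizing this loss pushes the reward model to score the AI-preferred response above the rejected one, regardless of whether the label came from a human or a feedback model.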

Step 4 - RL fine-tuning: Apply PPO (or similar) with the trained reward model, with a KL penalty against the SL-CAI model as reference. This is standard RLHF from here.
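The KL penalty can be illustrated with a sequence-level reward adjustment. This is a simplified sketch, assuming per-token log-probabilities of the sampled response are available from both the policy and the reference model; `kl_shaped_reward` and the default `beta` are illustrative, not values from the paper.

```python
def kl_shaped_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.1,
) -> float:
    """Subtract a KL penalty from the reward-model score.

    The sum of (policy - reference) log-probabilities over the sampled
    tokens is a single-sample estimate of the KL divergence between the
    policy and the reference model for this response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# The policy assigns higher probability to its own sample than the
# reference does, so the shaped reward is lower than the raw reward.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(shaped)
```

The penalty keeps the policy from drifting far from the reference model just to exploit quirks in the reward model.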


Chain-of-Thought Critique: The Model Explains Its Reasoning

A key innovation in Constitutional AI is using chain-of-thought reasoning in the preference-labeling step. Instead of just asking the AI to label a preference, you ask it to first reason about which response is better:

Consider the following conversation:
[Prompt and responses]

Considering the following principle: "Choose the response
that is least likely to contain harmful content."

First, think through which response is better step by step.
Then identify which is less harmful.

Thinking: [Model reasons through the comparison]

The better response is: (A) or (B)

This chain-of-thought reasoning serves two purposes:

  1. Better labels: Models that reason before answering produce more accurate and consistent preference labels than models that answer directly.

  2. Transparency: The reasoning chain is human-readable. You can audit why the AI labeled one response as better, and catch systematic errors in the labeling.

The paper reports that chain-of-thought prompting improved the quality of AI-generated labels compared to direct labeling.
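One practical wrinkle: the CAI paper notes that chain-of-thought labels tend to be very confident (close to 0 or 1) and reports clamping them to a 40-60 percent range before training the preference model. The sketch below assumes soft labels expressed as the probability that response A is preferred; `clamp_soft_label` is an illustrative helper, and the range is something to tune rather than a fixed rule.

```python
def clamp_soft_label(p_a: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp the probability that response A is preferred to [low, high]."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    return min(max(p_a, low), high)


# Overconfident labels are pulled toward the middle; moderate ones pass through.
for p in (0.99, 0.5, 0.01):
    print(p, "->", clamp_soft_label(p))
```

Clamping trades label sharpness for robustness: the reward model still learns which response is preferred, but is not trained toward near-certain targets.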


The Constitution vs. Human Preferences: What Changes?

The shift from human preferences to a written constitution has several important consequences:

More consistent harmlessness without helpfulness regression: Human labelers, when shown harmful content, sometimes rate "refuse to answer" as the best response even for relatively benign requests, because refusal is always safe. The constitution can be written to discourage over-refusal explicitly, for example with a principle stating that the best response refuses only when necessary and is otherwise as helpful as possible. This allows the model to be helpful in edge cases where a labeler might default to refusal.

The paper's key finding: RL-CAI models had higher "harmlessness" ratings than RLHF models at equivalent helpfulness levels. The constitution trades off helpfulness and harmlessness more efficiently than human labelers, who tend to conflate the two.

Different biases: Human labelers encode the biases of a specific demographic (typically contractors in specific countries and regions). A constitution encodes the explicit biases of its authors. Both are biased; the constitutional approach makes the biases explicit and auditable.

No crowdsourcing cost: Generating AI preference labels costs essentially nothing beyond inference compute. Human preference labels cost roughly $0.50–$2.00 per comparison. At 100,000 comparisons (a small-scale run), that is $50k–$200k in annotation costs that AI labeling avoids.
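The arithmetic behind the annotation-cost estimate, using per-comparison prices of $0.50 and $2.00 at 100,000 comparisons. The `annotation_cost` helper is illustrative, and inference cost for AI labels is deliberately left out because it depends on the provider and model.

```python
def annotation_cost(n_comparisons: int, cost_per_comparison: float) -> float:
    """Total human-labeling cost in dollars."""
    return n_comparisons * cost_per_comparison


low = annotation_cost(100_000, 0.50)
high = annotation_cost(100_000, 2.00)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$50,000 to $200,000"
```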


How This Enables Claude: The HHH Objective

Constitutional AI is the foundation of how Anthropic trains Claude. The HHH framework - Helpful, Harmless, Honest - defines Anthropic's alignment objectives:

Helpful: The model should provide real, substantive help. Not watered-down, hedge-everything responses that avoid any possible controversy, but genuinely useful assistance that makes a meaningful difference.

Harmless: The model should not help users cause harm, should not produce content that could facilitate harm, and should not produce content that is harmful in and of itself.

Honest: The model should not try to create false impressions, should be calibrated about uncertainty, and should not deceive users.

These three objectives are often in tension. A maximally helpful model would help with any request, including harmful ones. A maximally harmless model would refuse anything with any risk of harm. A maximally honest model might say things users don't want to hear.

Constitutional AI navigates this three-way tension by giving the model explicit principles to reason against, rather than a scalar reward to maximize. This allows for nuanced, context-dependent tradeoffs that are difficult to encode in a single reward signal.

The constitution for Claude includes principles like:

  • Prefer being helpful over being cautious when the request is benign
  • Distinguish between actual harm and discomfort
  • Be honest even when it's not what the user wants to hear
  • Consider the counterfactual impact (would the information be easily available elsewhere?)

Code Example: Constitutional Critique-Revision Loop

import random
import re
from typing import Callable, Optional

from openai import OpenAI  # or anthropic.Anthropic for Claude

client = OpenAI()  # Replace with your preferred provider

CONSTITUTION_PRINCIPLES = [
    "Choose the response that is least likely to produce harmful, "
    "dangerous, or misleading content.",
    "Choose the response that is most helpful to the user without "
    "violating ethical guidelines.",
    "Prefer responses that avoid racism, sexism, or other forms of "
    "discrimination.",
    "Choose the response that would be appropriate for all audiences, "
    "including vulnerable groups.",
]


def get_completion(messages: list[dict], model: str = "gpt-4o") -> str:
    """Send a chat request and return the assistant's text (empty if none)."""
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content or ""


def critique_response(prompt: str, response: str, principle: str) -> str:
    """Phase 1 Step 2: Critique the response against a constitution principle."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
        {
            "role": "user",
            "content": (
                f"Critique Request: Identify specific ways in which the assistant's "
                f"response is harmful, unethical, or problematic according to this "
                f"principle: '{principle}'. Be specific and direct."
            ),
        },
    ]
    return get_completion(messages)


def revise_response(prompt: str, original_response: str, critique: str) -> str:
    """Phase 1 Step 3: Revise the response to address the critique."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": original_response},
        {
            "role": "user",
            "content": (
                f"Critique: {critique}\n\n"
                "Revision Request: Please rewrite the assistant response "
                "to remove all harmful, problematic, or unethical content "
                "while remaining as helpful as possible."
            ),
        },
    ]
    return get_completion(messages)


def constitutional_revision(
    prompt: str,
    initial_response: str,
    n_revisions: int = 2,
    verbose: bool = True,
) -> str:
    """
    Apply the SL-CAI critique-revision loop.
    Each iteration critiques and revises the current best response.
    """
    current_response = initial_response

    for i in range(n_revisions):
        # A fresh principle is sampled for every critique-revision cycle.
        principle = random.choice(CONSTITUTION_PRINCIPLES)
        if verbose:
            print(f"\n--- Revision {i+1} ---")
            print(f"Principle: {principle}")

        critique = critique_response(prompt, current_response, principle)
        if verbose:
            print(f"Critique: {critique[:200]}...")

        current_response = revise_response(prompt, current_response, critique)
        if verbose:
            print(f"Revised response: {current_response[:200]}...")

    return current_response


def ai_preference_label(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
) -> Optional[str]:
    """
    Phase 2: Generate an AI preference label for a response pair.
    Returns 'A' or 'B', or None if no final answer could be parsed.
    """
    system_msg = (
        "You are evaluating two AI responses to determine which is better "
        "according to a specific principle. Think step by step, then end "
        "with a line of the form 'Final answer: A' or 'Final answer: B'."
    )
    user_msg = (
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        f"Principle: {principle}\n\n"
        "Thinking step by step: which response better follows this principle? "
        "After your reasoning, finish with 'Final answer: A' or 'Final answer: B'."
    )
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
    response = get_completion(messages)

    # Take the last explicit "Final answer: X" so that mentions of
    # "Response A" or "Response B" inside the reasoning are not mistaken
    # for the label. Returning None on failure avoids silently biasing
    # the dataset toward one option.
    matches = re.findall(r"Final answer:\s*\(?([AB])\b", response, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


def generate_rlaif_preference_dataset(
    prompts: list[str],
    model_responses_func: Callable[[str], tuple[str, str]],  # Generates a response pair
    principles: Optional[list[str]] = None,
) -> list[dict]:
    """
    Generate RLAIF preference dataset.
    Returns a list of records with prompt, chosen, and rejected responses.
    Pairs whose label cannot be parsed are skipped.
    """
    if principles is None:
        principles = CONSTITUTION_PRINCIPLES

    dataset = []
    for prompt in prompts:
        response_a, response_b = model_responses_func(prompt)
        principle = random.choice(principles)

        preferred = ai_preference_label(prompt, response_a, response_b, principle)
        if preferred is None:
            continue  # Unparseable label: drop the pair rather than guess

        chosen = response_a if preferred == "A" else response_b
        rejected = response_b if preferred == "A" else response_a

        dataset.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "principle_used": principle,
            "ai_preferred": preferred,
        })

    return dataset

Comparing CAI to Standard RLHF

The original paper directly compared Constitutional AI to RLHF trained with human preference labels. Key findings:

Harmlessness: RL-CAI models had higher harmlessness ratings than RLHF models at the same helpfulness level. The constitution provides more consistent harmlessness guidance than human labelers, who vary in their sensitivity to different harm categories.

Helpfulness: RL-CAI models maintained higher helpfulness than the RLHF models. This is counterintuitive - you'd expect that human feedback would be better at capturing helpfulness since humans know what they find helpful. But the effect was driven by over-refusal in RLHF: human labelers tend to rate refusals as safe and therefore acceptable, even when refusal is unnecessary. The constitution can explicitly deprioritize unnecessary refusal.

Transparency: The constitutional principles are readable. When the model behaves unexpectedly, you can reason about which principle it might be following. With RLHF, the model's behavior reflects the aggregate of thousands of preference comparisons - interpretable only statistically.

Cost: AI feedback labels are essentially free relative to human labels. The main cost is inference compute for the feedback model. This allowed Anthropic to use 10–100× more preference comparisons than they could have with human labeling, which improved reward model quality.


Scalable Oversight: Using AI to Supervise AI

Constitutional AI is a specific instantiation of the broader scalable oversight agenda. The argument:

  1. As AI becomes more capable, the tasks it performs will exceed human ability to evaluate directly.
  2. If we can't evaluate AI outputs directly, RLHF becomes impossible (we can't generate reliable preference labels).
  3. We need techniques that allow AI to assist in its own supervision while remaining aligned with human values.

Constitutional AI addresses this by separating the value specification problem (writing the constitution, which humans can do) from the value application problem (judging whether a response follows the constitution, which AI can do).

This separation is crucial. Humans are good at stating values in the abstract ("responses should not be harmful") but poor at consistently applying those values across thousands of cases. AI models, once capable enough, are better at consistent application of stated values. Constitutional AI plays to each agent's comparative advantage.

The broader implication: as models become more capable, the constitution can be updated to address new risk categories, and the model can apply the updated constitution immediately - without retraining the reward model from scratch.


Production Engineering Notes

Implementing CAI in practice

The critique-revision loop can be implemented with any capable language model as the feedback model. Key considerations:

Feedback model quality: The quality of the AI preference labels depends on the quality of the feedback model. A weak feedback model produces noisy labels. For best results, use a model at least as capable as the model being aligned. Anthropic uses a version of Claude itself as the feedback model for training newer Claude versions.

Constitution design: The constitution must be written carefully. Principles that are too broad ("avoid harmful content") produce noisy labels because "harmful" is vague. Principles that are too narrow miss important harm categories. Multiple principles covering different harm taxonomies (physical harm, psychological harm, privacy, bias, deception) produce better coverage.

Consistency across revisions: Run multiple independent critique-revision cycles (with different randomly sampled principles) and select the best revision, rather than sequentially applying multiple critiques. Sequential critiques can cause the model to optimize for the last principle at the expense of earlier ones.

Balancing helpfulness: The most common failure mode of naive CAI implementations is over-refusal. Include explicit helpfulness principles in the constitution ("unnecessary refusal is a form of harm because it fails to help users with legitimate needs") to counteract the harmlessness bias.
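The "independent cycles, then select the best" idea from the consistency note can be sketched as follows. This is a minimal example under stated assumptions: `revise_once` and `score` are hypothetical callables you would back with a model (for instance a critique-revision call and a reward-model score), and the stand-ins below exist only so the sketch runs without one.

```python
from typing import Callable


def best_independent_revision(
    initial_response: str,
    principles: list[str],
    revise_once: Callable[[str, str], str],
    score: Callable[[str], float],
) -> str:
    """Revise the initial response once per principle and keep the best.

    Every revision starts from the same initial response, so no principle
    overwrites the effect of another. The unrevised response is also a
    candidate, so the result never scores lower than the starting point.
    """
    candidates = [initial_response]
    for principle in principles:
        candidates.append(revise_once(initial_response, principle))
    return max(candidates, key=score)


# Stand-in callables (placeholders, not a real critique or reward model).
def fake_revise(response: str, principle: str) -> str:
    return f"{response} [revised for: {principle}]"


def fake_score(response: str) -> float:
    return float(len(response))


best = best_independent_revision(
    "draft answer",
    ["avoid harm", "be helpful and specific"],
    fake_revise,
    fake_score,
)
print(best)
```

The cost is one extra scoring pass per candidate; in exchange, each principle is applied to the same starting point instead of to the output of the previous revision.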

When to use CAI vs. standard RLHF

Use CAI when:

  • Human annotation at scale is infeasible (domain expertise, cost, speed)
  • You need transparent, auditable alignment behavior
  • You're updating alignment objectives frequently and can't afford reward model retraining

Use standard RLHF when:

  • You have access to high-quality human preference data in your target domain
  • The task involves subjective quality that's hard to capture in written principles (creative writing, humor, personal communication style)
  • You need to capture the nuanced preference distribution of a specific user population

In practice, most production systems combine both: a constitution-based reward model provides baseline alignment, supplemented by targeted human preference data in high-stakes domains.


Common Mistakes

:::danger Using a weak model as the feedback model

The quality of AI preference labels is bounded by the quality of the feedback model. If you use a small, weak model to generate preference labels, you'll train a reward model on noisy, inconsistent labels. Use the strongest available model as the feedback model, even if it's much larger than the model being aligned.

:::

:::danger Writing vague constitution principles

"Avoid harmful content" is not a useful constitution principle because it's ambiguous. What counts as harmful? The feedback model's interpretation will be inconsistent. Instead write specific, grounded principles: "Avoid content that provides instructions for creating weapons capable of mass casualties." The more specific the principle, the more consistent the AI labels.

:::

:::warning Ignoring over-refusal in CAI implementations

Naive constitutional implementations tend to produce over-refusal because harmlessness principles dominate helpfulness principles. Explicitly include helpfulness principles in your constitution and measure refusal rate on legitimate requests as a key evaluation metric.

:::

:::warning Treating the constitution as fixed forever

The constitution should be treated as a living document. As you discover failure modes in deployment, add or refine principles. As new harm categories emerge (e.g., deepfake-related harms as technology evolves), add principles addressing them. Constitutional AI's advantage over RLHF is precisely that the constitution can be updated without full reward model retraining.

:::

:::tip Use chain-of-thought reasoning in AI labels

Always use chain-of-thought prompting when generating AI preference labels. Ask the model to reason before labeling. This produces more consistent, accurate labels and gives you a readable audit trail. The reasoning chains also help you identify systematic errors in how the model is interpreting the constitution.

:::
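Measuring refusal rate on legitimate requests can start with something as simple as the sketch below. It assumes refusals open with one of a few stock phrases; the `REFUSAL_MARKERS` list and both helper names are illustrative, and a keyword heuristic like this will miss polite or partial refusals, so treat it as a first-pass signal rather than a definitive metric.

```python
# Illustrative opening phrases; extend or replace for your own model's style.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def is_refusal(response: str) -> bool:
    """Heuristic: the response starts with a known refusal phrase."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Responses to prompts that are known to be legitimate requests.
benign_responses = [
    "I can't help with that.",
    "Sure, here is how to reset your router.",
    "I cannot assist with this request.",
    "Here are three options to consider.",
]
print(refusal_rate(benign_responses))
```

Tracking this number across constitution changes shows whether a new harmlessness principle is pushing the model toward refusing requests it should answer.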


Interview Q&A

Q1: What problem does Constitutional AI solve that RLHF cannot?

Constitutional AI solves the human annotation bottleneck. Standard RLHF requires human labelers to evaluate response pairs across all tasks the model will perform. In specialized domains (medicine, law, chemistry), this requires domain-expert annotators who are expensive and scarce. For broad-coverage general assistants, it requires massive annotation workforces with consistent calibration.

Constitutional AI replaces human preference labeling with AI preference labeling guided by written principles (the constitution). The principles encode the values; the AI applies them. This is much cheaper (AI inference vs. human labeling), faster (parallelizable), and more consistent (the AI applies the same principles every time). It also makes the alignment objectives explicit and auditable - you can read the constitution and understand what the model is being trained to optimize.

Q2: Explain the difference between SL-CAI and RL-CAI.

SL-CAI (Supervised Learning from CAI) is Phase 1. It uses a self-critique-and-revision loop: the model generates a response, critiques it against a constitution principle, revises it, and repeats. The (prompt, final_revision) pairs are used as supervised fine-tuning data. No separate reward model is trained; the output is a fine-tuned model that generates less harmful responses.

RL-CAI (Reinforcement Learning from CAI) is Phase 2. It generates preference labels using an AI feedback model that evaluates response pairs against constitution principles. These AI-generated labels are used to train a reward model, which then guides PPO fine-tuning exactly as in standard RLHF. The difference from standard RLHF is only in the source of preference labels: AI instead of humans.

Q3: What is RLAIF, and how does it compare to RLHF in practice?

RLAIF (Reinforcement Learning from AI Feedback) is RLHF where the preference labels come from an AI model rather than human labelers. Constitutional AI is the most principled form of RLAIF - the AI generates labels guided by a written constitution.

In practice, Google's 2023 paper "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (Lee et al.) showed that RLAIF is competitive with RLHF on the tasks it studied: on summarization, RLAIF models matched RLHF models in human evaluations. Tasks requiring nuanced human judgment (creative writing, cultural sensitivity) are less well covered by such results, so a gap may remain there. This suggests RLAIF is a practical substitute for human labels in many alignment scenarios, with targeted human feedback retained for the highest-stakes quality dimensions.

Q4: How does the self-critique-and-revision loop work in Phase 1?

The SL-CAI loop works as follows. First, generate an initial response to an adversarial prompt using a helpful-but-unaligned model. Second, prompt the same model to critique the response against a randomly selected constitution principle - asking it to identify specific ways the response is harmful or problematic. Third, prompt the model to revise the response to address the critique. Repeat steps 2 and 3 for 1–4 iterations. Use the final revised response as a training example.

The critique step is the key innovation. By prompting the model to articulate why a response is problematic, you leverage the model's existing knowledge of what constitutes harm - knowledge it has from pre-training on human text - and force it to apply that knowledge to its own outputs. The revision then uses the articulated critique as a conditioning signal to generate a better response.

Q5: What are the limitations of Constitutional AI?

The main limitations are:

  1. Feedback model dependency: AI labels are only as good as the feedback model. If the feedback model has blind spots or biases, they propagate into training.
  2. Value specification difficulty: Writing a good constitution is hard. Principles that are too vague produce inconsistent labels; principles that are too narrow miss important cases.
  3. Inability to capture subjective preferences: Some quality dimensions (humor, creativity, personal communication style) are hard to specify in a constitution because they depend on individual taste.
  4. Potential for value lock-in: The constitution encodes the values of its authors. If those values are systematically wrong or incomplete, CAI will faithfully implement those wrong values at scale.
  5. Self-referential problems: Using an AI to judge an AI's alignment can create loops where both models reinforce each other's biases.


Summary

Constitutional AI is Anthropic's solution to the human annotation bottleneck in RLHF. Instead of expensive, slow human preference labeling, it uses AI feedback guided by a written constitution of principles.

The two-phase pipeline:

  • Phase 1 (SL-CAI): Critique-revision loop using constitution principles. Generates supervised training data without human labels.
  • Phase 2 (RL-CAI): RLAIF - AI-generated preference labels train a reward model, which guides PPO fine-tuning.

Key advantages over RLHF:

  • Dramatically reduced annotation cost
  • Explicit, auditable alignment objectives
  • Better harmlessness-helpfulness tradeoff (less over-refusal)
  • Updateable without full reward model retraining

Key limitations:

  • Quality bounded by feedback model capability
  • Difficult to specify subjective quality dimensions in a constitution
  • Encodes values of constitution authors

Constitutional AI is the foundation of how Claude is trained, and its principles - separating value specification from value application, using AI to assist in its own alignment - are increasingly central to production alignment engineering as models become more capable.

© 2026 EngineersOfAI. All rights reserved.