
Debate and Critique Patterns

Reading time: ~35 minutes | Relevance: High for output quality improvement | Target roles: AI Engineer, Research Engineer, ML Engineer


The Scenario

You're using an LLM to generate investment research on a startup. The agent produces a confident, well-written analysis. It recommends investment. The analysis is smooth, logical, and completely wrong about the company's market size - by a factor of 10.

The agent didn't know it was wrong. It generated the number, found it plausible in context, and continued. No alarm bells.

Now you add a second agent whose only job is to critique the first agent's output. The critic reads the analysis and flags: "The cited market size of $500B seems inconsistent with the described niche product category. This warrants verification."

The critic caught what the generator missed. LLMs sound confident even when they are wrong, but a second LLM critiquing the output often catches errors the original missed. The debate/critique pattern is one of the most effective quality improvement techniques in multi-agent systems.



Why This Exists

The critique pattern emerges from a fundamental property of how LLMs generate text: generation is autoregressive, so every token is conditioned on everything already written, and once a direction is established in the context, subsequent tokens tend to reinforce it. An LLM that commits to "market size is $500B" at token 300 will tend to find reasons to support that claim in tokens 400-1000. It's not lying - it's continuing the text in the most probable way given what it has already said.

A fresh LLM seeing the same text cold has no prior commitment. It reads "market size is $500B for a niche B2B product" and its probability distribution says: "this seems inconsistent." The cold read is more critical than the hot read.

This is also why peer review works in science: the author knows what they meant to say; the reviewer sees what was actually said.

The empirical evidence is encouraging: the multi-agent debate work of Du et al. (2023), which frames the approach as a "society of minds", reported improved factual accuracy and better results on math and reasoning tasks. Constitutional AI (Anthropic, 2022) used model self-critique and revision against written principles as part of alignment training. Reflexion (Shinn et al., 2023) showed that agents improve on later attempts when they reflect verbally on feedback from their own failed trials.


The Verifier/Critic Pattern

The simplest and most common critique pattern:

[Generator Agent] → draft
        ↓
[Critic Agent] → {issues: [...], verdict: APPROVE/REVISE}
        ↓
[Generator Agent] → revised draft (with critique in context)

The critic's job is focused: it doesn't rewrite, it doesn't improve. It identifies problems. The generator fixes them with the critique as context.

This separation matters:

  • Critic prompt: "What is wrong with this output? Be specific and harsh."
  • Generator prompt: "Here are the identified issues. Revise your output to address them."

Combining into one prompt ("Write good content and critique yourself") is less effective. The roles are cognitively distinct.

import json
import re
from typing import Optional

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"


def generator(task: str, prior_critique: Optional[str] = None) -> str:
    """Write a draft, or a revision when a prior critique is supplied."""
    context = f"\n\nPrior critique to address:\n{prior_critique}\n" if prior_critique else ""
    response = client.messages.create(
        model=MODEL,
        max_tokens=800,
        system=(
            "You are an expert writer and analyst. Produce high-quality, "
            "accurate, specific output. Address any critique provided."
        ),
        messages=[{"role": "user", "content": f"Task: {task}{context}"}],
    )
    return response.content[0].text


def critic(task: str, draft: str) -> dict:
    """Review a draft and return {"verdict": ..., "issues": [...]}."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=600,
        system=(
            "You are a strict critic. Review outputs for: "
            "(1) factual accuracy, (2) logical consistency, "
            "(3) completeness, (4) unsupported claims. "
            "Be specific. Output JSON: "
            '{"verdict": "APPROVE" or "REVISE", "issues": ["issue1", ...]}'
        ),
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nOutput to critique:\n{draft}",
        }],
    )
    raw = response.content[0].text
    # Grab the outermost {...} span in case the model wraps the JSON in prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the default below
    # Note: this default fails open, so an unparseable critique counts as approval.
    return {"verdict": "APPROVE", "issues": []}


def critique_loop(task: str, max_rounds: int = 2) -> str:
    """Generate, critique, and revise until the critic approves or max_rounds is hit."""
    draft = generator(task)
    for round_num in range(max_rounds):
        result = critic(task, draft)
        verdict = result.get("verdict", "REVISE")
        print(f"[Round {round_num + 1}] Verdict: {verdict}")
        if verdict == "APPROVE":
            break
        critique_text = "\n".join(f"- {issue}" for issue in result.get("issues", []))
        draft = generator(task, prior_critique=critique_text)
    return draft
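
The critic above treats any reply it cannot parse as an approval, which means a malformed critique silently passes the draft through. One possible alternative, sketched here under assumptions, is to fail closed: anything that is not valid JSON with a recognized verdict becomes REVISE. The name `parse_critic_output` and the extra `parsed` flag are hypothetical and not part of the original code.

```python
import json
import re

VALID_VERDICTS = {"APPROVE", "REVISE"}


def parse_critic_output(raw: str) -> dict:
    """Parse a critic reply; anything malformed becomes a REVISE verdict."""
    # Built fresh on every call so callers can mutate the result safely.
    fallback = {
        "verdict": "REVISE",
        "issues": ["Critic reply was not valid JSON."],
        "parsed": False,
    }
    # Grab the outermost {...} span in case the model wraps the JSON in prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return fallback
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = str(data.get("verdict", "")).upper()
    if verdict not in VALID_VERDICTS:
        return fallback
    issues = data.get("issues") or []
    if not isinstance(issues, list):
        issues = [issues]
    return {"verdict": verdict, "issues": [str(i) for i in issues], "parsed": True}


ok = parse_critic_output('Sure! {"verdict": "approve", "issues": []}')
bad = parse_critic_output("I think it looks fine.")
print(ok["verdict"], bad["verdict"])
```

A caller can use the `parsed` flag to retry the critic instead of sending a placeholder issue back to the generator.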

Multi-Agent Debate

When the answer isn't obvious, having multiple agents argue for different positions and a judge agent synthesize them often produces better results than a single agent working alone.

[Agent A] → position_A (argues for answer X)
[Agent B] → position_B (argues for answer Y)
[Agent C] → position_C (challenges both X and Y)
        ↓
[Judge Agent] → reads all positions → final_answer

This is most powerful for:

  • Ambiguous analytical questions ("Is this business model viable?")
  • Policy decisions with genuine tradeoffs
  • Tasks where multiple valid approaches exist
  • Any situation where the first answer shouldn't be the last answer

import asyncio

import anthropic

# MODEL is defined in the verifier/critic example above.


async def multi_agent_debate(question: str, num_rounds: int = 2) -> dict:
    """
    Three agents debate a question over multiple rounds.
    Judge synthesizes at the end.
    """
    async_client = anthropic.AsyncAnthropic()

    async def debater(name: str, role: str, question: str, debate_history: str = "") -> str:
        history_context = f"\n\nDebate so far:\n{debate_history}" if debate_history else ""
        r = await async_client.messages.create(
            model=MODEL, max_tokens=500,
            system=f"You are {name}. {role} Argue your position clearly and specifically.",
            messages=[{"role": "user", "content": f"Question: {question}{history_context}\n\nState your position:"}],
        )
        return r.content[0].text

    async def judge(question: str, full_debate: str) -> str:
        r = await async_client.messages.create(
            model=MODEL, max_tokens=600,
            system=(
                "You are an impartial judge. Review the debate and provide a final, "
                "well-reasoned conclusion that incorporates the strongest arguments from all sides."
            ),
            messages=[{"role": "user", "content": f"Question: {question}\n\nDebate:\n{full_debate}\n\nFinal judgment:"}],
        )
        return r.content[0].text

    # Fixed roles: each agent keeps its stance for the whole debate.
    roles = [
        ("Advocate", "Argue strongly FOR a positive answer. Find supporting evidence and reasoning."),
        ("Skeptic", "Argue strongly AGAINST or identify flaws and risks. Be critical."),
        ("Contrarian", "Challenge assumptions of both sides. Identify what both are missing."),
    ]

    debate_log = []
    debate_history = ""

    for round_num in range(num_rounds):
        print(f"\n[Debate] Round {round_num + 1}")
        # Within a round the debaters run concurrently and see only earlier rounds.
        round_positions = await asyncio.gather(*[
            debater(name, role, question, debate_history)
            for name, role in roles
        ])

        for (name, _), position in zip(roles, round_positions):
            entry = f"[{name}]: {position}"
            debate_log.append(entry)
            print(f"  {name}: {position[:100]}...")

        debate_history = "\n\n".join(debate_log)

    print("\n[Debate] Judge synthesizing...")
    final = await judge(question, debate_history)

    return {
        "question": question,
        "rounds": num_rounds,
        "positions": debate_log,
        "final_judgment": final,
    }

Ensemble Approaches

Run the same task with N agents. Each produces an independent answer. Aggregate via majority vote or synthesis.

import asyncio

import anthropic

# MODEL is defined in the verifier/critic example above.


async def ensemble_answer(question: str, n: int = 5) -> dict:
    """N independent agents answer the same question; a judge call synthesizes them."""
    async_client = anthropic.AsyncAnthropic()

    async def one_agent(agent_id: int) -> str:
        r = await async_client.messages.create(
            model=MODEL, max_tokens=200,
            system=f"Agent {agent_id}: Answer concisely and directly.",
            messages=[{"role": "user", "content": question}],
        )
        return r.content[0].text

    # Agents run concurrently and never see each other's answers.
    answers = await asyncio.gather(*[one_agent(i) for i in range(n)])

    print(f"\n[Ensemble] {n} answers collected:")
    for i, ans in enumerate(answers):
        print(f"  Agent {i+1}: {ans[:80]}...")

    # Synthesize with a judge
    synthesis_prompt = (
        f"Question: {question}\n\n"
        f"Answers from {n} independent agents:\n"
        + "\n".join(f"Agent {i+1}: {ans}" for i, ans in enumerate(answers))
        + "\n\nSynthesize the most accurate answer based on consensus:"
    )
    r = await async_client.messages.create(
        model=MODEL, max_tokens=300,
        messages=[{"role": "user", "content": synthesis_prompt}],
    )

    return {
        "n_agents": n,
        "individual_answers": list(answers),
        "synthesized_answer": r.content[0].text,
    }
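
The function above aggregates by synthesis. For short factual or numerical answers, the majority-vote option could look something like the following minimal sketch. The helper names `normalize` and `majority_vote`, the `min_share` threshold, and the sample answers are all illustrative assumptions; free-form answers would need a stronger normalization step (or an LLM-based equivalence check) before voting makes sense.

```python
import re
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim, and drop trailing punctuation so trivially different strings match."""
    return re.sub(r"[\s.!?]+$", "", answer.strip().lower())


def majority_vote(answers: list[str], min_share: float = 0.5) -> dict:
    """Return the most common normalized answer and whether it exceeds min_share."""
    if not answers:
        return {"winner": None, "share": 0.0, "has_majority": False}
    counts = Counter(normalize(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    share = votes / len(answers)
    return {"winner": winner, "share": share, "has_majority": share > min_share}


# Hypothetical answers from five independent agents.
sample = ["42", "42.", " 42 ", "41", "forty-two"]
result = majority_vote(sample)
print(result)
```

When `has_majority` is false, falling back to the judge synthesis is one reasonable option.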

Self-Critique: Less Effective, But Useful

A single agent critiquing its own output works - but is significantly less effective than a separate critic. The main value: catching obvious errors and forcing structured reflection.

def self_critique(task: str) -> str:
    """
    Single agent writes, then critiques its own output in a second call.
    Less effective than separate critic, but cheap.
    """
    # Step 1: Generate
    draft_response = client.messages.create(
        model=MODEL, max_tokens=800,
        messages=[{"role": "user", "content": task}],
    )
    draft = draft_response.content[0].text

    # Step 2: Self-critique (same model, fresh call)
    critique_response = client.messages.create(
        model=MODEL, max_tokens=400,
        system=(
            "You will read your own previous response critically. "
            "Identify what's wrong, missing, or could be improved. "
            "Be harsh - you want to catch real problems."
        ),
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "Now critique your response above. What's wrong or missing?"},
        ],
    )
    critique = critique_response.content[0].text

    # Step 3: Revise
    final_response = client.messages.create(
        model=MODEL, max_tokens=800,
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": f"Here's what was wrong:\n{critique}\n\nRevise your response:"},
        ],
    )
    return final_response.content[0].text

When self-critique is worth it: When you can't afford multiple agent calls but want some quality improvement. When the task has clear correctness criteria that the model can check itself (math, logic, format compliance).

When separate critic wins: When quality is critical. When the error modes are subtle (nuanced factual errors, logical fallacies). When you have the token budget.


Constitutional Critique

Critique against a fixed set of principles. Used in Anthropic's Constitutional AI to align models, and applicable to domain-specific quality standards.

import json
import re

ENGINEERING_CONSTITUTION = [
    "Does the code handle edge cases (empty inputs, None values, large inputs)?",
    "Are error messages helpful and specific?",
    "Is the time complexity acceptable for the expected input size?",
    "Are variable names clear and self-documenting?",
    "Does the code include docstrings for public functions?",
    "Are there any obvious security issues (SQL injection, path traversal)?",
    "Is the code testable (no hidden globals, no hardcoded values)?",
]


def constitutional_critique(code: str) -> dict:
    """Critique code against a fixed set of engineering principles."""
    principles_text = "\n".join(f"{i+1}. {p}" for i, p in enumerate(ENGINEERING_CONSTITUTION))

    response = client.messages.create(
        model=MODEL, max_tokens=600,
        system=(
            "You are a code reviewer. Evaluate code against each principle. "
            "For each: PASS, FAIL, or N/A with brief explanation. "
            'Output JSON: {"principle_N": {"result": "PASS/FAIL/N/A", "note": "..."}}'
        ),
        messages=[{
            "role": "user",
            "content": f"Code:\n```python\n{code}\n```\n\nPrinciples:\n{principles_text}"
        }],
    )

    raw = response.content[0].text
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    try:
        results = json.loads(match.group()) if match else {}
    except json.JSONDecodeError:
        results = {}  # unparseable review: report no results rather than crash

    # Only well-formed PASS/FAIL entries count, so N/A does not inflate the pass rate.
    scored = {k: v for k, v in results.items()
              if isinstance(v, dict) and v.get("result") in ("PASS", "FAIL")}
    failures = {k: v for k, v in scored.items() if v["result"] == "FAIL"}
    pass_rate = (len(scored) - len(failures)) / max(len(scored), 1)
    return {"results": results, "failures": failures, "pass_rate": pass_rate}

Convergence Detection

Debate loops need stopping criteria:

import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConvergenceReason(str, Enum):
    APPROVED = "critic_approved"
    MAX_ROUNDS = "max_rounds_reached"
    CONSENSUS = "debate_consensus"
    NO_IMPROVEMENT = "no_improvement"


@dataclass
class ConvergenceResult:
    converged: bool
    reason: ConvergenceReason
    rounds: int
    final_output: str


def detect_convergence(
    critiques: list[str],
    current_draft: str,
    prev_draft: Optional[str] = None,
    max_rounds: int = 3,
) -> tuple[bool, Optional[ConvergenceReason]]:
    """
    Returns (should_stop, reason); reason is None when the loop should continue.
    Multiple stopping criteria:
    1. Critic says APPROVE
    2. Max rounds reached
    3. Critique is identical to previous round (stuck)
    4. Draft barely changed (diminishing returns)
    """
    latest_critique = critiques[-1] if critiques else ""

    # Criterion 1: Explicit approval (whole word, so "DISAPPROVE" does not match)
    if re.search(r"\bAPPROVED?\b", latest_critique[:100]):
        return True, ConvergenceReason.APPROVED

    # Criterion 2: Max rounds
    if len(critiques) >= max_rounds:
        return True, ConvergenceReason.MAX_ROUNDS

    # Criterion 3: Stuck critique (critic repeating same issues)
    if len(critiques) >= 2 and critiques[-1][:200] == critiques[-2][:200]:
        return True, ConvergenceReason.NO_IMPROVEMENT

    # Criterion 4: Draft barely changed (Jaccard similarity over word sets)
    if prev_draft and current_draft:
        current_words = set(current_draft.split())
        prev_words = set(prev_draft.split())
        overlap = len(current_words & prev_words)
        total = len(current_words | prev_words)
        similarity = overlap / max(total, 1)
        if similarity > 0.95:  # > 95% shared vocabulary = not improving
            return True, ConvergenceReason.NO_IMPROVEMENT

    return False, None
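
A worked example helps calibrate criterion 4. The sketch below recomputes the same word-set similarity on two hypothetical drafts that differ only in the market-size figure. The drafts and the `jaccard_similarity` helper are made up for illustration. The point is that a revision correcting one critical number can still score above the 0.95 threshold, so this criterion is probably best treated as a soft signal alongside the critic's verdict.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / max(len(words_a | words_b), 1)


# Hypothetical drafts: 40 distinct filler words plus one market-size sentence.
filler = " ".join(f"word{i}" for i in range(40))
prev_draft = f"The market size is $500B. {filler}"
fixed_draft = f"The market size is $50B. {filler}"

# 44 shared words out of 46 distinct words overall.
similarity = jaccard_similarity(prev_draft, fixed_draft)
print(f"similarity = {similarity:.3f}")
print("would stop as 'no improvement':", similarity > 0.95)
```

Here the only change is the tenfold correction from the opening scenario, yet the similarity check alone would report no improvement.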

Full Python Code: 3-Agent Debate System

"""
debate_system.py

A 3-agent debate system with convergence detection.
Proposer argues → Challenger critiques → Judge decides.
Multiple rounds with quality tracking.
"""

import asyncio
import json
import re
import time
from dataclasses import dataclass, field
from typing import Optional

import anthropic

async_client = anthropic.AsyncAnthropic()
MODEL = "claude-opus-4-5"


@dataclass
class DebateRound:
    round_num: int
    proposal: str
    challenge: str
    judge_feedback: str
    verdict: str  # CONTINUE / RESOLVE
    resolution: Optional[str] = None


@dataclass
class DebateResult:
    question: str
    rounds: list[DebateRound] = field(default_factory=list)
    final_answer: Optional[str] = None
    convergence_reason: str = ""
    total_duration_ms: int = 0


class ThreeAgentDebateSystem:
    """
    Proposer → Challenger → Judge cycle with convergence detection.
    Designed for analytical questions with genuine uncertainty.
    """

    def __init__(self, max_rounds: int = 3):
        self.max_rounds = max_rounds

    async def _call_agent(self, system: str, user: str, max_tokens: int = 500) -> str:
        r = await async_client.messages.create(
            model=MODEL, max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return r.content[0].text

    async def proposer(self, question: str, challenge_history: str = "") -> str:
        history = f"\n\nPrevious challenges to address:\n{challenge_history}" if challenge_history else ""
        return await self._call_agent(
            system=(
                "You are the Proposer. Your job is to advance a clear, specific, "
                "well-reasoned position on the question. Provide concrete evidence "
                "and logical arguments. Address any challenges raised against your position."
            ),
            user=f"Question: {question}{history}\n\nState/defend your position:",
        )

    async def challenger(self, question: str, proposal: str) -> str:
        return await self._call_agent(
            system=(
                "You are the Challenger. Your job is to identify weaknesses in the proposal: "
                "unsupported claims, missing evidence, alternative explanations, logical fallacies. "
                "Be specific. Point to exact statements. Offer counter-evidence where possible."
            ),
            user=f"Question: {question}\n\nProposal:\n{proposal}\n\nIdentify specific weaknesses:",
        )

    async def judge(self, question: str, debate_history: str, round_num: int) -> dict:
        prompt = (
            f"Question: {question}\n\nDebate so far:\n{debate_history}\n\n"
            f"Round {round_num} assessment: Has this debate converged to a clear answer? "
            "Output JSON: "
            '{"verdict": "CONTINUE" or "RESOLVE", '
            '"judgment": "brief assessment", '
            '"resolution": "final answer if RESOLVE else null"}'
        )
        raw = await self._call_agent(
            system=(
                "You are an impartial Judge. Assess whether the debate has produced "
                "a clear, well-supported conclusion or needs more rounds. "
                "If the core question is answered with sufficient evidence: RESOLVE. "
                "If material uncertainties remain: CONTINUE."
            ),
            user=prompt, max_tokens=400,
        )
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass  # malformed JSON: fall through and keep debating
        # Unparseable verdict: continue rather than resolve on bad output.
        return {"verdict": "CONTINUE", "judgment": raw, "resolution": None}

    async def run(self, question: str) -> DebateResult:
        result = DebateResult(question=question)
        start = time.time()

        challenge_history = ""
        all_debate_text = ""

        print(f"\n[Debate] Question: {question[:80]}...")

        for round_num in range(1, self.max_rounds + 1):
            print(f"\n[Debate] Round {round_num}/{self.max_rounds}")

            # The challenger needs the proposer's output, so the calls
            # within a round run sequentially.
            proposal = await self.proposer(question, challenge_history)
            print(f"  [Proposer] {proposal[:80]}...")

            challenge = await self.challenger(question, proposal)
            print(f"  [Challenger] {challenge[:80]}...")

            debate_entry = f"\n--- Round {round_num} ---\nProposal: {proposal}\nChallenge: {challenge}"
            all_debate_text += debate_entry
            # Only the first 300 characters of each challenge are carried forward.
            challenge_history += f"\nRound {round_num} challenge: {challenge[:300]}"

            # Judge evaluates
            judge_result = await self.judge(question, all_debate_text, round_num)
            verdict = judge_result.get("verdict", "CONTINUE")
            print(f"  [Judge] Verdict: {verdict}")

            debate_round = DebateRound(
                round_num=round_num,
                proposal=proposal,
                challenge=challenge,
                judge_feedback=judge_result.get("judgment", ""),
                verdict=verdict,
                resolution=judge_result.get("resolution"),
            )
            result.rounds.append(debate_round)

            if verdict == "RESOLVE":
                # A null resolution falls back to the latest proposal, so a
                # resolved debate is never re-synthesized as "max_rounds" below.
                result.final_answer = judge_result.get("resolution") or proposal
                result.convergence_reason = "judge_resolved"
                break

        if not result.final_answer:
            # Max rounds reached - final synthesis
            print("\n[Debate] Max rounds reached. Final synthesis...")
            result.final_answer = await self._call_agent(
                system="Synthesize the debate into the best supported final answer.",
                user=f"Question: {question}\n\nFull debate:\n{all_debate_text}\n\nBest answer:",
            )
            result.convergence_reason = "max_rounds"

        result.total_duration_ms = int((time.time() - start) * 1000)
        print(f"\n[Debate] Complete in {result.total_duration_ms}ms ({result.convergence_reason})")
        return result


# ─── Usage ────────────────────────────────────────────────────────────────────

async def main():
    system = ThreeAgentDebateSystem(max_rounds=2)

    question = (
        "Should an AI agent system use a single powerful model for all tasks, "
        "or a hierarchy of specialized smaller models? Consider cost, quality, and reliability."
    )

    result = await system.run(question)

    print(f"\n{'='*60}")
    print("DEBATE RESULT")
    print("="*60)
    print(f"Question: {result.question}")
    print(f"Rounds: {len(result.rounds)}")
    print(f"Convergence: {result.convergence_reason}")
    print(f"\nFinal Answer:\n{result.final_answer}")


if __name__ == "__main__":
    asyncio.run(main())

[Diagram: Debate/Critique Cycle]


When Critique Helps vs When It's Noise

Critique helps when:

  • Factual claims exist that can be checked (numbers, dates, attributions)
  • Logical structure matters (reasoning chains that can be wrong)
  • Format compliance is required (JSON schemas, required sections)
  • Quality bar is high enough to justify the extra call cost

Critique adds noise when:

  • The task is purely creative with no objectively better answer
  • The original output is already high quality for a low-stakes task
  • The critic doesn't have better knowledge than the generator (same model, same training)
  • Speed is critical and the critique round would miss a deadline

The empirical signal: measure quality with and without critique on your specific task. If critique doesn't improve quality by more than the extra cost, skip it.
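
One way to make that comparison concrete is sketched below. Everything in it is an assumption for illustration: the `critique_lift` helper, the 1-5 rating scale, the per-task dollar costs, and the `min_gain` threshold are hypothetical, and real evaluations would need more samples than five.

```python
from statistics import mean


def critique_lift(
    baseline: list[float],
    with_critique: list[float],
    cost_baseline: float,
    cost_with_critique: float,
    min_gain: float = 0.25,
) -> dict:
    """Compare mean quality and per-task cost with and without a critique round."""
    quality_gain = mean(with_critique) - mean(baseline)
    extra_cost = cost_with_critique - cost_baseline
    return {
        "quality_gain": quality_gain,
        "extra_cost": extra_cost,
        # min_gain is a hypothetical bar: the smallest gain worth paying for.
        "keep_critique": quality_gain >= min_gain,
    }


# Hypothetical human ratings (1-5) for the same five tasks under both pipelines.
baseline_scores = [3.0, 4.0, 3.0, 2.0, 4.0]
critique_scores = [4.0, 4.0, 4.0, 3.0, 4.0]

report = critique_lift(baseline_scores, critique_scores,
                       cost_baseline=0.02, cost_with_critique=0.05)
print(report)
```

The threshold encodes the judgment call from the paragraph above: how much quality gain justifies the extra spend depends on the task's stakes.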


Production Notes

Use a stronger model as critic: If your generator is Claude Haiku for cost efficiency, consider using Claude Sonnet or Opus as the critic. The critic needs to catch what the generator missed - a stronger model does this better.

Critique is not revision: The critic identifies problems. The generator fixes them. Don't ask the critic to also produce the fixed version - it leads to scope creep and lower quality on both dimensions.

Track critique quality over time: Does your critic catch real issues or generate spurious noise? Monitor the rate at which critique-requested revisions actually improve final quality (by measuring against ground truth or human rating).


:::warning Debate Can Entrench Wrong Answers
In multi-agent debate, if one agent's confident-sounding wrong position dominates early rounds, other agents may shift toward it. This is group convergence bias - the LLM equivalent of groupthink. Mitigate by assigning fixed adversarial roles (Advocate, Skeptic, Contrarian) that agents cannot abandon, regardless of what other agents say.
:::

:::danger Infinite Critique Loops
Without hard stopping criteria, critique loops can run indefinitely - the critic always finds something to improve, the generator always revises, the critic finds new issues. Always set max_rounds as an absolute limit. Track token spend per pipeline run. A critique loop that runs 20 rounds instead of the expected 2 will cost at least 10x the budget, and more if each round carries a growing context.
:::
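
A token budget can act as a second hard limit next to max_rounds. The sketch below is a minimal illustration, not a production guard: `TokenBudget`, `BudgetExceeded`, and `run_rounds` are hypothetical names, and the loop simulates fixed per-round usage instead of calling a model. In a real pipeline the numbers passed to `charge` would come from the usage counts reported with each API response.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a pipeline run spends more tokens than its cap."""


@dataclass
class TokenBudget:
    max_tokens: int
    spent: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one call's usage; raise once the running total passes the cap."""
        self.spent += input_tokens + output_tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(f"spent {self.spent} > cap {self.max_tokens}")


def run_rounds(budget: TokenBudget, max_rounds: int, tokens_per_round: int) -> int:
    """Simulate critique rounds; return how many finished before a limit was hit."""
    completed = 0
    for _ in range(max_rounds):
        try:
            # Stand-in for one critic call plus one generator revision.
            budget.charge(input_tokens=tokens_per_round // 2,
                          output_tokens=tokens_per_round // 2)
        except BudgetExceeded:
            break
        completed += 1
    return completed


# Even with max_rounds set too high, the budget stops the loop early.
rounds_done = run_rounds(TokenBudget(max_tokens=5_000), max_rounds=20, tokens_per_round=2_000)
print(rounds_done)
```

Because usage is only known after a call returns, this guard stops the loop one call after the cap is crossed, so the cap should leave some headroom.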


Interview Q&A

Q: Why does a separate critic agent catch more errors than self-critique?

A: When an LLM generates output, it commits to a direction through its probability distributions. Subsequent tokens reinforce earlier ones - the model "knows what it meant to say" and reads its own output charitably. A separate agent reads the output cold with no prior commitment, making it more likely to notice inconsistencies, gaps, and errors that feel natural to the generator but look odd from the outside.

Q: What is the constitutional critique pattern?

A: Constitutional critique evaluates output against a fixed set of explicit principles rather than open-ended quality criteria. The critic doesn't decide what matters - it evaluates against a predefined rubric. This makes critique more consistent, auditable, and less biased by the critic's current "mood." Used in Anthropic's Constitutional AI for alignment, and useful in production for domain-specific quality standards (code review checklists, safety criteria, format compliance).

Q: When does multi-agent debate produce worse results than a single agent?

A: When there's a clear objectively correct answer and one agent is more likely to know it than another - debate adds noise, not signal. When the task is creative with no objective quality criteria - debate generates arbitrary disagreement. When agents all have the same training and therefore the same errors - debate produces false confidence in shared misconceptions. Debate works best on genuinely uncertain analytical questions where multiple valid perspectives exist.

Q: How do you detect convergence in a debate loop?

A: Multiple signals: explicit approval from the judge ("RESOLVE"), max rounds reached (hard limit), critique becoming repetitive (the critic surfaces the same issues in consecutive rounds), or the draft barely changing between rounds (text similarity above threshold). In production, set all four as stopping conditions with the hardest limit (max_rounds) as the failsafe.

Q: What's the difference between ensemble approaches and debate?

A: Ensemble runs N independent agents on the same task and aggregates results (majority vote or synthesis). There's no interaction between agents - each answers independently. Debate has agents interact: one proposes, another challenges, which forces the proposer to defend and potentially improve their position. Ensemble is better for factual questions with computable answers (where majority vote works). Debate is better for analytical questions where the back-and-forth refines the quality of reasoning.

© 2026 EngineersOfAI. All rights reserved.