Master self-refinement, Tree of Thought, ReAct, meta-prompting, and other advanced techniques for reliable, sophisticated LLM behavior in production.

Advanced Prompting Techniques

The Code Reviewer That Reviewed Itself

The team was building an AI-powered code review system. The initial version was straightforward: send the PR diff, get back a list of issues. It worked reasonably well for obvious bugs but struggled with subtle problems - the kind that a senior engineer would catch but that required understanding code intent, not just syntax.

The lead engineer tried an experiment: after the first review pass, she added a second prompt: "Here is a code review you just wrote. Identify any issues you missed or where your analysis might be incomplete." The model looked at its own review and found three things it had missed: a race condition it hadn't considered, an edge case where the error handling was insufficient, and a performance issue that only appeared at scale.

She took it further. She built a three-pass system: first draft, critique, revision. The critique step told the model to play devil's advocate - actively argue against its own initial review. The revision incorporated the critiques. The final reviews were noticeably better - catching issues that neither a single-pass AI nor the first review alone would have found.

What she had discovered was self-refinement: the model as its own critic. This is one of a family of advanced techniques that emerge when you stop thinking of an LLM as a black box that produces one output, and start thinking of it as a reasoning system you can orchestrate.

The Landscape of Advanced Techniques

1. Self-Refinement

Self-refinement generates an output, critiques it, and revises it based on the critique. The loop repeats until a quality threshold is met or an iteration limit is reached.

The key insight: models are often better at evaluating outputs than producing perfect outputs on the first try. The evaluator and generator can be the same model - or a separate one.

import anthropic
from dataclasses import dataclass
from typing import Optional, Callable

client = anthropic.Anthropic()


@dataclass
class RefinementResult:
    final_output: str
    iterations: int
    quality_score: float
    refinement_history: list[dict]


def self_refine(
    task: str,
    context: str,
    max_iterations: int = 3,
    quality_threshold: float = 0.85,
    model: str = "claude-opus-4-6",
) -> RefinementResult:
    """
    Self-refinement loop:
    1. Generate initial output
    2. Critique the output
    3. Revise based on critique
    4. Repeat until quality threshold or max iterations
    """

    history = []
    current_output = ""

    for iteration in range(max_iterations):
        if iteration == 0:
            # Initial generation
            response = client.messages.create(
                model=model,
                max_tokens=800,
                messages=[{
                    "role": "user",
                    "content": f"""Complete this task:

Context: {context}
Task: {task}

Provide your best output."""
                }]
            )
            current_output = response.content[0].text

        # Critique step
        critique_response = client.messages.create(
            model=model,
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": f"""You are a critical reviewer. Evaluate this output:

Task: {task}
Context: {context}

Output to evaluate:
{current_output}

Be a tough critic. Identify:
1. What is wrong, incomplete, or could be improved
2. What is correct and should be kept
3. Specific suggestions for improvement

Also provide a quality score 0.0-1.0.
Format:
SCORE: X.XX
ISSUES:
- [issue 1]
SUGGESTIONS:
- [suggestion 1]
KEEP:
- [what's good]"""
            }]
        )

        critique = critique_response.content[0].text

        # Parse quality score
        score = 0.5  # default
        for line in critique.split('\n'):
            if line.startswith('SCORE:'):
                try:
                    score = float(line.replace('SCORE:', '').strip())
                except ValueError:
                    pass
                break

        history.append({
            "iteration": iteration,
            "output": current_output,
            "critique": critique,
            "score": score,
        })

        if score >= quality_threshold:
            break  # Good enough, stop refining

        if iteration < max_iterations - 1:
            # Revision step
            revision_response = client.messages.create(
                model=model,
                max_tokens=800,
                messages=[{
                    "role": "user",
                    "content": f"""Improve this output based on the critique:

Original task: {task}
Context: {context}

Previous output:
{current_output}

Critique:
{critique}

Write an improved version that addresses all the issues raised."""
                }]
            )
            current_output = revision_response.content[0].text

    final_score = history[-1]["score"] if history else 0.5

    return RefinementResult(
        final_output=current_output,
        iterations=len(history),
        quality_score=final_score,
        refinement_history=history,
    )


# Specialized: code review self-refinement
def refine_code_review(
    code: str,
    language: str = "Python",
    iterations: int = 2,
) -> str:
    """
    Generate a code review, then have the model critique its own review
    and add anything it missed.
    """

    # First pass review
    review_1 = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=800,
        system=f"You are a senior {language} engineer. Conduct thorough code reviews.",
        messages=[{
            "role": "user",
            "content": f"Review this {language} code:\n\n```{language.lower()}\n{code}\n```"
        }]
    )
    first_review = review_1.content[0].text

    for _ in range(iterations):
        # Adversarial critique of the review itself
        critique = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=400,
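            messages=[{
                "role": "user",
                # Include the code itself: the critic cannot find missed
                # issues in code it has not been shown.
                "content": f"""You wrote this review of the {language} code below.

Code under review:
{code}

Your review:
{first_review}

Play devil's advocate. What security issues, race conditions, edge cases, or performance problems did you miss? Be harsh - assume there's something important you overlooked."""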
            }]
        )
        missed = critique.content[0].text

        # Incorporate into final review
        final = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""Combine your initial review with the additional issues you identified:

Initial review:
{first_review}

Additional issues found on reflection:
{missed}

Write the complete, final code review incorporating all findings."""
            }]
        )
        first_review = final.content[0].text

    return first_review
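
The score parsing in `self_refine` only matches a line that starts exactly with `SCORE:`, so a reply such as `**Score:** 0.9` silently falls back to the default. A more tolerant parser could look like the sketch below. The helper name `parse_score` and the choice to reject out-of-range values (for example `8/10`) instead of clamping them are assumptions, not part of the original loop.

```python
import re

# Matches "SCORE: 0.85", "score = 1", "**Score:** 0.9", and similar.
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*\**\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)


def parse_score(critique: str, default: float = 0.5) -> float:
    """Return the first score found in a critique, or `default`.

    Values outside [0.0, 1.0] are treated as unparseable rather than
    clamped, so "8/10" does not get mistaken for a perfect score.
    """
    match = SCORE_PATTERN.search(critique)
    if match is None:
        return default
    value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else default
```

Returning the default for out-of-range values keeps a badly formatted critique from triggering the early-stop branch by accident.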

2. Tree of Thought (ToT)

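Tree of Thought extends Chain-of-Thought by exploring several reasoning paths in parallel, scoring how promising each one is, and abandoning paths that score poorly. It's a deliberate search over the reasoning space. The implementation below uses beam search: at each depth it keeps only the top-scoring paths rather than backtracking explicitly.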

import anthropic
from dataclasses import dataclass, field
from typing import Optional
import asyncio

client = anthropic.AsyncAnthropic()


@dataclass
class ThoughtNode:
    thought: str
    depth: int
    score: float              # 0.0 to 1.0 - how promising is this path?
    parent: Optional['ThoughtNode'] = None
    children: list['ThoughtNode'] = field(default_factory=list)
    is_solution: bool = False
    solution_text: str = ""


async def generate_thoughts(
    problem: str,
    context: str,
    n_thoughts: int = 3,
    depth: int = 0,
    model: str = "claude-opus-4-6",
) -> list[str]:
    """Generate N candidate next thoughts for a problem."""
    response = await client.messages.create(
        model=model,
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Problem: {problem}
Context: {context}

Generate {n_thoughts} distinct approaches or next steps to explore.
Think like a skilled problem solver considering multiple strategies.
Number each approach 1-{n_thoughts} and be concrete.

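{chr(10).join(f'Approach {i}:' for i in range(1, n_thoughts + 1))}"""
        }]
    )
    text = response.content[0].text
    # Parse numbered approaches. Lines that do not start a new approach are
    # appended to the previous one, so multi-line answers are not truncated.
    prefixes = [p for i in range(1, n_thoughts + 1) for p in (f"Approach {i}:", f"{i}.")]
    approaches: list[str] = []
    for line in text.split('\n'):
        stripped = line.strip()
        prefix = next((p for p in prefixes if stripped.startswith(p)), None)
        if prefix is not None:
            approaches.append(stripped[len(prefix):].strip())
        elif approaches and stripped:
            approaches[-1] = f"{approaches[-1]} {stripped}".strip()
    approaches = [a for a in approaches if a]
    return approaches[:n_thoughts] if approaches else [text]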


async def evaluate_thought(
    problem: str,
    thought: str,
    model: str = "claude-haiku-4-5-20251001",  # Use cheaper model for evaluation
) -> float:
    """Evaluate how promising a thought/approach is. Returns 0.0-1.0."""
    response = await client.messages.create(
        model=model,
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Problem: {problem}
Proposed approach: {thought}

Rate how promising this approach is on a scale of 0.0 to 1.0.
Consider: Does it make progress toward solving the problem? Is it feasible? Is it on the right track?

Reply with just the score and a brief reason.
Score: X.X
Reason: [one sentence]"""
        }]
    )
    text = response.content[0].text
    for line in text.split('\n'):
        if 'Score:' in line or 'score:' in line:
            try:
                parts = line.split(':')
                return float(parts[1].strip().split()[0])
            except (ValueError, IndexError):
                pass
    return 0.5  # Default if parsing fails


async def tree_of_thought(
    problem: str,
    max_depth: int = 3,
    n_thoughts_per_step: int = 3,
    beam_width: int = 2,    # How many best paths to keep at each level
    model: str = "claude-opus-4-6",
) -> dict:
    """
    Beam search through thought space.

    Args:
        problem: The problem to solve
        max_depth: Maximum depth of thought tree
        n_thoughts_per_step: Number of thoughts to generate at each node
        beam_width: Top-k paths to explore at each depth level
        model: Model to use for thought generation

    Returns:
        Best solution found with its reasoning path
    """

    # Initialize with root thoughts
    root_thoughts = await generate_thoughts(problem, "Starting fresh", n_thoughts_per_step, model=model)

    # Score all root thoughts in parallel
    scores = await asyncio.gather(*[evaluate_thought(problem, t, "claude-haiku-4-5-20251001") for t in root_thoughts])

    # Create nodes
    nodes = [ThoughtNode(thought=t, depth=0, score=s) for t, s in zip(root_thoughts, scores)]

    # Beam search
    current_beam = sorted(nodes, key=lambda n: n.score, reverse=True)[:beam_width]

    for depth in range(1, max_depth + 1):
        next_beam_candidates = []

        for node in current_beam:
            # Generate child thoughts
            context = f"Previous step: {node.thought}\nBuilding on this, what are the next steps?"
            child_thoughts = await generate_thoughts(problem, context, n_thoughts_per_step, depth, model)

            # Score children in parallel
            child_scores = await asyncio.gather(*[
                evaluate_thought(f"{problem}\nSo far: {node.thought}", ct)
                for ct in child_thoughts
            ])

            for ct, cs in zip(child_thoughts, child_scores):
                child_node = ThoughtNode(
                    thought=ct,
                    depth=depth,
                    score=cs,
                    parent=node
                )
                node.children.append(child_node)
                next_beam_candidates.append(child_node)

        # Stop early if no children were generated; keep the current beam
        if not next_beam_candidates:
            break

        # Keep top beam_width
        current_beam = sorted(next_beam_candidates, key=lambda n: n.score, reverse=True)[:beam_width]

    # Take the best node from final beam, reconstruct path
    best = current_beam[0] if current_beam else nodes[0]

    # Reconstruct reasoning path
    path = []
    node = best
    while node:
        path.insert(0, node.thought)
        node = node.parent

    # Generate final answer from best path
    final_response = await client.messages.create(
        model=model,
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": f"""Problem: {problem}

I explored multiple approaches and found this reasoning path to be most promising:

{chr(10).join(f'Step {i+1}: {t}' for i, t in enumerate(path))}

Based on this reasoning path, provide the final, complete solution."""
        }]
    )

    return {
        "solution": final_response.content[0].text,
        "reasoning_path": path,
        "best_score": best.score,
        "depth_explored": max_depth,
    }
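
The beam-search control flow is easier to follow without the API calls. The sketch below is a synchronous toy version in which `expand` and `score` are plain functions standing in for `generate_thoughts` and `evaluate_thought`. The `Node` class, the `beam_search` name, and the digit-sum toy problem are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    thought: str
    score: float
    parent: Optional["Node"] = None


def beam_search(
    roots: list[str],
    expand: Callable[[str], list[str]],
    score: Callable[[str], float],
    max_depth: int,
    beam_width: int,
) -> list[str]:
    """Return the thoughts along the best path found by beam search."""
    beam = sorted((Node(t, score(t)) for t in roots),
                  key=lambda n: n.score, reverse=True)[:beam_width]
    if not beam:
        return []
    for _ in range(max_depth):
        candidates = [
            Node(child, score(child), parent=node)
            for node in beam
            for child in expand(node.thought)
        ]
        if not candidates:
            break  # nothing left to expand; keep the current beam
        # Pruning step: only the top `beam_width` candidates survive.
        beam = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam_width]

    # Walk parent links back to the root to recover the reasoning path.
    path = []
    node: Optional[Node] = beam[0]
    while node is not None:
        path.append(node.thought)
        node = node.parent
    return path[::-1]


# Toy problem: build a digit string whose digits sum to 7.
def expand(thought: str) -> list[str]:
    return [thought + digit for digit in "123"]


def score(thought: str) -> float:
    return 1 - abs(7 - sum(int(c) for c in thought)) / 7


path = beam_search(["1", "2", "3"], expand, score, max_depth=2, beam_width=2)
print(path)  # ['3', '33', '331']
```

With `beam_width=2`, the branch starting at `"1"` is dropped after the first level and never expanded, which is where the cost savings over exhaustive search come from.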

3. ReAct: Reasoning + Acting

ReAct interleaves reasoning (thought) with action (tool call) steps. For every action, the model first reasons about what to do and why, then acts, then observes the result and reasons again.

import anthropic
from typing import Callable, Any
from dataclasses import dataclass

client = anthropic.Anthropic()


@dataclass
class Tool:
    name: str
    description: str
    func: Callable[..., Any]


def react_agent(
    question: str,
    tools: list[Tool],
    max_steps: int = 8,
    model: str = "claude-opus-4-6",
) -> dict:
    """
    ReAct agent: think → act → observe → think → act → ...

    The model explicitly reasons about what to do before each action,
    and incorporates observations into subsequent reasoning.
    """

    tool_descriptions = "\n".join([
        f"- {t.name}: {t.description}"
        for t in tools
    ])
    tool_map = {t.name: t for t in tools}

    system = f"""You are a research assistant with access to tools.

Available tools:
{tool_descriptions}

For each step, follow this format exactly:
Thought: [reason about what to do next]
Action: [tool_name]
Action Input: [input to the tool]

After seeing the observation, continue with another Thought/Action pair.
When you have enough information to answer, use:
Thought: I now have enough information to answer.
Answer: [final answer]"""

    messages = [{"role": "user", "content": question}]
    steps = []
    current_text = ""

    for step in range(max_steps):
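        # Get next thought/action. Stop before "Observation:" so the model
        # cannot invent a tool result; the real one is appended below.
        response = client.messages.create(
            model=model,
            max_tokens=400,
            system=system,
            stop_sequences=["Observation:"],
            messages=messages + (
                [{"role": "assistant", "content": current_text}] if current_text else []
            )
        )
        text = response.content[0].text if response.content else ""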

        # Check if we have a final answer
        if "Answer:" in text:
            answer_start = text.index("Answer:") + len("Answer:")
            final_answer = text[answer_start:].strip()
            steps.append({"step": step, "type": "answer", "content": text})
            return {
                "answer": final_answer,
                "steps": steps,
                "iterations": step + 1,
            }

        # Parse thought and action
        thought = ""
        action = ""
        action_input = ""

        for line in text.split('\n'):
            if line.startswith("Thought:"):
                thought = line.replace("Thought:", "").strip()
            elif line.startswith("Action:"):
                action = line.replace("Action:", "").strip()
            elif line.startswith("Action Input:"):
                action_input = line.replace("Action Input:", "").strip()

        if not action or action not in tool_map:
            # Model didn't follow format - try to continue
            current_text = (current_text + "\n" + text).strip()
            continue

        # Execute the tool
        tool = tool_map[action]
        try:
            observation = tool.func(action_input)
        except Exception as e:
            observation = f"Error executing {action}: {str(e)}"

        step_record = {
            "step": step,
            "type": "react_step",
            "thought": thought,
            "action": action,
            "action_input": action_input,
            "observation": str(observation)[:500],  # Truncate long observations
        }
        steps.append(step_record)

        # Add to conversation history
        step_text = f"Thought: {thought}\nAction: {action}\nAction Input: {action_input}"
        observation_text = f"Observation: {observation}"

        messages.append({"role": "assistant", "content": step_text})
        messages.append({"role": "user", "content": observation_text})
        current_text = ""

    return {
        "answer": "Max steps reached without final answer",
        "steps": steps,
        "iterations": max_steps,
    }


# Example tools for a research agent
import json

def search_web(query: str) -> str:
    """Mock web search."""
    return f"Search results for '{query}': [Result 1] Latest data shows... [Result 2] According to..."

def read_document(url: str) -> str:
    """Mock document reader."""
    return f"Document at {url} contains: [Full text of document...]"

def calculate(expression: str) -> str:
    """Safe math calculator."""
    try:
        # Only allow basic math characters, and reject "**" because large
        # exponents (e.g. 9**9**9) can hang the process
        allowed = set('0123456789+-*/().% ')
        if all(c in allowed for c in expression) and '**' not in expression:
            result = eval(expression)  # noqa: S307
            return str(result)
        return "Invalid expression - only basic math allowed"
    except Exception as e:
        return f"Calculation error: {e}"


research_tools = [
    Tool("search_web", "Search the internet for current information", search_web),
    Tool("read_document", "Read the full content of a URL", read_document),
    Tool("calculate", "Perform mathematical calculations", calculate),
]

# Run the agent
result = react_agent(
    question="What is 15% of the current inflation rate in the US multiplied by the 2024 GDP?",
    tools=research_tools,
)
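
The line-by-line parsing in `react_agent` drops multi-line tool inputs and accepts an `Answer:` even when the same response also requests an action. A stricter parser might look like the sketch below. `ReactStep` and `parse_react_step` are hypothetical names, and the rule that an answer is ignored whenever an action is present is a design assumption rather than part of the ReAct format.

```python
import re
from typing import NamedTuple, Optional


class ReactStep(NamedTuple):
    thought: str
    action: Optional[str]
    action_input: str
    answer: Optional[str]


# Each field runs until the next field header at the start of a line.
_FIELD = re.compile(
    r"^(Thought|Action Input|Action|Answer):[ \t]*(.*?)"
    r"(?=^(?:Thought|Action Input|Action|Answer|Observation):|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_react_step(text: str) -> ReactStep:
    fields: dict[str, str] = {}
    for name, value in _FIELD.findall(text):
        # Keep the first occurrence of each field; later ones usually
        # belong to steps the model imagined ahead of time.
        fields.setdefault(name, value.strip())

    action = fields.get("Action")
    # An Answer written alongside an Action predates any real observation,
    # so it is treated as speculative and discarded.
    answer = fields.get("Answer") if action is None else None
    return ReactStep(
        thought=fields.get("Thought", ""),
        action=action,
        action_input=fields.get("Action Input", ""),
        answer=answer,
    )
```

In the agent loop, a step with `action is None` and `answer is None` would correspond to the "model didn't follow format" branch.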

4. Prompt Chaining

Prompt chaining breaks a complex task into sequential sub-tasks, where the output of each step feeds into the next. Unlike ReAct (which is iterative based on observations), chaining has a fixed pipeline structure.

import anthropic
from typing import Any, Callable, Optional

client = anthropic.Anthropic()


class PromptChain:
    """
    A pipeline of prompts where each step's output feeds into the next.
    """

    def __init__(self, model: str = "claude-opus-4-6"):
        self.steps: list[dict] = []
        self.model = model
        self.client = anthropic.Anthropic()

    def add_step(
        self,
        name: str,
        system: str,
        user_template: str,     # {previous_output} is replaced with last step's output
        max_tokens: int = 600,
        model: Optional[str] = None,
        transform: Optional[Callable[[str], str]] = None,  # Post-process the output
    ) -> 'PromptChain':
        self.steps.append({
            "name": name,
            "system": system,
            "user_template": user_template,
            "max_tokens": max_tokens,
            "model": model or self.model,
            "transform": transform,
        })
        return self

    def run(self, initial_input: str) -> dict:
        """Execute all steps in sequence."""
        outputs = {"initial": initial_input}
        current = initial_input

        for step in self.steps:
            user_message = step["user_template"].replace("{previous_output}", current)
            user_message = user_message.replace("{initial_input}", initial_input)

            # Allow access to all previous outputs
            for name, output in outputs.items():
                user_message = user_message.replace(f"{{{name}}}", output)

            response = self.client.messages.create(
                model=step["model"],
                max_tokens=step["max_tokens"],
                system=step["system"],
                messages=[{"role": "user", "content": user_message}]
            )
            output = response.content[0].text

            # Apply transform if specified
            if step["transform"]:
                output = step["transform"](output)

            outputs[step["name"]] = output
            current = output

        return outputs


# Example: blog post generation pipeline
blog_chain = (
    PromptChain(model="claude-opus-4-6")
    .add_step(
        name="outline",
        system="You are an expert technical writer. Create detailed outlines.",
        user_template="Create a detailed 5-section outline for a blog post about: {initial_input}\n\nInclude key points for each section.",
        max_tokens=400,
    )
    .add_step(
        name="expanded_outline",
        system="You are a technical writer who adds depth and examples.",
        user_template="""Expand this outline with specific examples, code snippets to include, and sub-points:

Outline:
{outline}

Original topic: {initial_input}""",
        max_tokens=600,
    )
    .add_step(
        name="draft",
        system="You are a senior technical writer creating comprehensive blog content.",
        user_template="""Write a comprehensive blog post based on this expanded outline.
Include code examples, diagrams descriptions, and practical takeaways.

Expanded outline:
{expanded_outline}""",
        max_tokens=1500,
    )
    .add_step(
        name="final",
        system="You are an editor who improves clarity and flow.",
        user_template="""Edit this draft for clarity, flow, and impact.
Fix any technical inaccuracies and improve transitions.

Draft:
{draft}""",
        max_tokens=1500,
        model="claude-haiku-4-5-20251001",  # Use cheaper model for final editing
    )
)

result = blog_chain.run("Implementing rate limiting in distributed systems")
final_post = result["final"]
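
One subtlety in `PromptChain.run`: placeholders are substituted with repeated `str.replace` calls, so if an earlier step's output happens to contain text such as `{draft}`, a later replacement will expand it too. A single-pass substitution avoids that. The helper below is a sketch; `fill_template` is a hypothetical name, not a method of the class above.

```python
import re

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace {name} placeholders in a single pass.

    Substituted text is never rescanned, and placeholders with no
    matching key are left untouched.
    """
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


outputs = {"outline": "Use {draft} wisely", "draft": "SECRET"}
print(fill_template("Outline:\n{outline}", outputs))
```

Here the literal `{draft}` inside the outline survives intact, whereas sequential `replace` calls would have turned it into `SECRET`.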

5. Meta-Prompting

Meta-prompting uses the LLM to generate prompts for sub-tasks. Instead of writing a prompt for every task variation, you write a prompt that generates prompts.

import anthropic
import json

client = anthropic.Anthropic()


def meta_prompt(
    task_description: str,
    context: str,
    model: str = "claude-opus-4-6",
) -> str:
    """
    Generate an optimal prompt for a specific task.
    The model acts as a prompt engineer.
    """
    response = client.messages.create(
        model=model,
        max_tokens=600,
        system="""You are an expert prompt engineer. Given a task description,
generate an optimal system prompt for Claude to perform that task.

The generated prompt should:
- Be clear and specific about the task
- Include format instructions
- Include relevant constraints
- Use XML tags for structure if needed
- Be between 100-300 words""",
        messages=[{
            "role": "user",
            "content": f"""Generate an optimal system prompt for this task:

Task: {task_description}
Context: {context}

Generate just the system prompt, nothing else."""
        }]
    )
    return response.content[0].text


def adaptive_pipeline(
    task: str,
    input_data: str,
    model: str = "claude-opus-4-6",
) -> dict:
    """
    Pipeline that generates its own prompt for the task, then executes it.
    Useful when task types are diverse and hard to write fixed prompts for.
    """

    # Step 1: Generate task-specific prompt
    task_prompt = meta_prompt(task, input_data[:200], model)

    # Step 2: Execute with generated prompt
    response = client.messages.create(
        model=model,
        max_tokens=800,
        system=task_prompt,
        messages=[{"role": "user", "content": input_data}]
    )
    output = response.content[0].text

    # Step 3: Self-evaluate the output
    eval_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Task: {task}
Generated prompt: {task_prompt[:200]}...
Output: {output[:400]}...

Did the output successfully complete the task? Score 0.0-1.0.
Score: """
        }]
    )
    score_text = eval_response.content[0].text.strip()
    try:
        score = float(score_text.split()[0])
    except (ValueError, IndexError):
        score = 0.7

    return {
        "generated_prompt": task_prompt,
        "output": output,
        "quality_score": score,
    }
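
Because `adaptive_pipeline` calls `meta_prompt` on every request, the generated prompt is paid for each time even when the task type repeats. One possible mitigation, sketched below under the assumption that prompts can be reused per task type, is to memoize them. `PromptCache` is a hypothetical helper and the generator is passed in, so the sketch runs without an API key.

```python
from typing import Callable


class PromptCache:
    """Memoize generated system prompts by normalized task type."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate
        self._prompts: dict[str, str] = {}

    def get(self, task_type: str) -> str:
        # Normalize case and whitespace so trivial variations share an entry.
        key = " ".join(task_type.lower().split())
        if key not in self._prompts:
            self._prompts[key] = self._generate(task_type)
        return self._prompts[key]


# Stand-in for a real meta-prompting call.
calls: list[str] = []


def fake_meta_prompt(task_type: str) -> str:
    calls.append(task_type)
    return f"You are an expert at: {task_type}"


cache = PromptCache(fake_meta_prompt)
first = cache.get("Summarize legal contracts")
second = cache.get("summarize  legal   contracts")
```

Both lookups return the same prompt and the generator runs once. Whether reuse is appropriate depends on how much the generated prompt relies on the per-request context passed to `meta_prompt`.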

6. Adversarial Critic Pattern

Use a separate model instance as a critic to find weaknesses in the primary model's output.

import anthropic
from dataclasses import dataclass

client = anthropic.Anthropic()


@dataclass
class CritiquedOutput:
    initial_output: str
    critique: str
    issues_found: list[str]
    revised_output: str
    confidence: float


def adversarial_critic(
    task: str,
    initial_output: str,
    critic_persona: str = "a skeptical expert who finds flaws",
    model_primary: str = "claude-opus-4-6",
    model_critic: str = "claude-opus-4-6",
) -> CritiquedOutput:
    """
    Two-model critique pattern:
    1. The caller supplies the primary model's output as initial_output
    2. Critic model (with adversarial persona) finds flaws
    3. Primary model revises based on critique

    Using the same model for both works - different system prompts create
    different "perspectives" even from the same underlying model.
    """

    # Step 2: Adversarial critique
    critique_response = client.messages.create(
        model=model_critic,
        max_tokens=400,
        system=f"""You are {critic_persona}.
Your job is to find every flaw, error, gap, and weakness in outputs you review.
Be harsh but fair. Find real issues, not nitpicks.""",
        messages=[{
            "role": "user",
            "content": f"""Task: {task}

This output was produced for the task above:
{initial_output}

Critique it harshly. What is wrong, misleading, incomplete, or problematic?
List specific issues."""
        }]
    )
    critique = critique_response.content[0].text

    # Parse issues: collect bulleted lines, tolerating leading indentation
    issues = [
        stripped.lstrip('-•').strip()
        for stripped in (line.strip() for line in critique.split('\n'))
        if stripped.startswith(('-', '•'))
    ]

    # Step 3: Revision
    revision_response = client.messages.create(
        model=model_primary,
        max_tokens=800,
        messages=[{
            "role": "user",
            "content": f"""Task: {task}

Your initial output:
{initial_output}

A critic found these issues:
{critique}

Write a revised output that addresses all valid critiques."""
        }]
    )
    revised = revision_response.content[0].text

    # Quick confidence estimate
    conf_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"Rate 0.0-1.0 how well the revision addresses the critique:\nCritique: {critique[:200]}\nRevision: {revised[:200]}\nScore:"
        }]
    )
    try:
        confidence = float(conf_response.content[0].text.strip().split()[0])
    except (ValueError, IndexError):
        confidence = 0.8

    return CritiquedOutput(
        initial_output=initial_output,
        critique=critique,
        issues_found=issues,
        revised_output=revised,
        confidence=confidence,
    )
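
Critics do not always answer with `-` bullets; numbered lists and indented sub-bullets are common. A regex-based extractor covers those cases. This is a sketch: `extract_issues` is a hypothetical helper, and which markers count as list items is an assumption.

```python
import re

# A list item: optional indentation, then "-", "*", "•", "1." or "1)",
# then whitespace and the item text.
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*\S)")


def extract_issues(critique: str) -> list[str]:
    """Return the text of each bulleted or numbered line in a critique."""
    issues = []
    for line in critique.splitlines():
        match = _BULLET.match(line)
        if match:
            issues.append(match.group(1))
    return issues


critique = (
    "Issues:\n"
    "- Missing null check\n"
    "  • Race condition in cache\n"
    "1. No timeout\n"
    "2) Unbounded retry\n"
    "Overall weak."
)
issues = extract_issues(critique)
```

Lines without a list marker, such as the `Issues:` header and the closing sentence, are skipped.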

When to Use Each Technique

| Technique | Latency | Token Cost | Best For | Avoid When |
|---|---|---|---|---|
| Self-Refinement | 2-3x | 3-5x | Writing, analysis, code | Simple tasks, hard to evaluate |
| Tree of Thought | 5-10x | 10-20x | Planning, puzzles, search | Most production tasks |
| ReAct | Variable | Variable | Agents with tools | Pure generation tasks |
| Meta-Prompting | 1.5x | 2x | Diverse task types | Stable, fixed task types |
| Prompt Chaining | 2-4x | 2-4x | Multi-stage pipelines | Simple one-shot tasks |
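
The token-cost multipliers in the table can be turned into a rough per-request estimate. The sketch below copies the ranges from the table; the dictionary keys, the function name, and the idea of scaling a measured single-pass cost are assumptions, and ReAct is omitted because its cost is listed as variable.

```python
# (low, high) token-cost multipliers relative to one single-pass call,
# copied from the table above.
TOKEN_COST_MULTIPLIER: dict[str, tuple[float, float]] = {
    "self_refinement": (3.0, 5.0),
    "tree_of_thought": (10.0, 20.0),
    "meta_prompting": (2.0, 2.0),
    "prompt_chaining": (2.0, 4.0),
}


def estimated_cost_range(technique: str, single_pass_cost_usd: float) -> tuple[float, float]:
    """Scale a measured single-pass cost by the technique's multiplier range."""
    low, high = TOKEN_COST_MULTIPLIER[technique]
    return (single_pass_cost_usd * low, single_pass_cost_usd * high)


low, high = estimated_cost_range("tree_of_thought", 0.01)
```

For a request that costs about one cent single-pass, Tree of Thought lands at roughly ten to twenty cents. These are planning figures only; measured usage from the API response is the number to trust.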

Production Considerations

import anthropic
import asyncio
from typing import Callable

client = anthropic.AsyncAnthropic()


class ProductionRefinementPipeline:
    """
    Production-grade self-refinement with:
    - Async for parallelism
    - Cost budgets per request
    - Early stopping on quality threshold
    - Fallback to single-pass on timeout
    """

    def __init__(
        self,
        model: str = "claude-opus-4-6",
        max_iterations: int = 2,
        quality_threshold: float = 0.8,
        max_cost_usd: float = 0.10,     # Cost budget per request
        timeout_seconds: float = 30.0,
    ):
        self.model = model
        self.max_iterations = max_iterations
        self.quality_threshold = quality_threshold
        self.max_cost_usd = max_cost_usd
        self.timeout_seconds = timeout_seconds
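        # Example per-1K-token rates (Sonnet pricing). These do not match the
        # default model above; set them to the rates of the model you use.
        self.cost_per_1k_input = 0.003
        self.cost_per_1k_output = 0.015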

    async def run(self, task: str, context: str) -> dict:
        """Run refinement with production safeguards."""
        try:
            return await asyncio.wait_for(
                self._run_refinement(task, context),
                timeout=self.timeout_seconds
            )
        except asyncio.TimeoutError:
            # Fallback: single pass
            response = await client.messages.create(
                model=self.model,
                max_tokens=600,
                messages=[{"role": "user", "content": f"{task}\n\nContext: {context}"}]
            )
            return {
                "output": response.content[0].text,
                "iterations": 1,
                "quality_score": None,
                "fallback": True,
                "reason": "timeout"
            }

    async def _run_refinement(self, task: str, context: str) -> dict:
        total_cost = 0.0
        current_output = ""
        quality_score = 0.0

        for i in range(self.max_iterations + 1):
            if i == 0:
                # Initial generation
                resp = await client.messages.create(
                    model=self.model,
                    max_tokens=600,
                    messages=[{"role": "user", "content": f"Task: {task}\nContext: {context}"}]
                )
            else:
                # Refinement
                resp = await client.messages.create(
                    model=self.model,
                    max_tokens=600,
                    messages=[{
                        "role": "user",
                        "content": f"Improve this output for the task: {task}\n\nCurrent output:\n{current_output}\n\nWhat is wrong or could be improved? Provide the improved version."
                    }]
                )

            # Track cost
            cost = (
                resp.usage.input_tokens / 1000 * self.cost_per_1k_input +
                resp.usage.output_tokens / 1000 * self.cost_per_1k_output
            )
            total_cost += cost
            current_output = resp.content[0].text

            # Budget check: stop refining once 80% of the budget is spent
            if i > 0 and total_cost > self.max_cost_usd * 0.8:
                break

            # Placeholder score, not a real evaluation: it simply rises with
            # each iteration. Replace with a model- or metric-based scorer.
            quality_score = min(0.7 + i * 0.1, 0.95)
            if quality_score >= self.quality_threshold:
                break

        return {
            "output": current_output,
            "iterations": i + 1,
            "quality_score": quality_score,
            "total_cost_usd": total_cost,
            "fallback": False,
        }

Common Mistakes

:::danger Using Advanced Techniques for Simple Tasks
Self-refinement, ToT, and ReAct multiply your token costs by 2-10x. For simple classification or formatting tasks, they add cost with no benefit. Match technique complexity to task complexity. Start with the simplest approach that works.
:::

:::danger Infinite Refinement Loops
Without iteration limits and quality thresholds, self-refinement can loop indefinitely - oscillating between two imperfect versions without converging. Always set max_iterations and use an early stopping condition based on quality score improvement between iterations.
:::
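One way to express that stopping rule is a small helper that looks at the history of quality scores. This is a minimal sketch: the function name `should_stop` and the default values for `max_iterations`, `min_gain`, and `threshold` are illustrative assumptions, not values taken from the lesson's code.

```python
from typing import List


def should_stop(
    scores: List[float],
    max_iterations: int = 3,   # assumed hard cap, tune per task
    min_gain: float = 0.02,    # assumed minimum improvement worth another pass
    threshold: float = 0.9,    # assumed "good enough" score
) -> bool:
    """Decide whether a refinement loop should stop.

    `scores` holds one quality score (0.0 to 1.0) per completed iteration.
    """
    if not scores:
        return False  # nothing generated yet
    if len(scores) >= max_iterations:
        return True   # hard cap: guarantees the loop terminates
    if scores[-1] >= threshold:
        return True   # output is already good enough
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return True   # plateau, regression, or oscillation: no real gain
    return False
```

The improvement check also catches the oscillation case, because a revision that scores lower than its predecessor has a negative gain.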

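:::warning Tree of Thought Has Exponential Cost
With branching factor B and depth D, an unpruned ToT search makes O(B^D) model calls. With B=3 and D=3, that is 27 candidate paths at the final level alone, and 3 + 9 + 27 = 39 generation calls if every branch is expanded, before counting evaluation calls. Pruning to a few branches per level reduces the total, but the cost still climbs quickly. This is only appropriate for tasks where the cost of a wrong answer exceeds the cost of exhaustive search - strategic planning, critical decisions, mathematical proofs.
:::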
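To make the arithmetic concrete, the sketch below counts generation calls for a fully expanded tree and for a tree pruned to a fixed beam width. Both function names are hypothetical, and evaluation calls are deliberately left out, so real totals will be higher.

```python
def tot_generation_calls(branching: int, depth: int) -> int:
    """Nodes generated when every branch is expanded: B + B^2 + ... + B^D."""
    return sum(branching ** level for level in range(1, depth + 1))


def tot_beam_generation_calls(branching: int, depth: int, beam_width: int) -> int:
    """Nodes generated when only the best `beam_width` nodes survive each level."""
    frontier = 1  # start from the single root problem
    total = 0
    for _ in range(depth):
        generated = frontier * branching  # expand every surviving node
        total += generated
        frontier = min(generated, beam_width)  # prune to the beam
    return total


print(tot_generation_calls(3, 3))          # 39
print(tot_beam_generation_calls(3, 3, 2))  # 15
```

With B=3 and D=3, keeping only the two best nodes per level cuts generation calls from 39 to 15, which is why pruned variants are the only ones worth considering outside offline use.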

:::tip Self-Refinement Works Best When Evaluation Is Clear
Self-refinement is most effective when the model can reliably evaluate its own outputs. Code review, technical writing, mathematical reasoning - these have clear correctness criteria. Open-ended creative tasks where "better" is subjective benefit less. The critique step needs to identify real improvements, not just stylistic differences.
:::

:::tip Chain Cheap Models for Evaluation
In a self-refinement loop, generation needs a powerful model (Opus, Sonnet) but evaluation/critique can often use a cheaper model (Haiku). The evaluation task - "is this output good?" - is simpler than the generation task. Split model usage by task complexity to control costs.
:::
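A rough sketch of that split is shown below. It assumes a caller-supplied `call_model(model_name, prompt)` function, and the model names `"strong-model"` and `"cheap-model"` are placeholders rather than real model identifiers. The function name `refine_with_cheap_critic` is also hypothetical.

```python
from typing import Callable

# Assumed signature: (model_name, prompt) -> generated text
ModelCall = Callable[[str, str], str]


def refine_with_cheap_critic(
    task: str,
    call_model: ModelCall,
    generator_model: str = "strong-model",  # placeholder name
    critic_model: str = "cheap-model",      # placeholder name
    max_rounds: int = 2,
) -> str:
    """Generate with the strong model, critique with the cheap one."""
    draft = call_model(generator_model, f"Task: {task}")
    for _ in range(max_rounds):
        critique = call_model(
            critic_model,
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            "List concrete problems, or reply OK if there are none.",
        )
        if critique.strip().upper() == "OK":
            break  # critic found nothing worth fixing
        draft = call_model(
            generator_model,
            f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the draft to address the critique.",
        )
    return draft
```

Passing the model call in as a parameter keeps the loop testable with a stub and makes it easy to measure whether the cheaper critic actually holds up on your task.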

Interview Q&A

Q: What is self-refinement and when should you use it?

A: Self-refinement is an iterative loop where a model generates an output, then critiques that output, then revises it based on the critique. The key insight is that language models are often better at identifying flaws in an existing output than at producing a perfect output on the first attempt, because critique works from a concrete draft rather than a blank page. Use it when: the task has clear quality criteria that can be expressed in a prompt (correctness, completeness, tone), the output quality is measurably better after critique (verified experimentally), and the cost of additional iterations is worth the quality gain. Don't use it for simple tasks where a single pass is sufficient, or where "better" is too subjective to critique reliably.

Q: How does Tree of Thought differ from Chain-of-Thought?

A: Chain-of-Thought is a linear reasoning process: one path from problem to solution. Tree of Thought is a search process: generate multiple candidate paths at each step, evaluate their promise, keep the most promising branches, and prune dead ends - like beam search or Monte Carlo Tree Search applied to reasoning. CoT usually beats direct zero-shot answering on multi-step reasoning tasks. ToT is better than CoT when the solution space has many dead ends and the problem benefits from backtracking - planning problems, combinatorial puzzles, multi-step mathematical proofs. Without aggressive pruning, ToT's token cost grows exponentially with depth, so it's impractical for most production use cases and better suited for offline high-stakes decisions.
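The search described above can be sketched as a small beam search. In this illustration `propose` and `score` are caller-supplied functions (in a real system each would wrap a model call); their names and signatures are assumptions made for the example, and the toy problem at the bottom only exists to show the control flow.

```python
from typing import Any, Callable, List

Path = List[Any]


def tree_of_thought(
    problem: Any,
    propose: Callable[[Any, Path], List[Any]],  # next candidate thoughts
    score: Callable[[Any, Path], float],        # higher means more promising
    depth: int = 3,
    beam_width: int = 2,
) -> Path:
    """Beam-search style ToT: expand, score, keep the best, repeat."""
    frontier: List[Path] = [[]]  # start with one empty reasoning path
    for _ in range(depth):
        candidates = [
            path + [thought]
            for path in frontier
            for thought in propose(problem, path)
        ]
        if not candidates:
            break  # nothing left to expand
        candidates.sort(key=lambda p: score(problem, p), reverse=True)
        frontier = candidates[:beam_width]  # prune everything else
    return frontier[0]


# Toy usage: pick three numbers from {1, 2, 3} that sum to the target.
best = tree_of_thought(
    problem=7,
    propose=lambda target, path: [1, 2, 3],
    score=lambda target, path: -abs(target - sum(path)),
)
print(sum(best))  # 7
```

A plain CoT run corresponds to `beam_width=1` with a single proposal per step: one path, no alternatives, no pruning.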

Q: Explain the ReAct framework and how it enables tool use.

A: ReAct (Reasoning + Acting) interleaves explicit reasoning steps (Thought:) with tool calls (Action:) and observations (Observation:). Before each action, the model writes out its reasoning - what it knows, what it needs to find out, and why it's taking this specific action. After getting an observation, it reasons again about what the observation means and what to do next. This structure is valuable for two reasons: it makes the model's reasoning explicit and auditable (you can see why it called a tool), and it reduces "action hallucination" where the model invents tool results rather than actually calling them. In practice, ReAct is the foundation of most agentic systems.
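A minimal sketch of that loop follows. It assumes the model is prompted to emit either `Action: tool_name[argument]` or `Final Answer: ...`; that text format, the `react_loop` name, and the `call_model` and `tools` parameters are illustrative choices for this example, not a fixed ReAct specification.

```python
import re
from typing import Callable, Dict, Optional

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(
    question: str,
    call_model: Callable[[str], str],        # transcript -> next model step
    tools: Dict[str, Callable[[str], str]],  # tool name -> tool function
    max_steps: int = 5,
) -> Optional[str]:
    """Run Thought -> Action -> Observation until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_model(transcript)  # Thought + Action, or Final Answer
        transcript += step + "\n"

        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip()

        action = ACTION_RE.search(step)
        if action is None:
            transcript += "Observation: no valid action found\n"
            continue

        name, argument = action.groups()
        tool = tools.get(name)
        # The observation comes from actually running the tool,
        # never from text the model wrote.
        observation = tool(argument) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer
```

Because the loop appends the observation itself, the model never gets the chance to invent a tool result. With a real API you would typically also stop generation before the model can write its own `Observation:` line.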

Q: What is prompt chaining and how does it differ from a single complex prompt?

A: Prompt chaining breaks a complex multi-step task into a sequence of smaller, focused prompts where each output feeds into the next. A single complex prompt asks one model call to do everything - outline, draft, edit, format - simultaneously. Chaining asks each model call to do one thing well: first outline only, then expand the outline, then draft from the expanded outline, then edit the draft. Benefits: each step is easier to get right, outputs of earlier steps can be verified before proceeding, different models can be used for different steps (powerful model for generation, cheaper model for formatting), and the pipeline is easier to debug (you can inspect intermediate outputs). Downside: higher latency (sequential calls) and higher total token cost.
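The outline, expand, draft, edit sequence can be sketched as a list of prompt templates run in order. The step wording, the `run_chain` name, and the `call_model` parameter are assumptions made for this example; any real pipeline would use its own prompts and validation.

```python
from typing import Callable, List, Tuple

# Each step is (name, template); {input} receives the previous step's output.
CHAIN_STEPS: List[Tuple[str, str]] = [
    ("outline", "Write a bullet outline for an article about: {input}"),
    ("expand", "Expand each bullet of this outline with supporting points:\n{input}"),
    ("draft", "Write a full draft from this expanded outline:\n{input}"),
    ("edit", "Edit this draft for clarity and concision:\n{input}"),
]


def run_chain(
    topic: str,
    call_model: Callable[[str], str],  # prompt -> generated text
    steps: List[Tuple[str, str]] = CHAIN_STEPS,
) -> List[Tuple[str, str]]:
    """Run each step on the previous step's output, keeping every intermediate."""
    outputs: List[Tuple[str, str]] = []
    previous = topic
    for name, template in steps:
        result = call_model(template.format(input=previous))
        if not result.strip():
            # Verify before proceeding: fail fast instead of feeding
            # an empty result into the next step.
            raise ValueError(f"step '{name}' returned empty output")
        outputs.append((name, result))
        previous = result
    return outputs
```

Returning every intermediate output, not just the last one, is what makes the pipeline easy to debug: when the final edit is poor, you can see which step went wrong.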

Q: When is meta-prompting useful and what are its risks?

A: Meta-prompting is useful when the task structure varies significantly across requests - so much that no single fixed prompt handles all variations well. Instead of writing 20 different system prompts for 20 task types, you write one meta-prompt that generates a task-specific prompt, then use that generated prompt for the actual task. It's particularly useful for internal tools where the LLM serves as a general-purpose task router. The risks: generated prompts can be worse than carefully hand-crafted ones (the meta-prompt quality determines generated prompt quality), adding an extra model call increases latency and cost, and the behavior is harder to predict and test (you're testing a system that generates its own instructions). Use it when task diversity genuinely exceeds what fixed prompts can handle - not as a shortcut for writing good prompts.
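The two-call structure looks roughly like the sketch below. The meta-prompt wording, the `DEFAULT_PROMPT` fallback, and the `meta_prompt_run` and `call_model` names are all hypothetical. The fallback is one possible guard against an unusable generated prompt, not a complete answer to the predictability risk.

```python
from typing import Callable, Dict

# Assumed fallback used when the generated prompt is unusable.
DEFAULT_PROMPT = "You are a careful assistant. Complete the user's request."


def meta_prompt_run(request: str, call_model: Callable[[str], str]) -> Dict[str, str]:
    """Call 1 writes a task-specific prompt; call 2 uses it on the request."""
    meta_prompt = (
        "Write a system prompt for an assistant that will handle the request "
        "below. Specify the role, the output format, and quality criteria. "
        "Return only the prompt.\n\n"
        f"Request: {request}"
    )
    generated_prompt = call_model(meta_prompt).strip()
    if not generated_prompt:
        generated_prompt = DEFAULT_PROMPT  # guard against an empty generation

    answer = call_model(f"{generated_prompt}\n\nUser request: {request}")
    # Return the generated prompt too, so it can be logged and reviewed.
    return {"generated_prompt": generated_prompt, "answer": answer}
```

Logging `generated_prompt` alongside the answer is what makes this testable at all: you can review which instructions the system wrote for itself when an output looks wrong.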
