Skip to main content

Tree-of-Thought Prompting

The Puzzle That Breaks Chain-of-Thought

Your team is building an AI system to generate strategic marketing campaign plans. The system needs to:

  • Select a target audience segment
  • Choose a messaging strategy
  • Decide on channel mix
  • Estimate budget allocation

You use chain-of-thought. The model picks a target audience, builds a messaging strategy on top of it, then selects channels to fit - and by the end, the budget doesn't make sense for the channels chosen. You try again. Different starting choice, same problem: the model commits to an early decision and then rationalizes everything downstream to fit it.

The fundamental issue: real planning requires exploring alternatives. A good human strategist doesn't commit to "millennials in urban areas" and then build a plan around that - they consider multiple audience segments, sketch implications of each, and then decide. Chain-of-thought can't do this. It moves in one direction.

Tree-of-Thought is the solution. Instead of a single linear reasoning chain, it explores a tree of possibilities - generating multiple candidate thoughts at each step, evaluating them, and expanding the most promising branches. It's the difference between walking in a straight line through a maze and systematically exploring it.

Why Linear Reasoning Fails for Planning

Chain-of-thought is powerful for step-by-step computation: arithmetic, logical deduction from premises, code tracing. In all of these, each step is determined by the previous one - there's one correct next step.

Planning and creative problem-solving are different. At each decision point, there are multiple valid next moves. The quality of a sequence of decisions depends on the entire path, not just local quality. A locally good decision can lead globally to a dead end.

Classic examples where CoT fails:

  • Creative writing: Multiple plausible plot directions, need to explore them
  • Mathematical proofs: Multiple proof strategies, need to try and abandon
  • Game playing: Tree search is foundational to game AI for exactly this reason
  • Scheduling/optimization: Multiple feasible schedules, need to evaluate globally

The Game of 24 became the canonical example in the Tree-of-Thought paper: given four numbers (e.g., 4, 9, 10, 13), use arithmetic operations to reach 24. Chain-of-thought achieves only 4% accuracy. Tree-of-Thought achieves 74%.

Why? Because the game requires systematic exploration: try 4+9=13, then 13×13=169 (dead end) → backtrack, try 4×9=36, then 36-13=23 (close but not 24) → backtrack, try (10-4)×(13-9)... and so on. CoT commits to one path. ToT explores the tree.

Historical Context: Yao et al. 2023

"Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023, Princeton & Google DeepMind) introduced the framework formally.

The paper's framing was explicit: it was drawing on Kahneman's System 1 / System 2 thinking (from "Thinking Fast and Slow"). CoT is System 1: fast, intuitive, sequential. ToT is System 2: slow, deliberate, exploratory. Just as humans switch between automatic and deliberate thinking depending on problem complexity, LLMs should switch between CoT and ToT.

The key insight: language models are both the generator of thoughts and the evaluator of those thoughts. You don't need a separate evaluation model - you can ask the same LLM: "Is this partial solution promising?" and use its judgment to guide the search.

The Three Components of Tree-of-Thought

Component 1: Thought Generation

At each step, generate multiple candidate "thoughts" - partial solutions or next moves. There are two approaches:

Sample (diversity): Generate K independent samples using higher temperature. Good when you want diverse candidates.

thoughts = []
for _ in range(k):
thought = llm.generate(state, temperature=0.8)
thoughts.append(thought)

Propose (structured): Ask the model to generate K distinct candidates in a single call.

prompt = f"""Given the current problem state:
{state}

Generate {k} different possible next steps. Each should be distinct.
Format: Step 1: ... Step 2: ... Step 3: ..."""

Component 2: State Evaluation

For each candidate thought, ask the model: how promising is this partial solution?

Value estimation (for scalar scoring):

Rate this partial solution on a scale of 1-10, where 10 means definitely leads to a solution:
[partial solution]

Classification (sure/likely/impossible):

For the Game of 24 with numbers [4, 9]:
Current expression: 4 + 9 = 13
Remaining numbers: [13, 13]
Can 13 and 13 produce 24? (sure/likely/impossible)

The evaluation step is what makes ToT expensive but powerful. You're making multiple LLM calls per node in the search tree.

Component 3: Search Algorithm

Breadth-First Search (BFS): At each depth level, generate all candidate thoughts, evaluate them, keep the top-b, and proceed to the next level. Good for problems where you want to explore broadly before going deep.

Depth-First Search (DFS): Follow the most promising path until you reach a solution or dead end, then backtrack. Good for problems where a solution can be verified quickly.

For most prompt engineering tasks, BFS with beam size 3-5 is the practical choice.

The Game of 24 Walkthrough

Problem: Use 4, 9, 10, 13 with +, -, ×, ÷ to make 24.

With CoT (fails):

Let me try: 4 + 9 = 13, 13 + 13 = 26, hmm that's not 24.
Let me try: 4 × 9 = 36, 36 - 13 = 23, 23 + 1... wait there's no 1.
I can't find a solution.

(The correct solution is (9 - 4) × (10 - 13 + ... wait, let's be precise: actually 13 - 9 = 4, 4 × 4 = 16, no... 10 - 4 = 6, 6 × (13 - 9) = 6 × 4 = 24! Yes.)

CoT misses this because it doesn't systematically explore all combinations.

With ToT:

State 0: Numbers available = [4, 9, 10, 13]

Generate thoughts (all pairwise operations):

  • 4+9=13 → [13, 10, 13]
  • 4-9=-5 → [-5, 10, 13]
  • 4×9=36 → [36, 10, 13]
  • 9-4=5 → [5, 10, 13]
  • 10-4=6 → [9, 6, 13] ← promising
  • 4×10=40 → [9, 40, 13]
  • ... (all 12 combinations)

Evaluate: Score each by likelihood of reaching 24. The evaluator gives "10-4=6 → [9, 6, 13]" a high score because 6×4=24 and we have 9, 6, 13.

Expand promising state [9, 6, 13]:

  • 6×(13-9) = 6×4 = 24 ✓ SOLUTION FOUND

Implementing Tree-of-Thought

import anthropic
import json
from dataclasses import dataclass, field
from typing import Optional

client = anthropic.Anthropic()

@dataclass
class ThoughtNode:
state: str
thought: str
value: float = 0.0
depth: int = 0
parent: Optional["ThoughtNode"] = None
children: list["ThoughtNode"] = field(default_factory=list)

def path(self) -> list[str]:
"""Return the reasoning path from root to this node."""
path = []
node = self
while node is not None:
if node.thought:
path.append(node.thought)
node = node.parent
return list(reversed(path))


def generate_thoughts(state: str, problem: str, n_thoughts: int = 3) -> list[str]:
"""Generate multiple candidate next steps."""
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=500,
temperature=0.8,
messages=[{
"role": "user",
"content": f"""Problem: {problem}

Current state: {state}

Generate exactly {n_thoughts} different possible next steps to advance toward a solution.
Each should be meaningfully different - explore different approaches.
Number each step as "Option 1:", "Option 2:", etc.
Keep each option to 1-2 sentences."""
}]
)
text = message.content[0].text
options = []
for i in range(1, n_thoughts + 1):
import re
match = re.search(rf"Option {i}:(.*?)(?=Option {i+1}:|$)", text, re.DOTALL)
if match:
options.append(match.group(1).strip())
return options if options else [text]


def evaluate_thought(state: str, thought: str, problem: str) -> float:
"""Evaluate how promising a partial solution is. Returns 0.0-1.0."""
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=100,
temperature=0,
messages=[{
"role": "user",
"content": f"""Problem: {problem}

Current state: {state}
Proposed next step: {thought}

Rate the promise of this approach on a scale from 0 to 10, where:
- 0: This is a dead end or clearly wrong
- 5: Uncertain, might work
- 10: This will definitely lead to a solution

Respond with ONLY a number between 0 and 10."""
}]
)

try:
score = float(message.content[0].text.strip())
return min(max(score / 10.0, 0.0), 1.0)
except ValueError:
return 0.5


def check_solution(state: str, problem: str) -> tuple[bool, str]:
"""Check if the current state is a solution."""
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=200,
temperature=0,
messages=[{
"role": "user",
"content": f"""Problem: {problem}

Current state: {state}

Has this problem been solved? Respond with:
SOLVED: [explanation] if yes
NOT_SOLVED: [what's still needed] if no"""
}]
)
text = message.content[0].text.strip()
is_solved = text.upper().startswith("SOLVED")
return is_solved, text


def tree_of_thought_bfs(
problem: str,
initial_state: str,
max_depth: int = 3,
beam_width: int = 3,
thoughts_per_step: int = 3
) -> Optional[ThoughtNode]:
"""
BFS-based Tree of Thought search.

Args:
problem: Problem description
initial_state: Starting state description
max_depth: Maximum depth to search
beam_width: Number of best nodes to keep per level
thoughts_per_step: Number of candidate thoughts to generate per node

Returns:
Best solution node found, or None
"""
root = ThoughtNode(state=initial_state, thought="", depth=0)
beam = [root]

for depth in range(max_depth):
print(f"\n--- Depth {depth + 1} ---")
next_beam_candidates = []

for node in beam:
# Check if already solved
is_solved, explanation = check_solution(node.state, problem)
if is_solved:
print(f"Solution found at depth {depth}!")
print("Path:", "\n".join(node.path()))
return node

# Generate candidate thoughts
thoughts = generate_thoughts(node.state, problem, thoughts_per_step)

for thought in thoughts:
# Create new state by applying thought
new_state = f"{node.state}\n\nStep {depth + 1}: {thought}"

# Evaluate promise
value = evaluate_thought(node.state, thought, problem)
print(f" Thought: {thought[:60]}... → value: {value:.2f}")

child = ThoughtNode(
state=new_state,
thought=thought,
value=value,
depth=depth + 1,
parent=node
)
node.children.append(child)
next_beam_candidates.append(child)

# Keep top beam_width candidates
next_beam_candidates.sort(key=lambda n: n.value, reverse=True)
beam = next_beam_candidates[:beam_width]

print(f"Top {beam_width} nodes retained.")

# Return best node (highest value leaf)
all_leaves = [n for n in next_beam_candidates]
if all_leaves:
return max(all_leaves, key=lambda n: n.value)
return None


# Example: Marketing strategy planning
problem = """
We need to create a 3-month digital marketing campaign for a B2B SaaS product
(project management tool, $49/month). Target: small tech companies (10-50 employees).
Budget: $15,000. Goal: 100 trial signups.

Create a complete campaign strategy with: target audience, messaging, channel mix,
and budget allocation.
"""

initial_state = "Starting campaign planning. All budget ($15,000) unallocated. No strategy defined yet."

result = tree_of_thought_bfs(
problem=problem,
initial_state=initial_state,
max_depth=3,
beam_width=3,
thoughts_per_step=3
)

if result:
print("\n=== FINAL STRATEGY ===")
print(result.state)
print("\nReasoning path:")
for step in result.path():
print(f" - {step}")

Simplified ToT for Creative Writing

import anthropic

client = anthropic.Anthropic()

def tot_creative_writing(
story_premise: str,
n_directions: int = 3,
evaluation_criteria: str = "engagement, originality, and narrative tension"
) -> str:
"""
Generate multiple story directions and select the best one.
Simplified ToT for creative writing.
"""

# Step 1: Generate multiple story directions
directions_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=800,
temperature=0.9,
messages=[{
"role": "user",
"content": f"""Story premise: {story_premise}

Generate {n_directions} distinctly different directions this story could take.
For each direction, write 2-3 sentences describing where the story goes.
Label them Direction 1, Direction 2, etc."""
}]
)
directions_text = directions_response.content[0].text

# Step 2: Evaluate each direction
evaluation_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=400,
temperature=0,
messages=[{
"role": "user",
"content": f"""Story premise: {story_premise}

The following story directions have been proposed:

{directions_text}

Evaluate each direction for {evaluation_criteria}.
Rate each from 1-10 and briefly explain. Then state which direction is best and why.
End with: BEST DIRECTION: [number]"""
}]
)
evaluation_text = evaluation_response.content[0].text

# Step 3: Develop the best direction
import re
best_match = re.search(r"BEST DIRECTION:\s*(\d+)", evaluation_text)
best_num = best_match.group(1) if best_match else "1"

develop_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1000,
temperature=0.7,
messages=[{
"role": "user",
"content": f"""Story premise: {story_premise}

From the evaluation:
{evaluation_text}

Develop Direction {best_num} into a full story opening (3-4 paragraphs).
Make it compelling and vivid."""
}]
)

return develop_response.content[0].text


# Usage
result = tot_creative_writing(
story_premise="An AI researcher discovers their most advanced model has been secretly running experiments on them.",
n_directions=3
)
print(result)

Graph of Thoughts: The Extension

Graph of Thoughts (Besta et al., 2023) extends ToT from a tree to a general graph structure, allowing:

  • Merging: Combine two reasoning paths that converge on the same intermediate insight
  • Aggregation: Pool insights from multiple branches before proceeding
  • Cycles: Revisit earlier states with new information

The graph structure is more powerful but significantly more complex to implement. For most production use cases, ToT with BFS/DFS is sufficient.

When ToT is Overkill

ToT is 10-100x more expensive than a single CoT pass. It's only justified when:

Use CaseJustified?
Game of 24 / combinatorial puzzlesYes
Complex multi-step code generationSometimes
Strategic planning with many alternativesYes
Mathematical proof searchYes
Simple Q&ANo - use CoT
ClassificationNo - use zero-shot
Text summarizationNo
Data extractionNo

Rule of thumb: Use ToT when the problem has explicit decision points where multiple valid alternatives exist, and where the quality of the final solution depends on global path quality rather than local step quality.

Cost Implications

A concrete cost comparison for a planning task:

ApproachLLM CallsTokens (approx)Cost (approx)
Zero-shot1200$0.0006
CoT1600$0.002
ToT (depth=3, beam=3, k=3)30-508,000-15,0000.050.05-0.15
ToT + self-consistency150-25040,000-75,0000.250.25-0.75

ToT is for high-value tasks where a 0.100.10-1.00 cost per query is acceptable.

Production Engineering Notes

1. Time Out Aggressively

ToT searches can take 20-60 seconds with multiple sequential LLM calls. Always set a maximum time budget:

import asyncio

async def tot_with_timeout(problem: str, initial_state: str, timeout: float = 30.0):
try:
return await asyncio.wait_for(
asyncio.to_thread(tree_of_thought_bfs, problem, initial_state),
timeout=timeout
)
except asyncio.TimeoutError:
# Return best partial solution found so far
return None

2. Cache Evaluations

The same (state, thought) pair may be evaluated multiple times in a search. Cache evaluation results to avoid redundant LLM calls:

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_evaluate(state_hash: str, thought: str, problem: str) -> float:
return evaluate_thought(state_hash, thought, problem)

3. Use Smaller Models for Evaluation

The evaluation step doesn't require the full capability of your most powerful model. Use a smaller, faster model for evaluation and a larger model only for thought generation:

# Evaluation: cheaper, faster
eval_model = "claude-haiku-3-5" # or gpt-4o-mini

# Generation: more capable
gen_model = "claude-sonnet-4-6"

Common Mistakes

:::danger Mistake 1: Using ToT for Simple Tasks If chain-of-thought works, ToT is wasteful. Always start with the simplest technique that achieves your accuracy target. ToT is a last resort for genuinely hard search problems. :::

:::danger Mistake 2: Not Implementing Proper Backtracking A tree search without backtracking is just a parallel CoT. The power of ToT is the ability to abandon dead ends. Ensure your implementation actually backtracks when the evaluation score drops below a threshold. :::

:::warning Mistake 3: Too Many Branches More branches = more cost with diminishing returns. Start with k=3 thoughts per step and beam width=3. Increase only if accuracy is insufficient and cost is acceptable. :::

:::warning Mistake 4: Poor Evaluation Prompts The search is only as good as the evaluation. If your evaluation prompt doesn't accurately assess whether a partial solution is promising, you'll prune the right answers. Spend time engineering the evaluation prompt carefully. :::

Interview Q&A

Q1: What problem does Tree-of-Thought solve that Chain-of-Thought cannot?

Chain-of-thought follows a single linear reasoning path - it can't backtrack or explore alternatives. For problems with multiple decision points where the correct path depends on global, not local, quality (planning, proof search, game playing, optimization), CoT commits to early decisions and can get stuck. Tree-of-Thought maintains multiple reasoning paths in parallel, evaluates them, and prunes unpromising branches - enabling systematic exploration with backtracking. The Game of 24 is the canonical example: CoT achieves 4% accuracy, ToT achieves 74%.

Q2: What are the three components of Tree-of-Thought?

(1) Thought generation: at each step, generate K candidate next steps (either by sampling K times or by asking the model to produce K alternatives at once). (2) State evaluation: for each candidate thought, ask the model to assess how promising this partial solution is - typically as a score or (sure/likely/impossible) classification. (3) Search algorithm: BFS (explore all promising candidates at each depth before going deeper) or DFS (follow the best path until solution or dead end, then backtrack). These components can all use the same base LLM.

Q3: When is BFS better than DFS for Tree-of-Thought, and vice versa?

BFS is better when: you don't know the depth at which solutions will be found, the search space is relatively small, or you want to find the globally best solution rather than just any solution. BFS explores broadly before going deep. DFS is better when: solutions tend to be at a consistent depth, early termination is likely, or the search space is large and DFS is more likely to hit a solution before exhausting its budget. For most prompt engineering applications, BFS with a beam (keeping only top-k candidates per level) is the practical choice.

Q4: How does ToT use LLMs differently from CoT?

CoT uses the LLM as a sequential reasoner - one call, one reasoning chain, one answer. ToT uses the LLM in two distinct roles: as a generator (produce multiple candidate next steps) and as an evaluator (assess how promising each partial solution is). The same model can play both roles. This meta-cognitive ability - using an LLM to judge the quality of its own outputs - is the key innovation. It effectively turns the LLM into both the search policy and the value function in a tree search algorithm.

Q5: What is the Graph of Thoughts extension and what does it add?

Graph of Thoughts (Besta et al., 2023) generalizes ToT from a tree structure to a directed graph. This enables: (1) Merging - two reasoning paths that independently reach the same intermediate insight can be merged, preventing redundant exploration; (2) Aggregation - insights from multiple branches can be combined before proceeding; (3) Cycles - the system can revisit earlier states with new information. The graph structure can represent more complex reasoning patterns, but is significantly harder to implement correctly and the computational complexity grows quickly.

Q6: How would you decide whether to use CoT vs. ToT for a production task?

Start with the simplest approach. Use CoT if: the task is sequential with a clear step-by-step process, there are no meaningful decision points where alternatives should be explored, and CoT accuracy is sufficient. Move to ToT if: (a) CoT accuracy is clearly insufficient despite good prompt engineering, (b) the task has explicit decision points where alternatives should be explored, (c) the quality of the solution depends on global path choices, and (d) you can afford 10-100x the cost of CoT per query. Also consider whether the task might be better served by fine-tuning or a specialized algorithm rather than ToT.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Tree of Thought Reasoning demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.