Planning with LLMs
The Surprise and the Trap
Here is a thing that surprises many engineers: LLMs are actually quite good at planning. Ask GPT-4 to produce a step-by-step plan for setting up a CI/CD pipeline, or analyzing a business strategy, or debugging a complex system - the result is often coherent, well-ordered, and captures the important dependencies.
Here is the trap: LLMs are also wrong about plans in predictable, systematic ways. They skip non-obvious prerequisites. They underestimate the complexity of steps that look simple. They hallucinate the availability of tools or APIs. They fail to account for error conditions. And they degrade rapidly beyond roughly 7 planning steps.
The engineers who build reliable planning systems are the ones who understand both the strength (surprising planning capability) and the failure modes (systematic, predictable errors). This lesson covers both - and the algorithms designed to work with these characteristics.
:::tip 🎮 Interactive Playground Visualize this concept: Try the Agent Planning demo on the EngineersOfAI Playground - no code required. :::
Why LLMs Can Plan at All
LLMs encode a compressed representation of a vast amount of human-written plans: project plans, recipe instructions, troubleshooting guides, scientific experimental designs, software architecture documents. The model has seen, in some form, millions of examples of how humans sequence steps toward goals.
This is why zero-shot planning often works for common task types. The model is pattern-matching against known templates. "Set up a web server" has a recognizable template. "Implement authentication" has a recognizable template. The plan that emerges is often a reasonable instantiation of that template.
The failure mode comes when the task is novel, highly specific, or crosses domain boundaries in ways that do not match known templates. In those cases, the model stitches together plausible-sounding steps that do not actually compose correctly.
Planning Approaches
Zero-Shot Planning
The simplest approach: ask the LLM to generate a plan in a single prompt, with no examples.
def zero_shot_plan(goal: str) -> list[str]:
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a planning assistant. Generate a step-by-step plan to achieve the given goal. Be specific and actionable."},
{"role": "user", "content": f"Goal: {goal}\n\nProvide a numbered step-by-step plan."},
],
temperature=0.3,
)
text = response.choices[0].message.content
lines = [l.strip() for l in text.split("\n") if l.strip() and l.strip()[0].isdigit()]
return lines
When to use: simple, well-understood task types. When requirements are clear and the task is in a domain well-represented in training data.
Limitations: no self-correction; errors in early steps cannot be caught before generating later steps.
Chain-of-Thought Planning
Ask the model to reason step-by-step before producing the plan. The reasoning process helps the model catch errors and think through dependencies.
COT_PLANNING_PROMPT = """
Before writing the final plan, reason through:
1. What is the exact goal? What does success look like?
2. What are the prerequisites - what must exist/be true before we start?
3. What are the major phases of work?
4. What dependencies exist between phases?
5. What could go wrong, and how should we handle it?
After this reasoning, produce the final numbered plan.
Format your response as:
## Reasoning
[your step-by-step reasoning]
## Plan
1. [first step]
2. [second step]
...
"""
Chain-of-thought planning consistently outperforms zero-shot planning on complex tasks. The reasoning phase forces the model to confront dependencies and edge cases before committing them to a plan.
ReWOO: Detaching Planning from Execution
ReWOO (Reasoning WithOut Observation) separates planning from execution entirely. The planner generates the complete plan upfront, with explicit variable references for future tool results. The executor then runs the plan without calling the LLM again until execution is complete.
Plan:
Step 1: Search("current Python version") → #E1
Step 2: Search("FastAPI compatibility " + #E1) → #E2
Step 3: WriteFile("requirements.txt", fastapi_version(#E2)) → #E3
Execute:
#E1 = tool_call(search, "current Python version") = "Python 3.12"
#E2 = tool_call(search, "FastAPI compatibility Python 3.12") = "FastAPI 0.111.0+"
#E3 = tool_call(write_file, "requirements.txt", "fastapi>=0.111.0")
The advantage: far fewer LLM calls during execution. The disadvantage: the plan must be complete and correct before execution starts. ReWOO works best when the task structure is predictable enough that the plan rarely needs revision.
Tree of Thoughts (ToT)
Tree of Thoughts is the most powerful planning approach for complex, open-ended tasks. Instead of generating a single plan, ToT explores multiple plan branches simultaneously, evaluates each branch, and selects the best path forward.
Three operations define ToT:
- Generate: produce N candidate next steps from the current state
- Evaluate: score each candidate (heuristic, LLM judge, or both)
- Select: keep the top-k candidates; prune the rest
The search strategy can be breadth-first (explore all branches at each depth) or depth-first (follow the most promising branch until it succeeds or fails).
Full Implementation: Tree of Thoughts Planner
"""
tree_of_thoughts.py
Complete Tree of Thoughts planning implementation with backtracking.
Requirements:
pip install openai pydantic
"""
from __future__ import annotations
import json
import textwrap
from dataclasses import dataclass, field
from typing import Optional
import heapq
from openai import OpenAI
client = OpenAI()
# ─── Data Structures ──────────────────────────────────────────────────────────
@dataclass
class ThoughtNode:
"""A single node in the Tree of Thoughts."""
thought: str # the plan step / idea
depth: int # how deep in the tree
score: float = 0.0 # evaluation score (0–1)
parent: Optional["ThoughtNode"] = None
children: list["ThoughtNode"] = field(default_factory=list)
is_terminal: bool = False # True if this is a complete plan
is_pruned: bool = False # True if evaluated as dead end
def path_from_root(self) -> list[str]:
"""Return all thoughts from root to this node."""
path = []
node = self
while node is not None:
path.append(node.thought)
node = node.parent
path.reverse()
return path
def __lt__(self, other: "ThoughtNode") -> bool:
# For priority queue: higher score = higher priority
return self.score > other.score
# ─── Generation ───────────────────────────────────────────────────────────────
GENERATE_PROMPT = textwrap.dedent("""
You are a planner generating possible next steps for achieving a goal.
Goal: {goal}
Current plan so far:
{current_path}
Generate {n_candidates} distinct candidate next steps. Each should be:
- Specific and actionable
- Meaningfully different from the others (not just rephrasing)
- A logical continuation of the current plan
Respond with a JSON array of strings, each being one candidate next step.
Example: ["Step option 1", "Step option 2", "Step option 3"]
""").strip()
EVALUATE_PROMPT = textwrap.dedent("""
You are evaluating potential next steps in a plan to achieve a goal.
Goal: {goal}
Plan so far:
{current_path}
Candidate next step: {candidate}
Score this candidate from 0.0 to 1.0 based on:
- Relevance: does it move toward the goal? (0.4 weight)
- Feasibility: is it actually doable? (0.3 weight)
- Efficiency: is it the right step at this point? (0.3 weight)
Also determine: is this step a terminal step that completes the plan? (true/false)
Respond with JSON: {{"score": 0.85, "is_terminal": false, "reasoning": "brief explanation"}}
""").strip()
def generate_candidates(
goal: str,
current_path: list[str],
n_candidates: int = 3,
model: str = "gpt-4o-mini",
) -> list[str]:
"""Generate N candidate next steps given the current plan state."""
path_str = "\n".join(f"{i+1}. {step}" for i, step in enumerate(current_path)) if current_path else "(none yet)"
response = client.chat.completions.create(
model=model,
messages=[{
"role": "user",
"content": GENERATE_PROMPT.format(
goal=goal,
current_path=path_str,
n_candidates=n_candidates,
),
}],
temperature=0.8, # Higher temp for diversity
response_format={"type": "json_object"},
)
raw = response.choices[0].message.content
# The model returns either an array directly or {"steps": [...]}
parsed = json.loads(raw)
if isinstance(parsed, list):
return parsed
for key in ("steps", "candidates", "options", "next_steps"):
if key in parsed:
return parsed[key]
# Fallback: take first list value
return next(v for v in parsed.values() if isinstance(v, list))
def evaluate_candidate(
goal: str,
current_path: list[str],
candidate: str,
model: str = "gpt-4o-mini",
) -> tuple[float, bool]:
"""Evaluate a candidate step. Returns (score, is_terminal)."""
path_str = "\n".join(f"{i+1}. {step}" for i, step in enumerate(current_path)) if current_path else "(none yet)"
response = client.chat.completions.create(
model=model,
messages=[{
"role": "user",
"content": EVALUATE_PROMPT.format(
goal=goal,
current_path=path_str,
candidate=candidate,
),
}],
temperature=0.1, # Low temp for consistent evaluation
response_format={"type": "json_object"},
)
raw = response.choices[0].message.content
data = json.loads(raw)
return float(data["score"]), bool(data.get("is_terminal", False))
# ─── Tree of Thoughts Search ──────────────────────────────────────────────────
class TreeOfThoughts:
"""
Tree of Thoughts planner using best-first search with LLM generation
and evaluation at each node.
"""
def __init__(
self,
model: str = "gpt-4o-mini",
n_candidates: int = 3, # branches to explore at each node
beam_width: int = 2, # keep top-K beams (BFS mode)
max_depth: int = 8, # maximum plan depth
prune_threshold: float = 0.4, # discard candidates below this score
strategy: str = "best_first", # "bfs", "dfs", or "best_first"
):
self.model = model
self.n_candidates = n_candidates
self.beam_width = beam_width
self.max_depth = max_depth
self.prune_threshold = prune_threshold
self.strategy = strategy
self._llm_calls = 0
def search(self, goal: str) -> list[str]:
"""
Run Tree of Thoughts search for the given goal.
Returns the best plan found as an ordered list of steps.
"""
print(f"\nTree of Thoughts search for: {goal[:60]}...")
print(f"Strategy: {self.strategy} | Candidates: {self.n_candidates} | Max depth: {self.max_depth}")
root = ThoughtNode(thought="[START]", depth=0, score=1.0)
if self.strategy == "bfs":
return self._bfs_search(goal, root)
elif self.strategy == "dfs":
return self._dfs_search(goal, root, [])
else:
return self._best_first_search(goal, root)
def _best_first_search(self, goal: str, root: ThoughtNode) -> list[str]:
"""
Best-first search: always expand the highest-scored node.
Implemented with a max-heap (priority queue).
"""
# Priority queue: (negative_score, node) - we negate because heapq is min-heap
frontier: list[tuple[float, ThoughtNode]] = [(-root.score, root)]
best_terminal: Optional[ThoughtNode] = None
while frontier:
_, node = heapq.heappop(frontier)
if node.is_pruned or node.depth >= self.max_depth:
continue
if node.is_terminal:
if best_terminal is None or node.score > best_terminal.score:
best_terminal = node
print(f" ✅ Found terminal plan at depth {node.depth} (score: {node.score:.2f})")
continue
current_path = node.path_from_root()[1:] # exclude [START]
print(f"\n Expanding at depth {node.depth} (score: {node.score:.2f})")
# Generate candidates
candidates = generate_candidates(goal, current_path, self.n_candidates, self.model)
self._llm_calls += 1
# Evaluate each candidate
scored: list[tuple[float, str, bool]] = []
for candidate in candidates:
score, is_terminal = evaluate_candidate(goal, current_path, candidate, self.model)
self._llm_calls += 1
scored.append((score, candidate, is_terminal))
print(f" [{score:.2f}] {candidate[:60]}")
# Expand top candidates
for score, candidate, is_terminal in scored:
if score < self.prune_threshold:
print(f" ✂️ Pruning (score {score:.2f} < threshold {self.prune_threshold})")
continue
child = ThoughtNode(
thought=candidate,
depth=node.depth + 1,
score=score,
parent=node,
is_terminal=is_terminal,
)
node.children.append(child)
heapq.heappush(frontier, (-score, child))
# Early exit if we've found a good-enough plan
if best_terminal and best_terminal.score >= 0.9:
break
if best_terminal:
plan = best_terminal.path_from_root()[1:] # exclude [START]
print(f"\n Total LLM calls: {self._llm_calls}")
return plan
# No terminal found - return the deepest high-scored path
return self._extract_best_partial_plan(root)
def _bfs_search(self, goal: str, root: ThoughtNode) -> list[str]:
"""
Beam search (BFS variant): keep top beam_width nodes at each depth level.
"""
current_level = [root]
best_terminal: Optional[ThoughtNode] = None
for depth in range(self.max_depth):
if not current_level:
break
print(f"\n BFS depth {depth+1}: expanding {len(current_level)} node(s)")
next_level: list[ThoughtNode] = []
for node in current_level:
if node.is_terminal:
if best_terminal is None or node.score > best_terminal.score:
best_terminal = node
continue
current_path = node.path_from_root()[1:]
candidates = generate_candidates(goal, current_path, self.n_candidates, self.model)
self._llm_calls += 1
for candidate in candidates:
score, is_terminal = evaluate_candidate(goal, current_path, candidate, self.model)
self._llm_calls += 1
if score < self.prune_threshold:
continue
child = ThoughtNode(
thought=candidate,
depth=depth + 1,
score=score,
parent=node,
is_terminal=is_terminal,
)
node.children.append(child)
next_level.append(child)
# Beam selection: keep top beam_width
next_level.sort(key=lambda n: n.score, reverse=True)
current_level = next_level[:self.beam_width]
if best_terminal:
return best_terminal.path_from_root()[1:]
return self._extract_best_partial_plan(root)
def _dfs_search(self, goal: str, node: ThoughtNode, current_path: list[str]) -> list[str]:
"""
Depth-first search with backtracking.
Returns the first complete plan found.
"""
if node.is_terminal:
return current_path
if node.depth >= self.max_depth:
return current_path # return partial
candidates = generate_candidates(goal, current_path, self.n_candidates, self.model)
self._llm_calls += 1
scored: list[tuple[float, str, bool]] = []
for candidate in candidates:
score, is_terminal = evaluate_candidate(goal, current_path, candidate, self.model)
self._llm_calls += 1
scored.append((score, candidate, is_terminal))
# Sort by score descending and try each
scored.sort(reverse=True)
for score, candidate, is_terminal in scored:
if score < self.prune_threshold:
continue
child = ThoughtNode(
thought=candidate,
depth=node.depth + 1,
score=score,
parent=node,
is_terminal=is_terminal,
)
node.children.append(child)
result = self._dfs_search(goal, child, current_path + [candidate])
if result:
return result # Found a complete plan
return current_path # Backtrack
def _extract_best_partial_plan(self, root: ThoughtNode) -> list[str]:
"""Extract the deepest, highest-scored path when no terminal is found."""
best_node = root
stack = [root]
while stack:
node = stack.pop()
if node.depth > best_node.depth or (
node.depth == best_node.depth and node.score > best_node.score
):
best_node = node
stack.extend(node.children)
return best_node.path_from_root()[1:]
# ─── Plan Validation ──────────────────────────────────────────────────────────
VALIDATION_PROMPT = """
Review this plan for achieving the goal: {goal}
Plan:
{plan_text}
Evaluate:
1. Are all steps necessary? (no redundant steps)
2. Are there missing prerequisites?
3. Are the steps in the right order?
4. Is the plan feasible with standard software development tools?
Respond with JSON:
{{
"is_valid": true/false,
"issues": ["issue 1", "issue 2"],
"missing_steps": ["missing step 1"],
"revised_plan": ["step 1", "step 2", ...] // if invalid, provide corrected version
}}
"""
def validate_plan(goal: str, plan: list[str], model: str = "gpt-4o") -> dict:
"""Validate a plan for completeness, ordering, and feasibility."""
plan_text = "\n".join(f"{i+1}. {step}" for i, step in enumerate(plan))
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": VALIDATION_PROMPT.format(goal=goal, plan_text=plan_text)}],
temperature=0.1,
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
# ─── ReWOO Planner ────────────────────────────────────────────────────────────
REWOO_PROMPT = """
You are a planning system that creates complete, upfront plans with variable references.
Goal: {goal}
Available tools: search_web, read_file, write_file, run_command, call_api
Create a complete plan where:
- Each step has a format: "Action: tool_name(args) -> #Ei"
- Later steps can reference earlier results using #Ei notation
- The plan should be complete - no steps are decided at runtime
Example:
Step 1: Action: search_web("Python latest stable version") -> #E1
Step 2: Action: search_web("FastAPI version compatible with " + #E1) -> #E2
Step 3: Action: write_file("requirements.txt", "fastapi>=" + version_from(#E2)) -> #E3
Now create a plan for: {goal}
"""
def rewoo_plan(goal: str, model: str = "gpt-4o") -> list[str]:
"""Generate a ReWOO-style plan with variable references."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": REWOO_PROMPT.format(goal=goal)}],
temperature=0.2,
)
text = response.choices[0].message.content
lines = [l.strip() for l in text.split("\n") if l.strip().startswith("Step")]
return lines
# ─── Demo ─────────────────────────────────────────────────────────────────────
def demo_planning_approaches():
goal = (
"Set up automated integration tests for a FastAPI application "
"that runs in CI on every pull request."
)
print("=" * 60)
print("PLANNING APPROACH COMPARISON")
print("=" * 60)
# 1. Zero-shot plan
print("\n1. ZERO-SHOT PLAN")
print("-" * 40)
steps = zero_shot_plan(goal)
for i, step in enumerate(steps[:6], 1):
print(f" {i}. {step}")
# 2. ReWOO plan
print("\n2. ReWOO PLAN (upfront, with variable references)")
print("-" * 40)
rewoo_steps = rewoo_plan(goal)
for step in rewoo_steps[:6]:
print(f" {step}")
# 3. Tree of Thoughts (best-first)
print("\n3. TREE OF THOUGHTS (best-first search)")
print("-" * 40)
tot = TreeOfThoughts(
model="gpt-4o-mini",
n_candidates=3,
max_depth=6,
prune_threshold=0.5,
strategy="best_first",
)
plan = tot.search(goal)
print("\nBest plan found:")
for i, step in enumerate(plan, 1):
print(f" {i}. {step}")
# 4. Validate the plan
print("\n4. PLAN VALIDATION")
print("-" * 40)
validation = validate_plan(goal, plan)
print(f" Valid: {validation['is_valid']}")
if validation.get("issues"):
print(f" Issues: {validation['issues']}")
if validation.get("missing_steps"):
print(f" Missing: {validation['missing_steps']}")
if __name__ == "__main__":
demo_planning_approaches()
The Planning-Execution Gap
A fundamental truth about plans: they are always wrong before execution begins. Not always catastrophically wrong - often just slightly off. A step that seemed clear in the plan reveals unexpected complexity. A tool returns data in a different format than expected. A prerequisite turns out not to be met.
The planning-execution gap is not a failure of the planner. It is an inherent property of operating under uncertainty. The right response is not to make better plans - it is to build systems that can adapt when the plan diverges from reality.
Three strategies for managing the gap:
- Tight feedback loops: verify each step before proceeding. Small steps + early verification = small gaps.
- Replanning triggers: define clear conditions that trigger a full replan (step fails, output is unexpected, time budget exceeded).
- Adaptive execution: allow the executor to make minor plan adjustments without triggering a full replan. Reserve full replanning for major deviations.
Horizon Limits: How Far Can LLMs Reliably Plan?
Empirical evidence from planning benchmarks (AlfWorld, WebArena, GAIA) suggests LLMs can reliably plan 5–7 steps ahead on tasks with clear structure. Beyond that, plan quality degrades measurably.
This is not a fundamental limit - it reflects the training data distribution and the difficulty of maintaining coherence over long reasoning chains. Newer models push this boundary gradually.
Practical implication: for tasks requiring 10+ steps, do not plan the entire thing upfront. Plan 5–7 steps, execute them, then replan with the accumulated context. Interleaved planning with smaller horizons consistently outperforms trying to plan everything at once.
Production Notes
:::warning Planning Cost
Tree of Thoughts makes O(depth × n_candidates) LLM calls. With depth=8, n_candidates=3, that is 24 generation calls + 72 evaluation calls = 96 LLM calls for one plan. Use gpt-4o-mini for generation/evaluation and reserve gpt-4o for final validation. Cache plans aggressively - the same goal should not be replanned from scratch.
:::
:::danger Confident but Wrong LLMs produce confident plans for tasks outside their training distribution. High confidence does not mean high accuracy. Always validate plans (step 4 in the demo) before execution, especially when the task involves external systems, specific APIs, or domain-specific constraints the model may not know. :::
Plan determinism: set temperature=0.2 or lower for planning calls when you need reproducible plans (testing, debugging). Use higher temperature (0.7–0.9) only in the Tree of Thoughts candidate generation phase, where diversity is valuable.
Token efficiency: plans in JSON format are more parseable but use more tokens than plain numbered lists. For internal planning, use compact formats; only use JSON when you need to parse the plan programmatically.
Interview Questions and Answers
Q: What is Tree of Thoughts and why is it better than chain-of-thought for complex planning?
A: Chain-of-thought generates a single reasoning chain - if it goes wrong at step 3, everything after is compromised. Tree of Thoughts explores multiple branches simultaneously, evaluates each, and selects the best path. It is essentially best-first search or beam search over a tree of possible plans, using an LLM as both the generator (what steps are possible?) and the evaluator (how good is this step?). For tasks with multiple viable approaches or significant uncertainty about the best path, ToT consistently outperforms single-chain methods. The cost is O(branches × depth) LLM calls vs. O(depth) for CoT.
Q: How do you handle replanning when execution diverges from the plan?
A: I use trigger conditions: if a step fails after N retries, if the step output is semantically far from what was expected, or if the accumulated cost or time exceeds the budget. Replanning takes as input: the original goal, a summary of completed steps and their results, the specific failure/deviation, and the remaining intended plan. The replanner generates a new plan for the remaining work, informed by real execution results rather than hypothetical ones. The key is that replanning does not restart from zero - it builds on what was successfully completed.
Q: What is ReWOO and when would you use it?
A: ReWOO stands for "Reasoning Without Observation." It separates planning from execution entirely: the planner generates the complete plan upfront, including variable references for future tool results (e.g., #E1, #E2). The executor then runs all steps mechanically without any further LLM reasoning. This dramatically reduces total LLM calls compared to the ReAct loop, which calls the LLM at every step. Use ReWOO when the task structure is predictable (you can plan all steps before seeing any results), the tools are reliable, and you need to minimize latency or cost. Use ReAct or interleaved planning when the task is exploratory and each step might reveal information that changes the next.
Q: What are the systematic failure modes of LLM planning?
A: Several predictable patterns: (1) Prerequisite blindness - the model plans step B without noticing B requires X which must be set up first. (2) Underestimation - treating "deploy the application" as a single step when it is actually 8 steps. (3) Tool hallucination - assuming a specific API or command exists that does not. (4) Dependency inversion - ordering tasks in a way that requires a result before it is produced. (5) Horizon collapse - beyond ~7 steps, plans become internally inconsistent. I address these with validation (catch before execution), small step sizes (reduce underestimation), tool schemas (eliminate tool hallucination), and horizon limits (break long plans into segments).
Q: How do you evaluate whether a generated plan is good before executing it?
A: Multiple levels: (1) Structural validation - does the dependency order make sense? Are there missing prerequisites? Can be checked with topological sort + schema validation. (2) LLM critique - ask a second LLM call to identify issues in the plan before execution (the "validation" step in the demo). (3) Domain-specific checks - for software tasks, does the plan include testing? For data tasks, does it include data validation? These can be rule-based. (4) Cost estimation - is the estimated cost/time within acceptable bounds? (5) Reversibility check - are any steps irreversible? If so, require human confirmation before executing.
