What is LLM planning?

How LLM agents handle complex multi-step tasks through plan-and-execute, hierarchical planning, self-reflection, and LangGraph-based workflows.

Planning and Reasoning

A Production Scenario

You are building an AI research assistant for a consulting firm. The task: given a company name, produce a competitive intelligence report covering financials, recent news, product positioning, key personnel changes, and market share estimates. The report needs to synthesize information from at least a dozen sources and be structured for a partner-level audience.

Your first implementation uses a ReAct agent with search tools. The agent starts searching, discovers one piece of information, follows that thread, discovers something else, follows that, and after fifteen minutes and forty tool calls produces a report that is thorough on some topics and completely absent on others. It went deep on the company's recent acquisition because a news article mentioned it, and forgot to check financials entirely.

ReAct is a greedy algorithm. It always takes the locally best-looking next step. For a task this large, greedy is wrong - the agent needs to think about the full scope of the task before it starts executing. It needs a plan.

Your second implementation generates a full research plan first: seven parallel research tracks, each with specific questions to answer, then synthesizes the results. The plan itself is excellent. But execution reveals a problem: the plan was written before the agent knew what it would find, and the world did not cooperate. The target company just got acquired, which makes half the planned research irrelevant and opens new questions the plan never anticipated. A rigid plan executed blindly is almost as bad as no plan.

What you actually need is a planner that creates a structured roadmap but monitors execution results and updates the plan when reality diverges from assumptions. This is the core challenge of LLM planning: how do you get the benefits of upfront structure without the brittleness of a fixed plan?

Why This Exists

The Limits of Greedy Action Selection

ReAct selects the next action one step at a time based on the current context. This works well for tasks where each step is cheap to reverse or where early information genuinely guides later decisions. It breaks down when:

- Tasks are long: ReAct chooses each action from the history so far and keeps no explicit representation of the steps still to come, so on a 20-step task nothing ensures that every part of the scope gets covered
- Sub-tasks are parallelizable: If steps 2, 3, and 4 are independent, greedy execution still runs them serially and wastes time
- Early mistakes compound: A wrong assumption in step 3 may not reveal itself until step 15, by which point the agent has built an elaborate wrong structure
- The full task requires global reasoning: Deciding whether to investigate the financials before or after the product analysis requires understanding the full scope of the report

Why Simple Prompting Fails

The obvious fix is to prompt the model to "think about the full task before acting." But without a structured planning mechanism, this is just chain-of-thought with extra steps. The model produces a plan as prose, then ignores half of it during execution because nothing mechanically enforces the plan.

Real planning requires: a structured plan representation that can be tracked and updated, an executor that follows the plan, a monitor that detects when execution diverges from the plan, and a replanning mechanism for when the original plan no longer fits.

Historical Context

Classical AI Planning (STRIPS, Fikes and Nilsson 1971) - the original formulation of automated planning as state-space search. Define an initial state, a goal state, and operators that transition between states. Find a sequence of operators that reaches the goal. Works perfectly for deterministic, fully-observable environments with complete domain models.

PDDL (McDermott et al., 1998) - Planning Domain Definition Language, the standard representation for classical planning. Still used in robotics and logistics today.

LLM-based Planning diverged from classical planning because: (1) the states of a real-world task cannot be enumerated in advance, (2) real-world tasks are not fully observable, (3) the action space is open-ended and expressed in natural language rather than as a fixed set of operators. The shift was toward approximate planning: generate a plausible plan, execute it, adapt when reality diverges.

Plan-and-Solve (Wang et al., 2023) - showed that asking LLMs to "devise a plan and then solve" outperforms zero-shot chain-of-thought prompting on math and reasoning benchmarks.

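ReWOO (Xu et al., 2023) - Reasoning Without Observation. Generate the full plan including all tool calls before executing any of them, using placeholder variables such as #E1 to stand in for results that are not yet known. This cuts the number of LLM calls because the planner does not have to be re-invoked after every observation.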

LLM Compiler (Kim et al., 2023) - takes the plan-first idea further by identifying dependencies between plan steps and executing independent steps in parallel, much as a compiler schedules independent instructions.
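The dependency idea can be illustrated with a small scheduler that groups steps into "waves": every step in a wave has all of its dependencies satisfied by earlier waves, so the steps inside a wave can run concurrently. This is a simplified level-by-level sketch under my own assumptions (the function name `schedule_waves` and the step names are hypothetical), not a description of how LLM Compiler itself dispatches tasks.

```python
def schedule_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; steps within one wave are mutually independent.

    `dependencies` maps each step name to the set of steps it depends on.
    """
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # A step is ready once everything it depends on has completed.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


# Hypothetical research plan: three independent tracks, then two dependent steps.
DEPS = {
    "financials": set(),
    "news": set(),
    "personnel": set(),
    "positioning": {"news"},
    "synthesis": {"financials", "news", "personnel", "positioning"},
}
waves = schedule_waves(DEPS)
```

Each wave could then be handed to a concurrent executor, with the next wave starting only after the current one finishes.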

LATS (Zhou et al., 2023) - Language Agent Tree Search. Combines MCTS (Monte Carlo Tree Search) with LLM agents. Maintains a tree of possible action sequences, uses a value function to estimate which branches are promising, and explores multiple paths before committing to one.

Plan-and-Execute

The simplest departure from pure ReAct: generate a full step-by-step plan, then execute each step, updating the plan if needed.

import anthropic
import json
import re
from dataclasses import dataclass, field

client = anthropic.Anthropic()


@dataclass
class PlanStep:
    step_number: int
    description: str
    tool_name: str | None = None
    tool_args: dict = field(default_factory=dict)
    result: str = ""
    completed: bool = False


@dataclass
class ExecutionPlan:
    goal: str
    steps: list[PlanStep]
    context: dict = field(default_factory=dict)


def generate_plan(task: str, available_tools: list[str]) -> ExecutionPlan:
    """Use LLM to generate a structured execution plan."""
    tools_desc = "\n".join(f"- {t}" for t in available_tools)
    prompt = f"""You are a planning agent. Given a task, create a step-by-step execution plan.

Available tools:
{tools_desc}

Task: {task}

Output a JSON plan with this exact format:
{{
  "goal": "the main goal",
  "steps": [
    {{
      "step_number": 1,
      "description": "What this step does",
      "tool_name": "tool_to_call or null if synthesis step",
      "tool_args": {{"arg": "value"}}
    }}
  ]
}}

Guidelines:
- Make the plan specific and actionable
- Order steps so earlier results inform later steps
- Include a final synthesis step
- Maximum 8 steps
"""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    raw = response.content[0].text
    json_match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not json_match:
        raise ValueError("Planner did not produce valid JSON")

    plan_data = json.loads(json_match.group())
    steps = [PlanStep(**step) for step in plan_data["steps"]]
    return ExecutionPlan(goal=plan_data["goal"], steps=steps)


def execute_step(step: PlanStep, tool_registry: dict, context: dict) -> str:
    """Execute a single plan step, injecting context from previous steps."""
    if step.tool_name is None:
        # Synthesis step
        context_text = "\n\n".join(
            f"Step {k} result:\n{v}" for k, v in context.items()
        )
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": (
                    f"Based on these research results:\n\n{context_text}\n\n"
                    f"Task: {step.description}"
                )
            }]
        )
        return response.content[0].text

    if step.tool_name not in tool_registry:
        return f"Error: Tool '{step.tool_name}' not found"

    try:
        result = tool_registry[step.tool_name](**step.tool_args)
        return json.dumps(result) if isinstance(result, dict) else str(result)
    except Exception as e:
        return f"Tool error: {e}"


def replan_if_needed(
    plan: ExecutionPlan,
    completed_step: PlanStep,
    remaining_steps: list[PlanStep]
) -> list[PlanStep]:
    """Check if the plan needs updating based on what we learned."""
    prompt = f"""A planning agent is executing a task.

Original goal: {plan.goal}

Most recent step completed:
Step {completed_step.step_number}: {completed_step.description}
Result: {completed_step.result[:500]}

Remaining planned steps:
{json.dumps([{"step_number": s.step_number, "description": s.description}
             for s in remaining_steps], indent=2)}

Do the remaining steps still make sense?
If yes, respond with: {{"needs_replan": false}}
If no, respond with: {{"needs_replan": true, "new_steps": [{{"step_number": 4, "description": "...", "tool_name": "tool_to_call or null", "tool_args": {{}}}}]}}
"""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        json_match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
        if json_match:
            data = json.loads(json_match.group())
            if data.get("needs_replan") and "new_steps" in data:
                print(f"  [REPLAN] Plan updated")
                return [PlanStep(**s) for s in data["new_steps"]]
    except Exception:
        pass
    return remaining_steps


def plan_and_execute_agent(task: str, tool_registry: dict) -> str:
    """Full plan-and-execute agent with replanning capability."""
    available_tools = list(tool_registry.keys())

    print(f"[PLANNER] Generating plan for: {task}")
    plan = generate_plan(task, available_tools)

    print(f"[PLAN] {len(plan.steps)} steps:")
    for step in plan.steps:
        print(f"  Step {step.step_number}: {step.description}")

    remaining_steps = plan.steps.copy()
    final_answer = ""

    while remaining_steps:
        current_step = remaining_steps.pop(0)
        print(f"\n[EXECUTE] Step {current_step.step_number}: {current_step.description}")

        result = execute_step(current_step, tool_registry, plan.context)
        current_step.result = result
        current_step.completed = True
        plan.context[f"step_{current_step.step_number}"] = result
        print(f"  Result: {result[:200]}...")

        if remaining_steps:
            remaining_steps = replan_if_needed(plan, current_step, remaining_steps)

        # The last step executed (normally the synthesis step) carries the answer.
        # Tracking it here keeps the return value correct even after a replan
        # changes the step numbering.
        final_answer = result

    return final_answer

Hierarchical Planning: Manager and Worker Agents

For very complex tasks, a single planner managing all steps becomes a bottleneck. Hierarchical planning splits the problem: a manager agent creates a high-level plan and delegates subtasks to specialized worker agents.

import json
import re
from dataclasses import dataclass
from typing import Callable

import anthropic


@dataclass
class SubTask:
    name: str
    description: str
    assigned_to: str
    result: str = ""
    completed: bool = False


class ManagerAgent:
    """Breaks down a complex task into subtasks and delegates them."""

    def __init__(self, workers: dict[str, Callable]):
        self.workers = workers
        self.client = anthropic.Anthropic()

    def decompose_task(self, task: str) -> list[SubTask]:
        workers_desc = "\n".join(
            f"- {name}: {fn.__doc__ or 'specialist'}"
            for name, fn in self.workers.items()
        )
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": (
                    f"Break this task into subtasks for specialists.\n\n"
                    f"Specialists:\n{workers_desc}\n\n"
                    f"Task: {task}\n\n"
                    f"Output JSON: {{\"subtasks\": [{{\"name\": str, "
                    f"\"description\": str, \"assigned_to\": str}}]}}"
                )
            }]
        )
        json_match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
        if not json_match:
            return []
        data = json.loads(json_match.group())
        return [SubTask(**st) for st in data.get("subtasks", [])]

    def run(self, task: str) -> str:
        subtasks = self.decompose_task(task)
        print(f"[MANAGER] {len(subtasks)} subtasks decomposed")

        for subtask in subtasks:
            if subtask.assigned_to in self.workers:
                result = self.workers[subtask.assigned_to](subtask.description)
                subtask.result = result
                subtask.completed = True

        results_text = "\n\n".join(
            f"### {st.name}\n{st.result}"
            for st in subtasks if st.completed
        )
        response = self.client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": (
                    f"Synthesize into a coherent answer for: {task}\n\n{results_text}"
                )
            }]
        )
        return response.content[0].text

Plan-and-Execute with LangGraph

LangGraph models a workflow as a graph of nodes operating on shared state, which maps naturally onto plan-and-execute: one node plans, one executes, and a conditional edge decides whether to loop or finish.

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
from typing import TypedDict

llm = ChatAnthropic(model="claude-opus-4-6")


class PlanExecuteState(TypedDict):
    task: str
    plan: list[str]
    past_steps: list[tuple[str, str]]
    response: str


def planner_node(state: PlanExecuteState) -> dict:
    response = llm.invoke([
        HumanMessage(content=(
            f"Create a numbered step-by-step plan to accomplish:\n{state['task']}\n\n"
            f"Output only numbered steps, one per line. Maximum 6 steps."
        ))
    ])
    import re
    steps = re.findall(r'\d+\.\s+(.+)', response.content)
    return {"plan": steps}


def executor_node(state: PlanExecuteState) -> dict:
    if not state["plan"]:
        return {}
    current_step = state["plan"][0]
    past_text = "\n".join(f"- {s}: {r[:80]}" for s, r in state.get("past_steps", []))

    response = llm.invoke([
        HumanMessage(content=(
            f"Task: {state['task']}\n"
            f"Completed steps:\n{past_text}\n\n"
            f"Execute this step: {current_step}\n"
            f"Provide a concrete result."
        ))
    ])
    return {
        "past_steps": state.get("past_steps", []) + [(current_step, response.content)],
        "plan": state["plan"][1:]
    }


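def replanner_node(state: PlanExecuteState) -> dict:
    """Synthesize the final answer once the plan is exhausted.

    This simplified node never rewrites the remaining steps. A fuller
    replanner would return an updated "plan" here when results diverge.
    """
    if not state["plan"]:
        past = "\n".join(f"Step: {s}\nResult: {r}" for s, r in state["past_steps"])
        response = llm.invoke([
            HumanMessage(content=(
                f"Task: {state['task']}\n\n"
                f"Completed steps:\n{past}\n\n"
                f"Synthesize a final, complete answer."
            ))
        ])
        return {"response": response.content}
    return {}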


def should_continue(state: PlanExecuteState) -> str:
    return "done" if state.get("response") else "execute"


graph = StateGraph(PlanExecuteState)
graph.add_node("planner", planner_node)
graph.add_node("executor", executor_node)
graph.add_node("replanner", replanner_node)

graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_edge("executor", "replanner")
graph.add_conditional_edges(
    "replanner",
    should_continue,
    {"execute": "executor", "done": END}
)

app = graph.compile()

result = app.invoke({
    "task": "Research the top 3 Python web frameworks and compare their performance",
    "plan": [],
    "past_steps": [],
    "response": ""
})
print(result["response"])

Reflection and Self-Critique

Self-reflection adds a quality gate: after executing a step or completing a draft answer, the agent critiques its own output and decides whether to revise.

def reflection_step(
    task: str,
    draft_answer: str,
    criteria: list[str]
) -> tuple[str, bool]:
    """Have the agent critique its own output. Returns (critique, needs_revision)."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Review this response for: {task}\n\n"
                f"Response:\n{draft_answer}\n\n"
                f"Criteria:\n{criteria_text}\n\n"
                f"Output JSON: {{\"critique\": str, \"score\": 1-10, "
                f"\"needs_revision\": bool}}"
            )
        }]
    )
    json_match = re.search(r'\{.*\}', response.content[0].text, re.DOTALL)
    if json_match:
        data = json.loads(json_match.group())
        return data.get("critique", ""), data.get("needs_revision", False)
    return "Could not parse", False


def generate_with_reflection(task: str, max_revisions: int = 3) -> str:
    """Generate an answer and revise it based on self-critique."""
    criteria = [
        "Factually accurate - no hallucinated claims",
        "Complete - addresses all parts of the task",
        "Clear - readable for the intended audience",
        "Concise - no unnecessary verbosity"
    ]

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": task}]
    )
    draft = response.content[0].text

    for revision_num in range(max_revisions):
        critique, needs_revision = reflection_step(task, draft, criteria)
        print(f"[Revision {revision_num + 1}] {critique[:100]}...")

        if not needs_revision:
            break

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": (
                    f"Original task: {task}\n\n"
                    f"Previous answer:\n{draft}\n\n"
                    f"Critique:\n{critique}\n\n"
                    f"Please revise to address these issues."
                )
            }]
        )
        draft = response.content[0].text

    return draft

ReWOO: Planning Without Observation

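ReWOO (Reasoning Without Observation) generates the complete plan, including tool call arguments, before executing any tools. Arguments that depend on earlier results are written as placeholders (#E1, #E2, ...) and filled in at execution time. Planning happens in a single LLM pass, so no LLM call is needed between tool calls.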

def rewoo_agent(task: str, tool_registry: dict) -> str:
    """
    ReWOO: Generate full plan with tool calls upfront, then execute them.
    Saves LLM calls at the cost of flexibility.
    """
    tools_desc = "\n".join(
        f"#{name}: {fn.__doc__ or 'no description'}"
        for name, fn in tool_registry.items()
    )

    # Step 1: Plan with tool call predictions
    plan_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"You are a planner. Generate a complete execution plan.\n\n"
                f"Available tools:\n{tools_desc}\n\n"
                f"Task: {task}\n\n"
                f"Format:\n"
                f"Plan: [description]\n"
                f"#E1 = tool_name[argument]\n"
                f"Plan: [next step using #E1 result]\n"
                f"#E2 = tool_name[argument using #E1]\n"
                f"Use #E1, #E2, etc. to reference previous results."
            )
        }]
    )
    plan_text = plan_response.content[0].text

    # Step 2: Execute all tool calls
    tool_calls = re.findall(r'(#E\d+)\s*=\s*(\w+)\[([^\]]*)\]', plan_text)
    evidence = {}

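    for var_name, tool_name, tool_arg in tool_calls:
        # Substitute earlier results into the argument. Longer names go first
        # so that "#E1" does not overwrite the prefix of "#E10".
        for prev_var in sorted(evidence, key=len, reverse=True):
            tool_arg = tool_arg.replace(prev_var, str(evidence[prev_var])[:100])

        if tool_name not in tool_registry:
            # Record the failure so the final prompt can see the gap.
            evidence[var_name] = f"Error: Tool '{tool_name}' not found"
            continue

        try:
            evidence[var_name] = tool_registry[tool_name](tool_arg)
        except Exception as e:
            evidence[var_name] = f"Error: {e}"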

    # Step 3: Generate final answer using plan + evidence
    evidence_text = "\n".join(f"{k}: {v}" for k, v in evidence.items())
    final_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\n\n"
                f"Plan:\n{plan_text}\n\n"
                f"Tool results:\n{evidence_text}\n\n"
                f"Provide the final answer."
            )
        }]
    )
    return final_response.content[0].text

When Planning Helps vs Just Using ReAct

| Scenario | Best Approach | Why |
| --- | --- | --- |
| "What is the capital of France?" | Direct prompt | No planning needed |
| "Find today's stock price for AAPL" | ReAct (1 tool call) | Single step |
| "Write a 5-section competitive analysis" | Plan-and-Execute | Need upfront structure |
| "Debug this Python error" | ReAct | Each step depends on previous observations |
| "Research 10 companies and rank them" | Plan-and-Execute | Structured, partially parallelizable |
| "Generate and test code iteratively" | ReAct | Output shapes the next iteration |
| "Generate a research paper on X" | Hierarchical planning | Large enough for manager/worker split |

The threshold: if the task has a clear structure identifiable before execution, plan. If the task requires adapting to what you discover, use ReAct. Complex tasks often need both - plan the structure, use ReAct within each step.
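The "plan the structure, use ReAct within each step" combination can be sketched as an outer loop over plan steps with a bounded inner loop per step. This is a minimal sketch under stated assumptions: `decide` is a hypothetical stand-in for an LLM call that returns an `(action, argument)` pair, and `react_substep` and `hybrid_agent` are illustrative names, not part of any library.

```python
from typing import Callable

# decide(goal, observations) -> (action, argument); action "finish" ends the loop.
Decide = Callable[[str, list[str]], tuple[str, str]]


def react_substep(
    goal: str,
    decide: Decide,
    tools: dict[str, Callable[[str], str]],
    max_turns: int = 5,
) -> str:
    """Inner loop: act one step at a time until the sub-goal is finished."""
    observations: list[str] = []
    for _ in range(max_turns):  # bound the loop so one step cannot run away
        action, arg = decide(goal, observations)
        if action == "finish":
            return arg
        tool = tools.get(action)
        observations.append(tool(arg) if tool else f"Error: unknown tool '{action}'")
    # Turn budget exhausted: fall back to the most recent observation.
    return observations[-1] if observations else ""


def hybrid_agent(
    plan_steps: list[str],
    decide: Decide,
    tools: dict[str, Callable[[str], str]],
) -> dict[str, str]:
    """Outer loop: fixed plan structure, adaptive execution inside each step."""
    return {step: react_substep(step, decide, tools) for step in plan_steps}


# Scripted stand-in for the LLM: search once, then summarize.
def scripted_decide(goal: str, observations: list[str]) -> tuple[str, str]:
    if not observations:
        return "search", goal
    return "finish", f"summary of {observations[-1]}"


results = hybrid_agent(
    ["financials", "news"],
    scripted_decide,
    {"search": lambda q: f"results for {q}"},
)
```

The plan fixes coverage (every section gets a sub-agent), while the inner loop keeps the adaptivity that makes ReAct useful within a section.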

Production Engineering Notes

Plan Caching

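Plans for similar tasks often share the same structure. Cache plans by task template and reuse them, keeping in mind that cached step descriptions still mention the entities from whichever task first populated the cache, so they need to be re-parameterized before reuse.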

import re


def task_to_template(task: str) -> str:
    """Strip specific values to get a task template for cache lookup."""
    # Replace capitalized words (a rough proxy for names) and numbers
    # with placeholders. This also rewrites sentence-initial words.
    template = re.sub(r'\b[A-Z][a-z]+\b', 'ENTITY', task)
    template = re.sub(r'\d+', 'NUM', template)
    return template.lower().strip()


_plan_cache: dict[str, list[str]] = {}

def get_or_create_plan(task: str, tools: list[str]) -> list[str]:
    template = task_to_template(task)
    if template in _plan_cache:
        print(f"[CACHE HIT] Reusing plan for template: {template[:50]}")
        return _plan_cache[template]

    plan = generate_plan(task, tools)
    _plan_cache[template] = [s.description for s in plan.steps]
    return _plan_cache[template]

Parallelizing Independent Steps

When plan steps have no dependencies, execute them concurrently.

import asyncio

async def execute_steps_parallel(
    independent_steps: list[PlanStep],
    tool_registry: dict,
    context: dict
) -> list[str]:
    """Execute independent steps in parallel."""
    tasks = [
        asyncio.to_thread(execute_step, step, tool_registry, context)
        for step in independent_steps
    ]
    return await asyncio.gather(*tasks)

Common Mistakes

:::danger Generating Plans Too Early Without Context
A plan generated before any tool calls has zero information about the current state of the world. For fact-finding tasks, do one exploratory search first, then plan. "Plan-then-research" often produces worse plans than "research-then-plan."
:::

:::danger Following a Broken Plan Rigidly
If step 3 produces an error or unexpected result, blindly executing step 4 compounds the problem. Implement a monitor after each execution step that checks whether the plan still makes sense.
:::

:::warning Plans That Are Too Fine-Grained
A plan with 25 micro-steps requires at least 25 LLM calls and creates massive overhead. Keep plans high-level (5-8 steps maximum). Let each step be handled by a ReAct sub-agent if needed.
:::

:::warning Not Handling Parallel Opportunities
If your plan-and-execute runs all steps serially and several are independent, you are wasting time. Identify steps with no dependencies and run them concurrently with asyncio.gather.
:::

Interview Q&A

Q: What is the difference between ReAct and Plan-and-Execute?

ReAct is greedy: it decides the next single action based on the current state and keeps no explicit plan of what comes after. Plan-and-Execute generates a complete plan before execution begins, which allows for parallel steps, better structure on long tasks, and avoiding early decision traps. The tradeoff: Plan-and-Execute commits to a plan written before any observations exist (so it may be wrong and need replanning), while ReAct reacts to every observation but can lose track of the overall goal on long tasks.

Q: What is ReWOO and when is it better than ReAct?

ReWOO (Reasoning Without Observation) generates the entire plan including all tool call arguments before executing any tools. This reduces the number of LLM calls needed (one planning call instead of one per step) at the cost of flexibility. ReWOO is better when: the sequence of tool calls is predictable, arguments to later tools can be written down in advance (either from the task description or as placeholders for earlier results), and minimizing latency and cost is important. It is worse when the choice of later tool calls depends on what earlier ones actually return.
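A back-of-envelope model makes the saving visible. The sketch below assumes a ReAct loop makes one LLM call per tool step plus one for the final answer, re-reading the prompt and all earlier observations each time, while ReWOO makes one planner call and one solver call. Tokens generated by the model itself are ignored, and the function names and numbers are illustrative assumptions, not measurements.

```python
def react_cost(n_tools: int, prompt_tokens: int, obs_tokens: int) -> tuple[int, int]:
    """Return (llm_calls, input_tokens) for a linear ReAct trace."""
    calls = n_tools + 1  # one call per tool step, plus the final answer
    # Call i (0-based) re-reads the prompt plus the i observations so far.
    tokens = sum(prompt_tokens + i * obs_tokens for i in range(calls))
    return calls, tokens


def rewoo_cost(n_tools: int, prompt_tokens: int, obs_tokens: int) -> tuple[int, int]:
    """Return (llm_calls, input_tokens) for a ReWOO-style plan-then-solve run."""
    calls = 2  # planner + solver, regardless of how many tools run
    planner = prompt_tokens
    solver = prompt_tokens + n_tools * obs_tokens  # sees all evidence once
    return calls, planner + solver


react = react_cost(n_tools=5, prompt_tokens=500, obs_tokens=300)
rewoo = rewoo_cost(n_tools=5, prompt_tokens=500, obs_tokens=300)
```

Under these assumptions the ReAct input-token count grows quadratically with the number of tool steps, because every observation is re-read on each later call, while the ReWOO count grows linearly.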

Q: How do you detect when an agent is going off-plan?

Monitor execution by comparing the current state against plan expectations after each step. Specifically: did the step produce the type of output expected? Does the observation suggest that prior assumptions were wrong? Is the remaining plan still coherent given what we now know? A replanner node handles this by prompting the LLM to evaluate whether the remaining steps still make sense given the latest result.
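Some of these checks do not need an LLM at all. The following is a minimal sketch of a cheap rule-based monitor that can run before the LLM-based replanner: it flags error strings, empty output, and missing expected content. The `Expectation` class and `check_step_result` function are hypothetical names; the error prefixes match the strings returned by `execute_step` earlier in this lesson.

```python
from dataclasses import dataclass, field


@dataclass
class Expectation:
    """What a plan step is expected to produce (illustrative fields)."""
    must_contain: list[str] = field(default_factory=list)
    min_chars: int = 1


def check_step_result(result: str, expectation: Expectation) -> list[str]:
    """Return a list of problems; an empty list means the step looks on-plan."""
    problems: list[str] = []
    # execute_step reports failures as plain strings with these prefixes.
    if result.startswith(("Error:", "Tool error:")):
        problems.append("tool reported an error")
    if len(result.strip()) < expectation.min_chars:
        problems.append("output shorter than expected")
    lowered = result.lower()
    for keyword in expectation.must_contain:
        if keyword.lower() not in lowered:
            problems.append(f"missing expected content: {keyword}")
    return problems


ok = check_step_result(
    "Revenue figures were found in the annual filing.",
    Expectation(must_contain=["revenue"], min_chars=10),
)
bad = check_step_result(
    "Tool error: timeout",
    Expectation(must_contain=["revenue"], min_chars=10),
)
```

A non-empty problem list is a reasonable trigger for the more expensive LLM replanning call, which keeps the per-step monitoring cost low when execution is going as planned.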

Q: What is LATS and why is it expensive?

LATS (Language Agent Tree Search) applies Monte Carlo Tree Search to LLM agent trajectories. Instead of following one path, it maintains a tree of possible action sequences. At each node, it expands multiple possible next actions, evaluates their promise using a value function, and explores the most promising branches. This finds better solutions on hard tasks but requires many more LLM calls - potentially 10-50x more than a linear ReAct trace. Use it only for tasks where quality matters far more than cost.
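The "explore the most promising branches" step in MCTS is usually driven by the UCT rule, which balances a node's average value against how rarely it has been visited. Below is a minimal sketch of that selection rule only, with hypothetical dictionary-based nodes. It omits expansion, LLM-based evaluation, and backpropagation, so it is not a full LATS implementation.

```python
import math


def uct_score(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Upper Confidence bound for Trees: exploitation term + exploration bonus."""
    if visits == 0:
        return math.inf  # always try an unvisited branch first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: list[dict], parent_visits: int, c: float = 1.4) -> dict:
    """Pick the child with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )


# Hypothetical branches: "a" has a high average but is well explored,
# "b" has a lower average but has only been tried once.
CHILDREN = [
    {"name": "a", "value_sum": 3.6, "visits": 4},
    {"name": "b", "value_sum": 0.5, "visits": 1},
]
greedy_pick = select_child(CHILDREN, parent_visits=10, c=0.0)["name"]
uct_pick = select_child(CHILDREN, parent_visits=10)["name"]
```

With the exploration constant set to zero the rule degenerates to greedy selection, and with the default constant the rarely visited branch wins. Every extra branch explored this way costs additional LLM calls, which is where the expense comes from.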

Q: When should you use hierarchical planning?

Use hierarchical planning when the task is large enough that a single agent's context window would fill up with the full plan and all intermediate results. The manager agent handles decomposition and synthesis; worker agents handle execution. This also enables specialization: a code-writing agent, a research agent, and a critique agent can each be optimized for their role with tailored prompts and tools.

Q: How does self-reflection improve plan quality?

Self-reflection adds a quality gate where the agent critiques its own outputs against explicit criteria before marking a step complete. This catches errors earlier in the pipeline, before downstream steps build on wrong results. The key is making the reflection criteria explicit and measurable - vague criteria like "is this good?" produce weak critiques, while specific criteria like "does this cite at least 3 sources?" produce actionable feedback.
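Measurable criteria can often be checked in code before (or instead of) asking the model to critique itself. This is a minimal sketch, assuming citations appear as bracketed numbers such as [1]; the function names, the citation format, and the word limit are all illustrative assumptions.

```python
import re
from typing import Callable


def count_citations(text: str) -> int:
    """Count distinct bracketed numeric citation markers such as [1] or [12]."""
    return len(set(re.findall(r"\[(\d+)\]", text)))


# Each criterion pairs a human-readable name with a deterministic check.
CRITERIA: dict[str, Callable[[str], bool]] = {
    "cites at least 3 sources": lambda t: count_citations(t) >= 3,
    "at most 200 words": lambda t: len(t.split()) <= 200,
}


def failed_criteria(
    draft: str,
    criteria: dict[str, Callable[[str], bool]] = CRITERIA,
) -> list[str]:
    """Return the names of the criteria the draft does not meet."""
    return [name for name, check in criteria.items() if not check(draft)]


good_draft = "Revenue rose [1]. Market share is stable [2]. Leadership changed [3]."
weak_draft = "Revenue rose [1] and rose again [1]."
```

The names of the failed criteria can be passed to the revision prompt as the critique, which gives the model specific, checkable feedback.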
