Browser Agents

The Procurement Manager's Problem

Sarah manages procurement for a regional hospital network. Part of her job involves checking supplier pricing across seven different vendor portals - each one custom-built, each one requiring login, each one presenting data in a different format and layout. Some have APIs. Most don't.

Every Monday morning she spends three hours opening tabs, logging in, navigating to the right product categories, exporting data (if export is available, otherwise copying by hand), and pasting it into a master spreadsheet. The whole thing is manual. The whole thing is tedious. And the whole thing is exactly the kind of task that a browser agent can handle.

Browser agents are AI agents that interact with web pages using a combination of browser automation (Playwright, Selenium) and large language model reasoning. They can log in, navigate, fill forms, click through pagination, extract structured data, and handle the unpredictable reality of real web interfaces - dynamic content, session timeouts, unexpected modals, CAPTCHA challenges, and page layouts that change without warning.

They are not magic. They fail. But when built correctly, they handle failure gracefully and adapt in ways that brittle CSS-selector scripts never can.

This lesson is about building browser agents that actually work in production.


Why Browser Agents, Not Just Playwright?

Playwright is an excellent browser automation library. You can drive Chromium, Firefox, or WebKit (the engine behind Safari) programmatically: navigate pages, click elements, fill forms, extract data. For a stable web application with known structure, Playwright alone is the right tool.

But pure Playwright breaks in five common scenarios:

1. JavaScript-heavy single-page applications: The DOM changes dynamically after the initial load. Element IDs shift. React re-renders components. A hardcoded selector that worked yesterday fails today after a UI update.

2. Unpredictable navigation paths: Some workflows have conditional steps. "If the user has a business account, show the bulk pricing tab; otherwise show retail." Playwright code needs explicit if-else for every branch. An LLM can reason about which path to take.

3. CAPTCHA and bot detection: reCAPTCHA, hCaptcha, Cloudflare Turnstile. Playwright has no built-in way to solve these challenges; it only drives the page, and many challenges require reasoning about visual puzzles.

4. Error recovery: What happens when a form submission returns an error? When a session expires mid-workflow? When a popup appears unexpectedly? Playwright code needs explicit error handling for every known failure mode. An LLM can see an unexpected error message and adapt.

5. Ambiguous instructions: "Find the best deal on a 16GB DDR5 memory kit" cannot be expressed as a fixed sequence of clicks. It requires understanding search results, comparing options, and making a judgment call.

The solution is to combine Playwright's reliable browser control with LLM reasoning for navigation, interpretation, and error recovery.

Three Architectural Approaches

Approach 1: DOM-Based (Pure Playwright)

Interact with the page through the DOM directly. Use CSS selectors, XPath, or accessibility attributes. Fast (no LLM call per action). Brittle when the page structure changes.

Best for: High-volume, stable-structure tasks. Login flows to known systems. Extracting data from consistent HTML.

Approach 2: Vision-Based (Screenshot + LLM)

Take screenshots and let the LLM decide what to click. Uses the computer use architecture from Lesson 01. Handles any interface but is slow and expensive.

Best for: Unknown or frequently-changing interfaces. Tasks requiring visual reasoning. Desktop applications embedded in web views.

Approach 3: Hybrid (Recommended for most use cases)

Use the LLM to plan high-level navigation steps (which page to go to, what to click), but use Playwright's DOM access for reliable data extraction. Vision for reasoning, DOM for precision.
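One way to picture the hybrid split is a two-tier lookup: try cheap DOM selectors first and call the model only when they all miss. The sketch below is illustrative, not part of the agent built later in this lesson. `get_text` stands in for something like Playwright's `page.inner_text`, and `ask_llm` for a screenshot-based fallback; both names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def extract_field(
    selectors: Sequence[str],
    get_text: Callable[[str], Optional[str]],
    ask_llm: Callable[[], Optional[str]],
) -> tuple[Optional[str], str]:
    """Return (value, source), where source is 'dom', 'llm', or 'none'."""
    for selector in selectors:
        try:
            text = get_text(selector)
        except Exception:
            # A missing or stale selector is expected; try the next one.
            continue
        if text and text.strip():
            return text.strip(), "dom"

    # Every selector missed: fall back to the slower, costlier model call.
    answer = ask_llm()
    if answer and answer.strip():
        return answer.strip(), "llm"
    return None, "none"


if __name__ == "__main__":
    fake_dom = {".price-display": " $49.99 "}  # stand-in for a real page
    value, source = extract_field(
        [".price", ".price-display"],
        get_text=fake_dom.get,
        ask_llm=lambda: "$49.99",
    )
    print(value, source)
```

Returning the source alongside the value makes it easy to log how often the model fallback fires, which is a rough signal that the selectors need updating.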

Playwright as the Action Layer

Playwright is a common choice for the action layer of a browser agent. It provides the following (the selectors are placeholders for illustration):

from playwright.sync_api import sync_playwright, Page, Browser

# Browser control
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1280, "height": 720},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    )
    page = context.new_page()

    # Navigation
    page.goto("https://example.com", wait_until="networkidle")

    # Finding elements
    button = page.locator("button[type=submit]")
    input_field = page.locator("input[name=email]")
    any_element = page.locator("text=Submit Order")  # text-based selection

    # Actions
    page.click("button.login-btn")
    page.fill("input[name=password]", "secret")
    page.press("input[name=search]", "Enter")
    page.select_option("select[name=category]", "electronics")

    # Scrolling
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.mouse.wheel(0, 300)

    # Waiting
    page.wait_for_selector(".results-loaded", timeout=10000)
    page.wait_for_load_state("networkidle")

    # Screenshots (for the LLM)
    screenshot_bytes = page.screenshot(full_page=True)

    # Data extraction
    title = page.inner_text("h1.product-title")
    price = page.inner_text(".price-display")
    all_products = page.query_selector_all(".product-card")
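Calls such as `page.inner_text(".price-display")` return strings like `"$1,299.99"`, while downstream code usually wants a number. A minimal sketch of that conversion, assuming US-style formatting (comma as thousands separator, dot as decimal point); `parse_price` is a hypothetical helper, not a Playwright function:

```python
import re
from typing import Optional

# A digit, then digits or commas, then an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: str) -> Optional[float]:
    """Parse the first US-style number in text, or return None if absent."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


if __name__ == "__main__":
    for raw in ["$1,299.99", "Price: 49", "Call for price"]:
        print(repr(raw), "->", parse_price(raw))
```

European-style prices such as `1.299,99` would be misread by this pattern, so a multi-locale vendor list would need a locale-aware parser instead.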

Session Management

Real-world browser automation requires maintaining session state across multiple pages and potentially multiple visits.

import json
from pathlib import Path
from playwright.sync_api import BrowserContext


class SessionManager:
    """
    Manages browser session state: cookies, local storage, auth tokens.
    Enables resuming sessions without re-logging in.
    """

    def __init__(self, session_dir: str = "/tmp/browser_sessions"):
        self.session_dir = Path(session_dir)
        self.session_dir.mkdir(parents=True, exist_ok=True)

    def save_session(self, context: BrowserContext, session_name: str) -> None:
        """Save cookies and local storage for later reuse."""
        storage_state = context.storage_state()
        session_file = self.session_dir / f"{session_name}.json"
        session_file.write_text(json.dumps(storage_state, indent=2))
        print(f"Session saved: {session_file}")

    def load_session(self, playwright, session_name: str):
        """Load a saved session into a new browser context."""
        session_file = self.session_dir / f"{session_name}.json"

        browser = playwright.chromium.launch(headless=True)

        if session_file.exists():
            print(f"Loading existing session: {session_name}")
            context = browser.new_context(storage_state=str(session_file))
        else:
            print("No existing session, starting fresh")
            context = browser.new_context()

        return browser, context

    def session_exists(self, session_name: str) -> bool:
        return (self.session_dir / f"{session_name}.json").exists()

    def clear_session(self, session_name: str) -> None:
        session_file = self.session_dir / f"{session_name}.json"
        if session_file.exists():
            session_file.unlink()
            print(f"Session cleared: {session_name}")
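A saved session is only useful while its cookies are still valid. The sketch below checks a saved state file for an unexpired cookie before the agent decides whether to log in again. It assumes the JSON layout written by `storage_state()`, where each cookie has a `name` and an `expires` Unix timestamp (with `-1` marking a session cookie). The helper name `has_live_cookie` and the cookie name passed to it are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Optional


def has_live_cookie(
    session_file: Path,
    cookie_name: str,
    now: Optional[float] = None,
) -> bool:
    """True if the saved state holds cookie_name and it has not expired."""
    if not session_file.exists():
        return False
    now = time.time() if now is None else now
    state = json.loads(session_file.read_text())
    for cookie in state.get("cookies", []):
        if cookie.get("name") != cookie_name:
            continue
        expires = cookie.get("expires", -1)
        # Session cookies are stored with expires == -1; treat them as live.
        if expires == -1 or expires > now:
            return True
    return False


if __name__ == "__main__":
    path = Path("/tmp/browser_sessions/vendor_a.json")  # example path
    print("reuse session" if has_live_cookie(path, "sessionid") else "log in again")
```

This is a cheap local pre-check only. The server can still revoke a session early, so it is worth confirming after page load that an authenticated-only element is present.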

Complete Browser Agent Implementation

Now let's build a complete browser agent. The task: extract product pricing from an e-commerce site, handling login, search, filtering, and pagination.

"""
browser_agent.py

A complete browser agent that:
1. Logs into an e-commerce site
2. Searches for products
3. Filters results
4. Handles pagination
5. Extracts structured data
6. Handles errors gracefully

Uses Anthropic Claude for navigation reasoning + Playwright for actions.
"""

import anthropic
import base64
import json
import time
import re
from pathlib import Path
from typing import Optional
from dataclasses import dataclass, field
from playwright.sync_api import sync_playwright, Page, TimeoutError as PlaywrightTimeout
from pydantic import BaseModel


# --- Data Models ---

class Product(BaseModel):
    name: str
    price: float
    currency: str = "USD"
    url: str
    rating: Optional[float] = None
    review_count: Optional[int] = None
    in_stock: bool = True
    specs: dict = {}


class ExtractionResult(BaseModel):
    products: list[Product]
    page_number: int
    total_pages: Optional[int] = None  # not every site shows a page count
    search_query: str
    filters_applied: list[str]


# --- Navigation Tools ---

class BrowserTools:
    """
    Tools available to the browser agent.
    Each tool corresponds to a Playwright action.
    """

    def __init__(self, page: Page):
        self.page = page

    def screenshot(self) -> str:
        """Take a screenshot and return as base64."""
        screenshot_bytes = self.page.screenshot(full_page=False)
        return base64.standard_b64encode(screenshot_bytes).decode("utf-8")

    def navigate(self, url: str) -> dict:
        """Navigate to a URL."""
        try:
            self.page.goto(url, wait_until="domcontentloaded", timeout=30000)
            time.sleep(1)  # Let JS settle
            return {"success": True, "url": self.page.url}
        except PlaywrightTimeout:
            return {"success": False, "error": "Navigation timed out"}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def click(self, selector: str) -> dict:
        """Click an element by CSS selector or text."""
        try:
            # Try CSS selector first
            element = self.page.locator(selector).first
            element.click(timeout=5000)
            time.sleep(0.5)
            return {"success": True}
        except Exception:
            # Try text-based selection
            try:
                self.page.click(f"text={selector}", timeout=5000)
                time.sleep(0.5)
                return {"success": True}
            except Exception as e:
                return {"success": False, "error": f"Could not click '{selector}': {e}"}

    def type_text(self, selector: str, text: str, press_enter: bool = False) -> dict:
        """Type text into an input field."""
        try:
            self.page.fill(selector, text, timeout=5000)
            if press_enter:
                self.page.press(selector, "Enter")
                time.sleep(1)
            return {"success": True}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scroll_down(self, amount: int = 500) -> dict:
        """Scroll down by pixels."""
        self.page.evaluate(f"window.scrollBy(0, {amount})")
        time.sleep(0.3)
        return {"success": True}

    def extract_text(self, selector: str) -> str:
        """Extract text from an element."""
        try:
            return self.page.inner_text(selector, timeout=3000)
        except Exception:
            return ""

    def get_page_source(self) -> str:
        """Get full page HTML (for complex extraction)."""
        return self.page.content()

    def wait_for_element(self, selector: str, timeout: int = 10000) -> dict:
        """Wait for an element to appear."""
        try:
            self.page.wait_for_selector(selector, timeout=timeout)
            return {"success": True}
        except PlaywrightTimeout:
            return {"success": False, "error": f"Element '{selector}' not found"}

    def get_current_url(self) -> str:
        return self.page.url


# --- Main Browser Agent ---

class BrowserAgent:
    """
    Browser agent that uses Claude for navigation reasoning
    and Playwright for browser control.
    """

    SYSTEM_PROMPT = """You are a browser automation agent. You control a web browser to complete tasks.

You have the following tools:
- screenshot: Take a screenshot to see the current page state
- navigate(url): Go to a URL
- click(selector): Click an element (CSS selector or visible text)
- type_text(selector, text, press_enter): Fill an input field
- scroll_down(amount): Scroll down by pixels
- extract_text(selector): Get text from an element
- wait_for_element(selector): Wait for an element to appear

Guidelines:
1. Always take a screenshot first to understand the current page state
2. For login forms: locate username/email field, fill it, locate password field, fill it, click submit
3. For search: find the search bar, type the query, press Enter or click search button
4. For pagination: look for "Next" button or page numbers
5. After each significant action, take a screenshot to verify success
6. If something fails, take a screenshot and try an alternative approach
7. When you have extracted all needed data, respond with a JSON summary

For data extraction, return a JSON object with this structure:
{
  "products": [
    {
      "name": "product name",
      "price": 99.99,
      "url": "https://...",
      "rating": 4.5,
      "in_stock": true
    }
  ],
  "page_number": 1,
  "total_pages": 5,
  "has_next_page": true
}

IMPORTANT: Only interact with elements you can see in the screenshot. Do not guess at selectors.
"""

    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)
        self._define_tools()

    def _define_tools(self):
        """Define tools for Claude to call."""
        self.tools = [
            {
                "name": "screenshot",
                "description": "Take a screenshot of the current browser state",
                "input_schema": {
                    "type": "object",
                    "properties": {},
                    "required": []
                }
            },
            {
                "name": "navigate",
                "description": "Navigate the browser to a URL",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "url": {"type": "string", "description": "The URL to navigate to"}
                    },
                    "required": ["url"]
                }
            },
            {
                "name": "click",
                "description": "Click an element on the page by CSS selector or visible text",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "selector": {
                            "type": "string",
                            "description": "CSS selector or visible text of element to click"
                        }
                    },
                    "required": ["selector"]
                }
            },
            {
                "name": "type_text",
                "description": "Type text into an input field",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "selector": {"type": "string", "description": "CSS selector for the input"},
                        "text": {"type": "string", "description": "Text to type"},
                        "press_enter": {
                            "type": "boolean",
                            "description": "Press Enter after typing",
                            "default": False
                        }
                    },
                    "required": ["selector", "text"]
                }
            },
            {
                "name": "scroll_down",
                "description": "Scroll down the page",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "amount": {
                            "type": "integer",
                            "description": "Pixels to scroll (default 500)",
                            "default": 500
                        }
                    },
                    "required": []
                }
            },
            {
                "name": "extract_text",
                "description": "Extract text from an element",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "selector": {"type": "string", "description": "CSS selector"}
                    },
                    "required": ["selector"]
                }
            },
            {
                "name": "wait_for_element",
                "description": "Wait for an element to appear on the page",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "selector": {"type": "string"},
                        "timeout": {
                            "type": "integer",
                            "description": "Milliseconds to wait",
                            "default": 10000
                        }
                    },
                    "required": ["selector"]
                }
            }
        ]

    def _process_tool_call(
        self, tool_name: str, tool_input: dict, browser_tools: BrowserTools
    ) -> str:
        """Execute a tool call and return result as string."""
        print(f"  Tool: {tool_name}({json.dumps(tool_input)})")

        if tool_name == "screenshot":
            b64 = browser_tools.screenshot()
            return b64  # Will be handled as image

        elif tool_name == "navigate":
            result = browser_tools.navigate(tool_input["url"])
            return json.dumps(result)

        elif tool_name == "click":
            result = browser_tools.click(tool_input["selector"])
            return json.dumps(result)

        elif tool_name == "type_text":
            result = browser_tools.type_text(
                tool_input["selector"],
                tool_input["text"],
                tool_input.get("press_enter", False)
            )
            return json.dumps(result)

        elif tool_name == "scroll_down":
            result = browser_tools.scroll_down(tool_input.get("amount", 500))
            return json.dumps(result)

        elif tool_name == "extract_text":
            text = browser_tools.extract_text(tool_input["selector"])
            return text or "(element not found or empty)"

        elif tool_name == "wait_for_element":
            result = browser_tools.wait_for_element(
                tool_input["selector"],
                tool_input.get("timeout", 10000)
            )
            return json.dumps(result)

        return f"Unknown tool: {tool_name}"

    def run(
        self,
        start_url: str,
        task: str,
        credentials: Optional[dict] = None,
        max_steps: int = 40
    ) -> dict:
        """
        Run the browser agent.

        Args:
            start_url: Where to start
            task: What to do
            credentials: Optional dict with 'username' and 'password'
            max_steps: Maximum tool calls before stopping
        """

        with sync_playwright() as playwright:
            browser = playwright.chromium.launch(
                headless=True,
                args=[
                    "--no-sandbox",
                    "--disable-blink-features=AutomationControlled",
                ]
            )

            context = browser.new_context(
                viewport={"width": 1280, "height": 720},
                user_agent=(
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/120.0.0.0 Safari/537.36"
                )
            )

            page = context.new_page()
            browser_tools = BrowserTools(page)

            # Navigate to starting URL
            nav_result = browser_tools.navigate(start_url)
            if not nav_result["success"]:
                return {"success": False, "error": f"Failed to load {start_url}"}

            # Take initial screenshot
            initial_screenshot = browser_tools.screenshot()

            # Build task message
            task_content = task
            if credentials:
                task_content += (
                    f"\n\nCredentials: username='{credentials['username']}', "
                    f"password='{credentials['password']}'"
                )

            messages = [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": initial_screenshot,
                            }
                        },
                        {"type": "text", "text": task_content}
                    ]
                }
            ]

            step_count = 0
            extracted_data = None

            while step_count < max_steps:
                step_count += 1
                print(f"\nStep {step_count}/{max_steps}")

                response = self.client.messages.create(
                    model="claude-3-5-sonnet-20241022",
                    max_tokens=4096,
                    system=self.SYSTEM_PROMPT,
                    tools=self.tools,
                    messages=messages,
                )

                messages.append({
                    "role": "assistant",
                    "content": response.content
                })

                if response.stop_reason == "end_turn":
                    # Extract final result
                    for block in response.content:
                        if hasattr(block, "text"):
                            # Try to parse JSON from the response
                            try:
                                json_match = re.search(
                                    r'\{[\s\S]*"products"[\s\S]*\}',
                                    block.text
                                )
                                if json_match:
                                    extracted_data = json.loads(json_match.group())
                            except json.JSONDecodeError:
                                pass
                    break

                if response.stop_reason != "tool_use":
                    break

                # Process tool calls
                tool_results = []
                for block in response.content:
                    if block.type != "tool_use":
                        continue

                    tool_result = self._process_tool_call(
                        block.name, block.input, browser_tools
                    )

                    # Handle screenshot specially (it returns image data)
                    if block.name == "screenshot":
                        tool_result_content = [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/png",
                                    "data": tool_result,
                                }
                            }
                        ]
                    else:
                        tool_result_content = [
                            {"type": "text", "text": tool_result}
                        ]

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": tool_result_content,
                    })

                messages.append({"role": "user", "content": tool_results})

            browser.close()

            return {
                "success": extracted_data is not None,
                "steps": step_count,
                "data": extracted_data,
                "url": start_url,
            }


# --- Anti-bot Detection Handling ---

class AntiDetectionContext:
    """
    Creates a browser context that minimizes bot detection signals.
    """

    @staticmethod
    def create_context(playwright):
        browser = playwright.chromium.launch(
            headless=True,
            # Only flags related to automation signals are set here. Flags that
            # switch off browser security (same-origin policy, site isolation)
            # do not reduce detection and are left out.
            args=[
                "--no-sandbox",
                "--disable-blink-features=AutomationControlled",
            ]
        )

        context = browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/121.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
            permissions=["geolocation"],
        )

        # Override navigator.webdriver
        context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });
            window.chrome = { runtime: {} };
        """)

        return browser, context

    @staticmethod
    def human_delay(min_ms: float = 100, max_ms: float = 500) -> None:
        """Add random human-like delay."""
        import random
        delay = random.uniform(min_ms, max_ms) / 1000
        time.sleep(delay)


# --- Pagination Handler ---

class PaginationHandler:
    """
    Handles various pagination patterns found in the wild.
    """

    NEXT_BUTTON_SELECTORS = [
        "a[aria-label='Next']",
        "button[aria-label='Next']",
        ".pagination-next",
        ".next-page",
        "a:has-text('Next')",
        "button:has-text('Next')",
        "[data-testid='next-page']",
    ]

    def __init__(self, page: Page):
        self.page = page

    def has_next_page(self) -> bool:
        """Check if there's a next page available."""
        for selector in self.NEXT_BUTTON_SELECTORS:
            try:
                element = self.page.locator(selector).first
                if element.is_visible(timeout=2000):
                    return True
            except Exception:
                continue
        return False

    def go_to_next_page(self) -> bool:
        """Navigate to the next page. Returns True if successful."""
        for selector in self.NEXT_BUTTON_SELECTORS:
            try:
                element = self.page.locator(selector).first
                if element.is_visible(timeout=2000):
                    element.click()
                    self.page.wait_for_load_state("networkidle", timeout=10000)
                    return True
            except Exception:
                continue

        # Try infinite scroll detection
        prev_height = self.page.evaluate("document.body.scrollHeight")
        self.page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)
        new_height = self.page.evaluate("document.body.scrollHeight")

        if new_height > prev_height:
            return True  # Infinite scroll loaded more content

        return False

    def get_current_page(self) -> int:
        """Try to determine the current page number."""
        try:
            # Look for active pagination element
            active = self.page.locator(
                ".pagination .active, .page-item.active, [aria-current='page']"
            ).first
            text = active.inner_text(timeout=2000)
            return int(text.strip())
        except Exception:
            return 1


# --- Example Usage ---

if __name__ == "__main__":
    import os

    # Example: extract laptop prices from a demo e-commerce site
    agent = BrowserAgent(api_key=os.environ["ANTHROPIC_API_KEY"])

    result = agent.run(
        start_url="https://webscraper.io/test-sites/e-commerce/allinone",
        task=(
            "Search for laptops. For each laptop on the first page, extract: "
            "name, price, and rating. Return the data as a JSON object with "
            "a 'products' array."
        ),
        max_steps=25
    )

    print("\n" + "=" * 60)
    print("EXTRACTION RESULT")
    print("=" * 60)
    print(f"Success: {result['success']}")
    print(f"Steps used: {result['steps']}")
    if result.get("data"):
        print(f"Products found: {len(result['data'].get('products', []))}")
        print(json.dumps(result["data"], indent=2))
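The agent above pulls its final JSON out of the model's reply with a greedy regular expression, which can fail when the reply contains more than one pair of braces. A possible alternative, sketched here with only the standard library, scans for each `{` and lets `json.JSONDecoder.raw_decode` find where a valid object ends. `find_products_json` is a hypothetical helper and is not wired into `BrowserAgent`.

```python
import json
from typing import Optional


def find_products_json(text: str) -> Optional[dict]:
    """Return the first JSON object in text that has a 'products' key."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "products" in obj:
            return obj
        start = text.find("{", start + 1)
    return None


if __name__ == "__main__":
    reply = (
        'Status: {"note": 1}. Result: '
        '{"products": [{"name": "A", "price": 1.5}], "page_number": 1} '
        "Done {ok}."
    )
    print(find_products_json(reply))
```

On the sample reply, a greedy first-brace-to-last-brace match would span all three brace groups and fail to parse, while the scan returns the middle object.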

The Browser Agent Loop with Error Recovery
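A minimal sketch of one recovery step in the loop, assuming tools return the `{"success": ..., "error": ...}` dictionaries used by `BrowserTools`: retry a failed action a bounded number of times with a growing delay, then hand the last error back to the model so it can choose a different approach. `with_retries` and its parameters are illustrative names, not part of the agent above.

```python
import time
from typing import Callable


def with_retries(
    action: Callable[[], dict],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run action until it reports success or attempts run out."""
    result: dict = {"success": False, "error": "action never ran"}
    for attempt in range(1, attempts + 1):
        try:
            result = action()
        except Exception as exc:
            # Convert crashes into the same shape as a failed tool result.
            result = {"success": False, "error": str(exc)}
        if result.get("success"):
            return result
        if attempt < attempts:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    result["attempts"] = attempts
    return result


if __name__ == "__main__":
    outcomes = iter([{"success": False, "error": "timeout"}, {"success": True}])
    print(with_retries(lambda: next(outcomes), base_delay=0.1))
```

Retrying blindly only helps with transient failures such as slow loads. For a failure that repeats, returning the error text to the LLM and letting it take a fresh screenshot is usually the more useful recovery path.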

Handling Anti-Bot Measures

Modern websites deploy multiple anti-bot techniques. Understanding them is essential for production browser agents.

1. User-agent detection. Solution: use a real, current Chrome user-agent string. Rotate periodically.

2. Headless browser detection. Detection methods: navigator.webdriver === true, missing Chrome extensions array, empty plugin list, no screen colors. Solution: override these properties with add_init_script(). Use a headful browser for sensitive sites.

3. Behavioral fingerprinting. Detection: too-fast typing, perfectly spaced clicks, no mouse movement between actions. Solution: add random delays (100–500ms between actions), occasional mouse movements, varied typing speeds.

4. IP-based rate limiting. Solution: rotate proxies. Residential proxies are most effective (Bright Data, Oxylabs).

5. CAPTCHA challenges. Types: reCAPTCHA v2 (image selection), reCAPTCHA v3 (behavioral score), hCaptcha, Cloudflare Turnstile. Solutions:

- 2captcha.com / anti-captcha.com: paid services that solve CAPTCHAs via human workers (~$1–3 per 1000 challenges, 10–60s latency)
- Browser automation that behaves enough like a human to avoid triggering v3
- Human-in-the-loop: pause and ask a human to solve it
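The "varied typing speeds" idea from the behavioral fingerprinting item can be sketched as a function that produces one delay per character. The ranges below are arbitrary assumptions for illustration, not measured human timings, and `keystroke_delays` is a hypothetical helper.

```python
import random
from typing import Optional


def keystroke_delays(
    text: str,
    min_ms: float = 60,
    max_ms: float = 180,
    pause_ms: float = 250,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return one delay in milliseconds per character of text."""
    if rng is None:
        rng = random.Random()
    delays = []
    for char in text:
        delay = rng.uniform(min_ms, max_ms)
        if char == " ":
            # Add a short extra hesitation between words.
            delay += rng.uniform(0, pause_ms)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    for char, delay in zip("hello world", keystroke_delays("hello world")):
        print(repr(char), f"{delay:.0f} ms")
```

Playwright's `page.keyboard.type(text, delay=...)` applies one fixed delay to every key, so using per-character delays like these would mean typing one character at a time and sleeping in between. Only do this on sites where automated access is permitted.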

:::warning Legal and Ethical Considerations

Before deploying a browser agent against any website:

- Check robots.txt: https://example.com/robots.txt. Respect Disallow directives.
- Read the Terms of Service: Many sites explicitly prohibit automated access. Violating ToS may expose you to legal risk.
- Rate limiting: Never hammer a site. Add delays (1–5 seconds between requests). Prefer off-peak hours.
- Personal data: GDPR and CCPA restrict automated collection of personal data.
- Copyright: Scraped content may be copyrighted. Transformation and analysis are generally lower risk than republication, but the rules vary by jurisdiction.

When in doubt, contact the site owner and ask for an API or data feed. Many will provide one rather than deal with scrapers.

:::
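The robots.txt check can be automated with Python's standard `urllib.robotparser`. This sketch parses rules that have already been downloaded, so it runs without network access; the agent name `procurement-agent` and the sample rules are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /admin/\n"
    for url in ("https://example.com/products", "https://example.com/admin/users"):
        verdict = "allowed" if is_allowed(rules, "procurement-agent", url) else "disallowed"
        print(url, "->", verdict)
```

In a live agent, `RobotFileParser.set_url()` followed by `read()` fetches the file directly. A natural place for the check is `BrowserTools.navigate`, before calling `page.goto`.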

:::danger Session Credentials Security

Never hardcode credentials in your agent code. Use environment variables or a secrets manager:

import os
from dotenv import load_dotenv

load_dotenv()

credentials = {
    "username": os.environ["VENDOR_USERNAME"],
    "password": os.environ["VENDOR_PASSWORD"],
}

Session files saved by SessionManager contain authentication tokens. Treat them as sensitive - do not commit to version control, do not log their contents, and encrypt them at rest.

:::
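The same care applies to logs. `_process_tool_call` prints every tool input, which would include a password passed to `type_text`. A minimal sketch of masking such values before printing, assuming sensitive fields can be recognized from the selector text; the hint list and the `redact_tool_input` helper are hypothetical and would need tuning per site.

```python
import json

# Substrings that suggest a selector targets a secret field (assumed list).
SENSITIVE_SELECTOR_HINTS = ("password", "passwd", "secret", "token")


def redact_tool_input(tool_name: str, tool_input: dict) -> dict:
    """Return a copy of tool_input that is safe to print or log."""
    safe = dict(tool_input)  # shallow copy so the real input is untouched
    selector = str(safe.get("selector", "")).lower()
    if tool_name == "type_text" and any(
        hint in selector for hint in SENSITIVE_SELECTOR_HINTS
    ):
        safe["text"] = "***"
    return safe


if __name__ == "__main__":
    call = {"selector": "input[name=password]", "text": "hunter2"}
    print(json.dumps(redact_tool_input("type_text", call)))
```

Selector matching is a heuristic. A stricter variant would compare the typed text against the known credential values and mask any exact match.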

Interview Questions and Answers

Q: What are the three main approaches to browser automation, and when would you choose each?

A: (1) DOM-based (pure Playwright/Selenium): direct manipulation of HTML elements via CSS selectors. Fast, cheap, reliable for stable sites. Fails when DOM changes or JavaScript dynamically renders content. (2) Vision-based (screenshot + LLM): take screenshots, let the LLM reason about what to click. Handles any interface, adapts to changes, but slow (~2–5s per action) and expensive. (3) Hybrid: use LLM for high-level navigation planning, Playwright for reliable extraction. Best for most production use cases. Choose DOM-only for high-volume stable workflows, vision-only for unknown/changing interfaces, hybrid for most real-world tasks.

Q: How do you handle session management in a browser agent that needs to authenticate?

A: Use Playwright's storage_state() to capture cookies and local storage after successful authentication. Save this to a JSON file. On subsequent runs, load the saved state with browser.new_context(storage_state=path) - this restores the authenticated session without re-logging in. Check if the session is still valid by verifying that authenticated-only elements are present after loading. If the session has expired, re-run the login flow and save a fresh state.

Q: Explain the most common anti-bot detection techniques and how browser agents can handle them.

A: Key techniques: (1) User-agent inspection - use a real, current Chrome UA string. (2) webdriver detection via navigator.webdriver - override with add_init_script. (3) Behavioral fingerprinting (too-fast actions, no mouse movement) - add random delays, occasionally move the mouse. (4) IP rate limiting - use proxy rotation. (5) CAPTCHAs - use solving services like 2captcha for automation, or human-in-the-loop for sensitive workflows. The key principle is to make automated behavior indistinguishable from a real user at each detection layer.

Q: How do you handle pagination in a browser agent without hardcoding selectors?

A: Use a multi-strategy approach: (1) check common pagination selectors (.pagination-next, [aria-label='Next'], button:has-text('Next')), (2) let the LLM identify the "Next" button from a screenshot, (3) detect infinite scroll by comparing document.body.scrollHeight before and after scrolling to the bottom, (4) look for page number indicators to determine current/total pages. Combine LLM flexibility with Playwright's reliable selector engine for the actual click.

Q: What is the difference between using Claude's Computer Use API versus using Claude with a custom Playwright tool set for browser automation?

A: Computer Use API: screenshot-based, Claude operates via pixel coordinates, works in a full desktop environment, can handle any application (browser, desktop apps, etc.), higher latency since every action requires a screenshot round-trip, uses the specialized computer tool. Custom Playwright toolset: Claude uses named tools (navigate, click(selector), extract_text), which are more reliable (no coordinate guessing), lower latency (DOM operations are fast), web-only but much cheaper to operate. For pure web automation, a custom Playwright toolset usually outperforms Computer Use in speed and cost. Computer Use is valuable for non-web interfaces or when DOM access is unreliable.
