Web Scraping Agents

When the Scraper Breaks at Midnight

It is 2:47 AM when the Slack alert fires. The competitive intelligence pipeline has failed. Twelve hours of product pricing data is missing from the dashboard that 200 sales reps will open at 8 AM.

The traditional scraper - 3,000 lines of Python with hardcoded CSS selectors, manual cookie handling, and a pile of time.sleep() calls - has broken. Again. The e-commerce site being scraped rolled out a UI update at midnight. Three critical selectors now point to elements that no longer exist. The price container changed from .product-price to [data-testid="pricing-display"]. The pagination button moved inside a new wrapper. The login form now has a dynamic CSRF token with a 60-second expiry.

This happens. It happens regularly. And the traditional response - wake up an engineer, have them inspect the new DOM, update the selectors, redeploy - is expensive and fragile. The next UI update will break it again.

Agent-based scraping solves this differently. Instead of brittle selectors, the agent uses a combination of Playwright's browser control and LLM reasoning to navigate the site as a human would. When the pricing element moves, the agent finds it by visual context and label text rather than CSS class. When pagination changes, the agent looks for "Next" by meaning rather than selector. It is slower and more expensive per page than a direct scraper - but it keeps working after UI changes.

This lesson covers building production-ready scraping agents.


When to Use Agent-Based Scraping

Not every scraping task needs an agent. Traditional scrapers are faster, cheaper, and simpler when the target is stable. Use agents selectively.

Use traditional scraping (requests + BeautifulSoup) for: static HTML pages, well-documented APIs disguised as websites, RSS feeds, sitemaps.

Use Playwright with CSS selectors for: JavaScript-rendered SPAs with stable DOM structure, login flows to known services where you control the auth credentials.

Use agent-based scraping for: sites that change frequently, multi-step workflows with conditional paths, sites with CAPTCHAs or aggressive anti-bot measures, legacy sites with unpredictable HTML, or any site where selector maintenance is becoming expensive.
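
The three tiers above can be read as a small decision procedure. The sketch below encodes them under simplifying assumptions: the `SiteProfile` fields and the `choose_approach` function are hypothetical names made up for illustration, and a real decision would also weigh volume and cost per page.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    """Hypothetical description of a scraping target (fields are illustrative)."""
    needs_javascript: bool = False        # content only appears after JS runs
    stable_dom: bool = True               # selectors rarely change
    conditional_workflow: bool = False    # multi-step flow with branching paths
    aggressive_anti_bot: bool = False     # CAPTCHAs or heavy bot detection
    selector_upkeep_costly: bool = False  # selector maintenance is a burden


def choose_approach(site: SiteProfile) -> str:
    """Return 'requests', 'playwright', or 'agent' following the tiers above."""
    if (not site.stable_dom or site.conditional_workflow
            or site.aggressive_anti_bot or site.selector_upkeep_costly):
        return "agent"
    if site.needs_javascript:
        return "playwright"
    return "requests"


if __name__ == "__main__":
    print(choose_approach(SiteProfile()))
    print(choose_approach(SiteProfile(needs_javascript=True)))
    print(choose_approach(SiteProfile(needs_javascript=True, stable_dom=False)))
```

The agent tier is checked first because any one of its conditions is enough to outweigh the speed advantage of the cheaper tiers.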

Handling JavaScript Rendering

JavaScript-rendered sites are the main reason to choose Playwright over requests for modern web scraping.

"""
Why JavaScript rendering matters for scraping.
"""

# This will get an empty product list from a React SPA:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://spa-ecommerce-site.com/laptops")
soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all(class_="product-card")
print(f"Found {len(products)} products")  # Prints: Found 0 products
# requests never executes JavaScript, so it only receives the HTML
# skeleton, not the fully rendered page.

# This correctly waits for JavaScript to render the content:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://spa-ecommerce-site.com/laptops")

    # Wait for the specific element that proves rendering is complete
    page.wait_for_selector(".product-card", timeout=15000)

    # NOW we can get the rendered content
    products = page.query_selector_all(".product-card")
    print(f"Found {len(products)} products")  # Prints correct number

    browser.close()
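
Before reaching for Playwright, it can help to check whether a plain HTTP response is just an empty application shell. The following is a rough heuristic sketch using only the standard library. The function name `looks_client_rendered` and the 200-character threshold are assumptions chosen for illustration, and the check will misjudge some pages.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that is outside script/style blocks
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: scripts present but very little visible text suggests an SPA shell."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.text_chars < min_text_chars
```

If the check returns True for the body fetched with `requests`, the page probably needs a real browser to produce its content.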

Key Playwright patterns for JavaScript-heavy sites:

from playwright.sync_api import Page

def wait_for_render(page: Page, timeout: int = 15000) -> bool:
    """Wait for various signals that JS rendering is complete."""
    try:
        # Strategy 1: Wait for network to be idle (no pending XHR/fetch)
        page.wait_for_load_state("networkidle", timeout=timeout)
        return True
    except Exception:
        pass

    try:
        # Strategy 2: Wait for a specific element that appears after render
        page.wait_for_selector("[data-loaded='true']", timeout=timeout // 2)
        return True
    except Exception:
        pass

    try:
        # Strategy 3: Wait for React/Vue hydration marker
        page.wait_for_function(
            "() => document.querySelector('[data-react-hydrated]') !== null",
            timeout=timeout // 2
        )
        return True
    except Exception:
        pass

    # Fallback: no positive signal was observed, so wait a fixed time and
    # report that rendering could not be confirmed
    page.wait_for_timeout(3000)
    return False


def extract_after_scroll(page: Page, item_selector: str) -> list:
    """Extract items that load progressively as you scroll (infinite scroll)."""
    all_items = set()
    prev_count = 0

    for _ in range(10):  # Max 10 scroll attempts
        # Get current items
        items = page.query_selector_all(item_selector)
        current_hrefs = set()

        for item in items:
            href = item.get_attribute("href") or item.inner_text()
            current_hrefs.add(href)

        all_items.update(current_hrefs)

        if len(all_items) == prev_count:
            break  # No new items loaded

        prev_count = len(all_items)

        # Scroll to bottom
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # Wait for new items to load

    return list(all_items)
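
The scroll loop above stops the first time a scroll produces no new items, which can end a run early when the site is slow to load the next batch. One possible refinement is to tolerate a few empty rounds before giving up. The sketch below is independent of Playwright: `collect_until_stable`, `get_items`, `advance`, and `patience` are hypothetical names, and the two callbacks stand in for "read the current items" and "scroll and wait".

```python
from typing import Callable, Hashable, Iterable


def collect_until_stable(
    get_items: Callable[[], Iterable[Hashable]],
    advance: Callable[[], None],
    max_rounds: int = 10,
    patience: int = 2,
) -> list:
    """Collect unique items, stopping after `patience` consecutive rounds with nothing new."""
    seen: dict = {}   # a dict keeps first-seen order while deduplicating
    stale_rounds = 0
    for _ in range(max_rounds):
        before = len(seen)
        for item in get_items():
            seen.setdefault(item, None)
        if len(seen) == before:
            stale_rounds += 1
            if stale_rounds >= patience:
                break  # nothing new for several rounds in a row
        else:
            stale_rounds = 0
        advance()  # e.g. scroll to the bottom and pause
    return list(seen)
```

With Playwright, `get_items` could return the `href` values of the current elements and `advance` could run the same scroll-and-wait calls used above.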

Login Flows and Session Management

Many valuable data sources require authentication. Agent-based scraping handles login flows that would break traditional scrapers.

"""
session_manager.py

Robust session management for authenticated scraping.
Handles: form login, cookie persistence, session validation, 2FA detection.
"""

import json
import time
from pathlib import Path
from typing import Optional
from playwright.sync_api import sync_playwright, Page, BrowserContext


class AuthenticatedSession:
    """
    Manages authenticated browser sessions with persistence.
    Avoids re-logging in on every scrape run.
    """

    def __init__(self, session_name: str,
                 session_dir: str = "/tmp/scrape_sessions"):
        self.session_name = session_name
        self.session_file = Path(session_dir) / f"{session_name}.json"
        self.session_file.parent.mkdir(parents=True, exist_ok=True)

    def save(self, context: BrowserContext) -> None:
        state = context.storage_state()
        self.session_file.write_text(json.dumps(state))
        print(f"Session saved: {self.session_file}")

    def load_context(self, browser, viewport=None):
        """Create a context with saved session, or fresh if none exists."""
        kwargs = {"viewport": viewport or {"width": 1280, "height": 720}}

        if self.session_file.exists():
            kwargs["storage_state"] = str(self.session_file)
            print(f"Loading saved session: {self.session_name}")
        else:
            print("No saved session, starting fresh")

        return browser.new_context(**kwargs)

    def is_valid(self) -> bool:
        """Check if session file exists and is recent."""
        if not self.session_file.exists():
            return False
        # Sessions older than 12 hours are likely expired
        age = time.time() - self.session_file.stat().st_mtime
        return age < 43200  # 12 hours


def perform_login(page: Page, username: str, password: str,
                  login_url: str) -> bool:
    """
    Attempt to log into a site.
    Returns True if login appears successful.
    """
    page.goto(login_url, wait_until="networkidle")

    # Common form selectors (try each until one works)
    username_selectors = [
        "input[type=email]",
        "input[type=text][name*=user]",
        "input[name=email]",
        "input[name=username]",
        "input[id*=email]",
        "input[id*=user]",
        "#username",
        "#email",
    ]

    password_selectors = [
        "input[type=password]",
        "input[name=password]",
        "input[id*=password]",
        "#password",
    ]

    submit_selectors = [
        "button[type=submit]",
        "input[type=submit]",
        "button:has-text('Sign in')",
        "button:has-text('Log in')",
        "button:has-text('Login')",
        ".login-button",
        "#login-btn",
    ]

    # Fill username
    for sel in username_selectors:
        try:
            page.fill(sel, username, timeout=3000)
            break
        except Exception:
            continue

    # Fill password
    for sel in password_selectors:
        try:
            page.fill(sel, password, timeout=3000)
            break
        except Exception:
            continue

    # Submit
    for sel in submit_selectors:
        try:
            page.click(sel, timeout=3000)
            page.wait_for_load_state("networkidle", timeout=15000)
            break
        except Exception:
            continue

    # Verify login success
    # First check for common "login failed" messages
    failure_indicators = [
        "Invalid credentials",
        "Login failed",
        "Incorrect password",
        "We couldn't find",
        "Please try again",
    ]

    page_text = page.inner_text("body")
    for indicator in failure_indicators:
        if indicator.lower() in page_text.lower():
            print(f"Login failed: found '{indicator}' on page")
            return False

    # Check for successful login indicators
    success_indicators = [
        page.url != login_url,  # Redirected away from login page
        page.query_selector(".dashboard") is not None,
        page.query_selector("[data-testid='user-menu']") is not None,
    ]

    return any(success_indicators)


def detect_2fa(page: Page) -> Optional[str]:
    """
    Detect if a 2FA challenge is present after login.
    Returns the type of 2FA or None.
    """
    two_fa_indicators = [
        ("sms", ["Enter the code", "SMS code", "text message"]),
        ("totp", ["Authenticator app", "TOTP", "6-digit code"]),
        ("email", ["Check your email", "email code", "confirmation link"]),
        ("captcha", ["reCAPTCHA", "hCaptcha", "verify you're human"]),
    ]

    page_text = page.inner_text("body").lower()

    for fa_type, indicators in two_fa_indicators:
        for indicator in indicators:
            if indicator.lower() in page_text:
                return fa_type

    return None
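
`is_valid` above judges a session only by the age of the file. The saved storage state also records an `expires` timestamp per cookie (Unix seconds, with -1 for session cookies), so a more direct check is possible. The sketch below assumes you know which cookie names carry the login for your target site. The `required` list, the function names, and the five-minute margin are all illustrative assumptions.

```python
import time
from typing import Iterable, Optional


def unexpired_cookie_names(state: dict, now: Optional[float] = None,
                           margin_seconds: float = 300.0) -> set[str]:
    """Return names of cookies that will still be valid `margin_seconds` from now."""
    now = time.time() if now is None else now
    names = set()
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        # -1 marks a session cookie, which has no fixed expiry time
        if expires == -1 or expires > now + margin_seconds:
            names.add(cookie["name"])
    return names


def session_looks_valid(state: dict, required: Iterable[str],
                        now: Optional[float] = None) -> bool:
    """True if every required cookie is present and not about to expire."""
    return set(required) <= unexpired_cookie_names(state, now)
```

A caller could load the JSON written by `save()` with `json.loads(session_file.read_text())` and pass it in. The server can still invalidate a session early, so a check against a logged-in page remains the final authority.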

Pagination Strategies

Pagination comes in several forms. A robust scraping agent handles all of them.

"""
pagination_handler.py

Handles multiple pagination patterns:
1. Next/Previous button pagination
2. Page number links (1, 2, 3...)
3. Infinite scroll (content loads as you scroll)
4. Cursor-based pagination (API-like, common in newer apps)
5. Load more button
"""

import time
from typing import Generator, Callable
from playwright.sync_api import Page


class PaginationStrategy:
    """Base class for pagination strategies."""

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        raise NotImplementedError


class NextButtonPagination(PaginationStrategy):
    """Handles sites with a clickable 'Next' button."""

    NEXT_SELECTORS = [
        "a[aria-label='Next page']",
        "a[aria-label='Next']",
        "button[aria-label='Next']",
        ".next-page:not(.disabled)",
        "a.page-link[rel='next']",
        "[data-testid='pagination-next']",
    ]
    # Also try text matching
    NEXT_TEXT = ["Next", "Next Page", "›", "»", ">"]

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        page_num = 1
        while True:
            print(f"Scraping page {page_num}...")
            data = extract_fn(page)
            yield page_num, data

            if not self._click_next(page):
                print("No next page found, stopping")
                break

            page.wait_for_load_state("networkidle", timeout=15000)
            page_num += 1

    def _click_next(self, page: Page) -> bool:
        """Try to click the Next button. Returns True if successful."""
        for sel in self.NEXT_SELECTORS:
            try:
                btn = page.locator(sel).first
                if btn.is_visible(timeout=2000) and btn.is_enabled(timeout=1000):
                    btn.click()
                    time.sleep(0.5)
                    return True
            except Exception:
                continue

        for text in self.NEXT_TEXT:
            try:
                btn = page.locator(f"text={text}").first
                if btn.is_visible(timeout=2000):
                    btn.click()
                    time.sleep(0.5)
                    return True
            except Exception:
                continue

        return False


class InfiniteScrollPagination(PaginationStrategy):
    """Handles infinite scroll - content loads as user scrolls down."""

    def __init__(self, max_scrolls: int = 20, scroll_pause: float = 2.0):
        self.max_scrolls = max_scrolls
        self.scroll_pause = scroll_pause

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        prev_item_count = 0
        scroll_num = 0

        while scroll_num < self.max_scrolls:
            # Extract current items
            data = extract_fn(page)
            current_count = len(data) if isinstance(data, list) else 1
            yield scroll_num, data

            if current_count == prev_item_count:
                print("No new items loaded after scroll, stopping")
                break

            prev_item_count = current_count

            # Scroll to bottom
            prev_height = page.evaluate("document.body.scrollHeight")
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(self.scroll_pause)
            new_height = page.evaluate("document.body.scrollHeight")

            if new_height == prev_height:
                print("Page height unchanged after scroll, stopping")
                break

            scroll_num += 1


class LoadMorePagination(PaginationStrategy):
    """Handles pages with a 'Load More' button."""

    LOAD_MORE_SELECTORS = [
        "button:has-text('Load more')",
        "button:has-text('Load More')",
        "button:has-text('Show more')",
        "a:has-text('Load more')",
        "[data-testid='load-more']",
    ]

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        load_count = 0
        while True:
            data = extract_fn(page)
            yield load_count, data

            # Try to find and click Load More
            clicked = False
            for sel in self.LOAD_MORE_SELECTORS:
                try:
                    btn = page.locator(sel).first
                    if btn.is_visible(timeout=3000):
                        btn.click()
                        page.wait_for_load_state("networkidle", timeout=10000)
                        clicked = True
                        load_count += 1
                        break
                except Exception:
                    continue

            if not clicked:
                break


def detect_pagination_type(page: Page) -> str:
    """
    Detect which pagination pattern a page uses.
    Returns: 'next_button', 'infinite_scroll', 'load_more', 'numbered', 'unknown'
    """
    page_text = page.inner_text("body").lower()

    # Check for Load More button
    load_more_indicators = ["load more", "show more", "view more results"]
    for indicator in load_more_indicators:
        if indicator in page_text:
            return "load_more"

    # Check for Next button
    next_button_selectors = ["a[rel='next']", "[aria-label*='next']", ".next-page"]
    for sel in next_button_selectors:
        try:
            if page.locator(sel).first.is_visible(timeout=1000):
                return "next_button"
        except Exception:
            pass

    # Check for numbered pagination
    numbered_selectors = [".pagination", ".page-numbers", "nav[aria-label='pagination']"]
    for sel in numbered_selectors:
        try:
            if page.locator(sel).first.is_visible(timeout=1000):
                return "numbered"
        except Exception:
            pass

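    # Check for scroll-based loading. getEventListeners() exists only in
    # the DevTools console, not in page scripts, so probe behaviorally:
    # scroll to the bottom and see whether the document grows.
    prev_height = page.evaluate("document.body.scrollHeight")
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)
    new_height = page.evaluate("document.body.scrollHeight")
    if new_height > prev_height:
        return "infinite_scroll"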

    return "unknown"
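
The module docstring lists cursor-based pagination, but none of the classes above covers it. A minimal sketch follows, assuming the site exposes a JSON endpoint whose responses contain an item list and an opaque cursor for the next page. The key names `items` and `next_cursor` and the `fetch_page` callable are assumptions, since real APIs name these differently.

```python
from typing import Callable, Iterator, Optional


def iterate_cursor_pages(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 50,
    items_key: str = "items",
    cursor_key: str = "next_cursor",
) -> Iterator[list]:
    """Yield item lists, following the cursor until it is missing or repeats."""
    cursor: Optional[str] = None   # None requests the first page
    seen_cursors = set()
    for _ in range(max_pages):
        payload = fetch_page(cursor)
        yield payload.get(items_key, [])
        cursor = payload.get(cursor_key)
        if not cursor or cursor in seen_cursors:
            break  # no further pages, or the server handed back an old cursor
        seen_cursors.add(cursor)
```

`fetch_page` is left abstract so the loop can be tested with canned responses. In practice it would wrap whatever HTTP call returns the page for a given cursor.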

Complete Scraping Agent Implementation

The full scraping agent below combines authentication, pagination, LLM-based extraction, and error handling.

"""
scraping_agent.py

Production-ready web scraping agent using:
- Anthropic Claude for navigation reasoning
- Playwright for browser control
- Pydantic for structured data validation
- Per-page error handling (a failed page is counted and skipped)
"""

import anthropic
import json
import time
import re
from pathlib import Path
from typing import Optional
from pydantic import BaseModel, Field, field_validator
from playwright.sync_api import sync_playwright, Page, TimeoutError as PWTimeout


# --- Data Models ---

class ProductListing(BaseModel):
    """A scraped product listing with validation."""
    name: str = Field(min_length=1, max_length=500)
    price: float = Field(gt=0)
    currency: str = "USD"
    url: Optional[str] = None
    image_url: Optional[str] = None
    rating: Optional[float] = Field(default=None, ge=0, le=5)
    review_count: Optional[int] = Field(default=None, ge=0)
    availability: str = "unknown"
    seller: Optional[str] = None
    sku: Optional[str] = None
    category: Optional[str] = None
    scraped_at: float = Field(default_factory=time.time)

    @field_validator("price", mode="before")
    @classmethod
    def parse_price(cls, v):
        if isinstance(v, str):
            # Remove currency symbols and commas
            clean = re.sub(r"[^\d.]", "", v)
            return float(clean) if clean else 0.0
        return v


class ScrapingResult(BaseModel):
    """Complete scraping run result."""
    products: list[ProductListing]
    page_count: int
    total_scraped: int
    failed_pages: int
    duration_seconds: float
    url: str
    query: Optional[str] = None


# --- LLM-Based Extractor ---

class LLMExtractor:
    """
    Uses Claude to extract structured data from page HTML or text.
    More resilient to layout changes than CSS selectors.
    """

    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def extract_products(self, page_content: str,
                         url: str) -> list[ProductListing]:
        """
        Extract product listings from page content using Claude.
        page_content: HTML source or text content of the page.
        """
        # Truncate to stay within token limits. This keeps only the first
        # 15,000 characters, so products further down the page are dropped.
        content = page_content[:15000]

        prompt = f"""Extract all product listings from this page content.
URL: {url}

For each product, extract:
- name: full product name
- price: numeric price value
- currency: currency code (USD, EUR, GBP, etc.)
- url: product page URL (absolute if possible)
- rating: numeric rating (0-5) if available
- review_count: number of reviews if available
- availability: "in_stock", "out_of_stock", or "unknown"
- seller: seller name if available

Return a JSON array of products:
[
  {{
    "name": "Product Name",
    "price": 99.99,
    "currency": "USD",
    "url": "https://...",
    "rating": 4.5,
    "review_count": 1234,
    "availability": "in_stock",
    "seller": "Seller Name"
  }}
]

If no products found, return an empty array: []
Return ONLY the JSON array, no other text.

Page content:
{content}"""

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}]
        )

        response_text = response.content[0].text.strip()

        # Extract JSON from response
        json_match = re.search(r'\[[\s\S]*\]', response_text)
        if not json_match:
            return []

        try:
            raw_products = json.loads(json_match.group())
            validated = []
            for p in raw_products:
                try:
                    product = ProductListing(**p)
                    validated.append(product)
                except Exception as e:
                    print(f"  Skipping invalid product: {e}")
            return validated
        except json.JSONDecodeError as e:
            print(f"  JSON parse error: {e}")
            return []

    def decide_next_action(self, page_text: str, task: str,
                           current_url: str) -> dict:
        """
        Ask Claude what to do next given current page state and task.
        Returns: {"action": "click"|"navigate"|"scroll"|"done"|"error",
                  "target": selector or URL, "reason": str}
        """
        prompt = f"""You are navigating a website to complete this task: {task}

Current URL: {current_url}

Current page content (first 3000 chars):
{page_text[:3000]}

What should you do next? Choose one action:
- click: click on an element (provide CSS selector or visible text as target)
- navigate: go to a URL (provide full URL as target)
- scroll: scroll down to load more content
- done: task is complete (results have been extracted)
- error: task cannot be completed (explain in reason)

Respond in JSON:
{{
  "action": "click",
  "target": "selector or URL or 'down' for scroll",
  "reason": "brief explanation"
}}"""

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )

        try:
            json_match = re.search(r'\{[\s\S]*\}', response.content[0].text)
            if json_match:
                return json.loads(json_match.group())
        except Exception:
            pass

        return {"action": "error", "target": None, "reason": "Could not parse decision"}


# --- Main Scraping Agent ---

class ScrapingAgent:
    """
    Full scraping agent with:
    - Auth handling
    - Adaptive navigation
    - Pagination
    - Structured extraction
    - Error recovery
    - Rate limiting
    """

    def __init__(self, api_key: str, politeness_delay: float = 2.0):
        self.api_key = api_key
        self.client = anthropic.Anthropic(api_key=api_key)
        self.extractor = LLMExtractor(api_key=api_key)
        self.politeness_delay = politeness_delay

    def _apply_rate_limit(self):
        """Respect rate limits with a human-like delay."""
        import random
        base = self.politeness_delay
        jitter = random.uniform(0, base * 0.5)
        time.sleep(base + jitter)

    def _get_page_content(self, page: Page) -> str:
        """Get cleaned page text content (not raw HTML)."""
        try:
            # innerText is rendering-aware and already omits <script> and
            # <style> contents, so the live DOM does not need to be mutated
            content = page.evaluate(
                "() => document.body.innerText || ''"
            )
            return content or ""
        except Exception:
            return ""

    def _get_page_html(self, page: Page) -> str:
        """Get the full page HTML (extract_products truncates it)."""
        try:
            return page.content()
        except Exception:
            return ""

    def scrape(
        self,
        start_url: str,
        search_query: Optional[str] = None,
        credentials: Optional[dict] = None,
        max_pages: int = 10,
        output_file: Optional[str] = None,
    ) -> ScrapingResult:
        """
        Main scraping entry point.

        Args:
            start_url: Where to start scraping
            search_query: If provided, search for this query first
            credentials: {'username': ..., 'password': ..., 'login_url': ...}
            max_pages: Maximum pages to scrape
            output_file: If provided, save results to this JSON file
        """
        start_time = time.time()
        all_products = []
        page_count = 0
        failed_pages = 0

        with sync_playwright() as playwright:
            browser = playwright.chromium.launch(
                headless=True,
                args=["--no-sandbox", "--disable-blink-features=AutomationControlled"]
            )

            context = browser.new_context(
                viewport={"width": 1280, "height": 720},
                user_agent=(
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/121.0.0.0 Safari/537.36"
                ),
                locale="en-US",
            )

            # Override webdriver detection
            context.add_init_script("""
                Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            """)

            page = context.new_page()

            # Step 1: Login if credentials provided
            if credentials:
                success = self._handle_login(page, credentials)
                if not success:
                    print("Warning: Login may have failed, continuing...")

            # Step 2: Navigate to start URL
            try:
                print(f"Navigating to: {start_url}")
                page.goto(start_url, wait_until="domcontentloaded", timeout=30000)
                page.wait_for_timeout(2000)
            except PWTimeout:
                return ScrapingResult(
                    products=[], page_count=0, total_scraped=0,
                    failed_pages=1, duration_seconds=time.time() - start_time,
                    url=start_url, query=search_query
                )

            # Step 3: Search if query provided
            if search_query:
                self._handle_search(page, search_query)

            # Step 4: Scrape pages with pagination
            while page_count < max_pages:
                page_count += 1
                current_url = page.url
                print(f"\nScraping page {page_count}: {current_url[:80]}")

                # Apply politeness delay
                if page_count > 1:
                    self._apply_rate_limit()

                # Wait for content to render
                page.wait_for_timeout(1500)

                # Extract products from current page
                try:
                    page_html = self._get_page_html(page)
                    products = self.extractor.extract_products(page_html, current_url)
                    print(f"  Extracted {len(products)} products")
                    all_products.extend(products)
                except Exception as e:
                    print(f"  Extraction error: {e}")
                    failed_pages += 1

                # Try to go to next page
                if not self._go_to_next_page(page):
                    print("  No more pages")
                    break

            browser.close()

        # Build result
        result = ScrapingResult(
            products=all_products,
            page_count=page_count,
            total_scraped=len(all_products),
            failed_pages=failed_pages,
            duration_seconds=time.time() - start_time,
            url=start_url,
            query=search_query,
        )

        # Save to file if requested
        if output_file:
            Path(output_file).write_text(
                result.model_dump_json(indent=2)
            )
            print(f"\nResults saved to: {output_file}")

        print(f"\nDone: {len(all_products)} products from {page_count} pages "
              f"in {result.duration_seconds:.1f}s")

        return result

    def _handle_login(self, page: Page, credentials: dict) -> bool:
        """Handle login flow."""
        login_url = credentials.get("login_url", "")
        if login_url:
            page.goto(login_url, wait_until="networkidle", timeout=20000)

        # Wait a moment for page to settle
        page.wait_for_timeout(1000)

        # Try to fill login form
        username_filled = False
        for sel in ["input[type=email]", "input[type=text]", "input[name*=user]", "#username"]:
            try:
                page.fill(sel, credentials["username"], timeout=3000)
                username_filled = True
                break
            except Exception:
                continue

        if not username_filled:
            print("Could not find username field")
            return False

        for sel in ["input[type=password]", "input[name=password]", "#password"]:
            try:
                page.fill(sel, credentials["password"], timeout=3000)
                break
            except Exception:
                continue

        # Submit
        for sel in ["button[type=submit]", "input[type=submit]",
                    "button:has-text('Sign in')", "button:has-text('Log in')"]:
            try:
                page.click(sel, timeout=3000)
                page.wait_for_load_state("networkidle", timeout=15000)
                return True
            except Exception:
                continue

        return False

    def _handle_search(self, page: Page, query: str) -> bool:
        """Enter a search query."""
        search_selectors = [
            "input[type=search]",
            "input[name=q]",
            "input[name=search]",
            "input[placeholder*='search' i]",
            "input[aria-label*='search' i]",
            "#search",
            ".search-input",
        ]

        for sel in search_selectors:
            try:
                page.fill(sel, query, timeout=3000)
                page.press(sel, "Enter")
                page.wait_for_load_state("networkidle", timeout=15000)
                print(f"Search submitted: '{query}'")
                return True
            except Exception:
                continue

        print(f"Warning: Could not find search field for query: '{query}'")
        return False

    def _go_to_next_page(self, page: Page) -> bool:
        """Try to navigate to the next page."""
        # Try common next page patterns
        next_selectors = [
            "a[rel='next']",
            "[aria-label='Next page']",
            "[aria-label='Next']",
            "a:has-text('Next')",
            "button:has-text('Next')",
            ".pagination-next:not(.disabled)",
            ".next:not(.disabled)",
        ]

        for sel in next_selectors:
            try:
                btn = page.locator(sel).first
                if btn.is_visible(timeout=2000) and btn.is_enabled(timeout=1000):
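                    href = btn.get_attribute("href")
                    if href and not href.startswith(("#", "javascript:")):
                        # page.goto() needs an absolute URL, and hrefs
                        # are often relative, so resolve against page.url
                        from urllib.parse import urljoin
                        page.goto(urljoin(page.url, href),
                                  wait_until="domcontentloaded", timeout=20000)
                    else:
                        btn.click()
                        page.wait_for_load_state("domcontentloaded", timeout=20000)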
                    return True
            except Exception:
                continue

        return False


# --- Example Usage ---

if __name__ == "__main__":
    import os

    agent = ScrapingAgent(
        api_key=os.environ["ANTHROPIC_API_KEY"],
        politeness_delay=2.5  # 2.5 second base delay between pages
    )

    # Scrape laptops from a test e-commerce site
    result = agent.scrape(
        start_url="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops",
        max_pages=3,
        output_file="/tmp/laptop_prices.json",
    )

    print(f"\nExtracted {result.total_scraped} products:")
    for p in result.products[:5]:  # Show first 5
        print(f"  {p.name[:60]:<60} ${p.price:.2f}")

    if len(result.products) > 5:
        print(f"  ... and {len(result.products) - 5} more")

:::warning Respect robots.txt

Before any scraping project, check robots.txt:

import urllib.robotparser

def can_scrape(base_url: str, path: str,
               user_agent: str = "*") -> bool:
    """Check robots.txt before scraping."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, f"{base_url}{path}")

# Usage
if not can_scrape("https://example.com", "/products"):
    print("robots.txt disallows this path")
else:
    # Proceed with scraping
    pass

Disregarding robots.txt may violate the site's Terms of Service and could lead to IP bans, legal action (in some jurisdictions), or permanent loss of access to the data source.

:::

:::danger Rate Limiting and IP Bans

Aggressive scraping without rate limiting will trigger IP bans. Production guidelines:

- Wait at least 1–3 seconds between page requests
- Add random jitter to delays (±50%) so requests cannot be fingerprinted by exact timing
- Respect Retry-After headers if you receive a 429 (Too Many Requests) response
- Rotate proxies if operating at scale (residential proxies for sites with anti-bot protection, datacenter proxies for open sites)
- Never scrape at full speed during business hours if the target site belongs to a small business

An IP ban from a critical data source can cripple a business workflow. Treat it as seriously as losing a production database connection.

:::
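
The delay guidelines above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not part of the ScrapingAgent class: `jittered_delay`, `retry_after_seconds`, and `next_wait` are hypothetical helper names, and the choice to wait at least the base delay after a 429 is an assumption rather than a rule from this lesson.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def jittered_delay(base_seconds: float, jitter: float = 0.5,
                   rng: Optional[random.Random] = None) -> float:
    """Return a delay drawn uniformly from base*(1-jitter) to base*(1+jitter)."""
    source = rng or random
    return base_seconds * source.uniform(1.0 - jitter, 1.0 + jitter)


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header (delta-seconds or an HTTP date).

    Returns the seconds to wait, or None if the header is missing or unparseable.
    """
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        # Treat a naive timestamp as UTC (assumption).
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


def next_wait(base_seconds: float, status: int,
              retry_after: Optional[str] = None) -> float:
    """Pick the wait before the next request: honor Retry-After on 429, else jitter."""
    if status == 429:
        server_wait = retry_after_seconds(retry_after)
        if server_wait is not None:
            # Never wait less than the server asked for.
            return max(server_wait, base_seconds)
    return jittered_delay(base_seconds)
```

In a scraping loop, the value returned by `next_wait` would be passed to `time.sleep()` before the next page request, with `status` and `retry_after` taken from the most recent response.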

Interview Questions and Answers

Q: When should you use an agent for web scraping instead of a traditional scraper with CSS selectors?

A: Use agents when: (1) the site uses heavy JavaScript rendering and CSS selectors break after React/Vue re-renders, (2) the site requires login and session management that breaks traditional cookie handling, (3) the site changes its layout frequently, making selector maintenance expensive, (4) the scraping workflow is conditional (different paths for different product categories), or (5) the site employs aggressive anti-bot measures that require adaptive, human-like behavior. Use traditional scrapers for static HTML, stable DOM structures, or high-volume extractions where LLM costs would be prohibitive.

Q: How do you handle session expiry in a long-running scraping agent?

A: Implement session validation before each scraping run: after loading the saved session, check for elements that only appear when authenticated (user menu, account icon). If the session appears expired, re-run the login flow. Use Playwright's context.storage_state(path=...) to save cookies and local storage after a successful login, and pass storage_state=path to browser.new_context() on the next run. For very long scraping runs, implement periodic session checks: after every N pages, request a page that requires authentication and verify that the response is the expected page rather than a login redirect.
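
The periodic check described in this answer can be sketched without tying it to a browser library. This is an illustrative sketch, not the lesson's implementation: `SessionGuard` and `looks_like_login_redirect` are hypothetical names, the login path markers are assumptions, and `check_session` and `login` are caller-supplied callables. In a Playwright-based agent, `check_session` might load an authenticated page and compare `page.url` with the requested URL, and `login` might re-run the login flow and save the storage state again.

```python
from typing import Callable, Sequence
from urllib.parse import urlparse


def looks_like_login_redirect(requested_url: str, final_url: str,
                              login_markers: Sequence[str] = ("login", "signin", "sign-in")) -> bool:
    """Heuristic: we asked for one path but landed on a login-looking path."""
    requested = urlparse(requested_url).path.lower()
    final = urlparse(final_url).path.lower()
    if requested == final:
        return False
    return any(marker in final for marker in login_markers)


class SessionGuard:
    """Re-validate the session every `check_every` pages and re-login if needed."""

    def __init__(self, check_session: Callable[[], bool],
                 login: Callable[[], None], check_every: int = 25):
        self.check_session = check_session
        self.login = login
        self.check_every = check_every
        self.pages_since_check = 0
        self.relogins = 0

    def before_page(self) -> None:
        """Call once before each page is scraped."""
        self.pages_since_check += 1
        if self.pages_since_check < self.check_every:
            return
        self.pages_since_check = 0
        if not self.check_session():
            self.login()
            self.relogins += 1
            # Fail loudly rather than scrape login pages as if they were data.
            if not self.check_session():
                raise RuntimeError("Session still invalid after re-login")
```

Raising after a failed re-login is a deliberate choice in this sketch: silently continuing would fill the output with empty or login-page records.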

Q: Describe your approach to handling different pagination patterns in a scraping agent.

A: Use a multi-strategy detection and execution approach: (1) check for <a rel="next"> (most reliable indicator), (2) look for common Next button selectors by CSS class or ARIA label, (3) try text matching for "Next", "›", etc., (4) detect infinite scroll by comparing document.body.scrollHeight before and after scrolling, (5) detect Load More buttons. After extracting each page, the agent tries each strategy in order and stops at the first one that succeeds. If every selector-based strategy fails, fall back to LLM reasoning to locate the next-page control. If that also finds nothing, treat the crawl as complete and report it.
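
The ordered strategy chain from this answer can be sketched as a small dispatcher. This is a minimal sketch with hypothetical names (`advance_page`, `scroll_loaded_more`); each strategy is a caller-supplied callable that returns True if it moved to the next page. In a Playwright-based agent, `get_height` could wrap `page.evaluate("document.body.scrollHeight")`, and a real infinite-scroll check would also need to wait for new content to load before measuring again.

```python
from typing import Callable, List, Optional, Tuple

# A strategy is a (name, attempt) pair; attempt() returns True on success.
Strategy = Tuple[str, Callable[[], bool]]


def advance_page(strategies: List[Strategy],
                 llm_fallback: Optional[Callable[[], bool]] = None) -> Optional[str]:
    """Try each strategy in order and return the name of the first that advanced.

    Returns None when nothing worked, which the caller treats as "last page".
    """
    for name, attempt in strategies:
        try:
            if attempt():
                return name
        except Exception:
            # A failing strategy should not abort the rest of the chain.
            continue
    if llm_fallback is not None and llm_fallback():
        return "llm_fallback"
    return None


def scroll_loaded_more(get_height: Callable[[], int],
                       scroll_to_bottom: Callable[[], None]) -> bool:
    """Infinite-scroll check: did the page grow after scrolling to the bottom?"""
    before = get_height()
    scroll_to_bottom()
    return get_height() > before
```

Returning the strategy name rather than a bare boolean makes it easy to log which pattern a site uses, which helps when a site later changes its pagination.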

Q: How do you validate scraped data quality in a production scraping pipeline?

A: Use Pydantic models for structural validation (type checking, value ranges, required fields). Beyond Pydantic: (1) validate numeric fields against expected ranges (prices within reasonable bounds, ratings between 0 and 5), (2) check URL validity for product URLs, (3) compare the extracted count against the expected count (if pagination says "1,234 results" but you extracted 50, investigate), (4) manually spot-check a random sample of records, (5) track extraction rate over time - a sudden drop in products-per-page indicates a scraper break. Alert on deviations greater than 20% from the historical average.
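
The checks that go beyond structural validation can be sketched in plain Python. This sketch deliberately uses the standard library instead of Pydantic so it runs without dependencies; the function names, the `max_price` bound, and the 5% count tolerance are assumptions chosen for illustration, while the 20% threshold comes from the answer above.

```python
from statistics import mean
from typing import List, Optional
from urllib.parse import urlparse


def record_problems(name: str, price: float, rating: Optional[float], url: str,
                    max_price: float = 100_000.0) -> List[str]:
    """Return range/format problems for one scraped record (empty list if clean)."""
    problems = []
    if not name.strip():
        problems.append("empty name")
    if not (0 < price <= max_price):
        problems.append(f"price out of range: {price}")
    if rating is not None and not (0 <= rating <= 5):
        problems.append(f"rating out of range: {rating}")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"invalid url: {url}")
    return problems


def count_mismatch(expected_total: int, extracted: int,
                   tolerance: float = 0.05) -> bool:
    """True if the extracted count falls short of the site's reported total."""
    if expected_total <= 0:
        return False
    return extracted < expected_total * (1 - tolerance)


def rate_deviates(history: List[float], current: float,
                  threshold: float = 0.20) -> bool:
    """True if products-per-page deviates from the historical mean by more than threshold."""
    if not history:
        return False
    baseline = mean(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > threshold
```

For example, `rate_deviates([24, 24, 24], 12)` flags a page that yields half the usual number of products, which is the pattern a broken selector typically produces.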

Q: What are the legal and ethical constraints on web scraping, and how do they affect architectural decisions?

A: Key constraints: (1) robots.txt - check before scraping, respect Disallow directives; (2) Terms of Service - many sites explicitly prohibit automated access; (3) GDPR/CCPA - personal data of EU and California residents is subject to specific restrictions on collection and storage; (4) Copyright - scraped content may be copyrighted; transformation for analysis is generally acceptable, but republication generally is not; (5) Rate limiting - aggressive scraping can amount to a denial-of-service attack, especially against smaller sites. Architectural implications: always check robots.txt programmatically before scraping any path; implement configurable rate limiting; avoid storing personal data beyond what's needed for the task; document the legal basis for each scraping operation; prefer official data exports or APIs when offered.
