Web Scraping Agents
When the Scraper Breaks at Midnight
It is 2:47 AM when the Slack alert fires. The competitive intelligence pipeline has failed. Twelve hours of product pricing data is missing from the dashboard that 200 sales reps will open at 8 AM.
The traditional scraper - 3,000 lines of Python with hardcoded CSS selectors, manual cookie handling, and a pile of time.sleep() calls - has broken. Again. The e-commerce site being scraped rolled out a UI update at midnight. Three critical selectors now point to elements that no longer exist. The price container changed from .product-price to [data-testid="pricing-display"]. The pagination button moved inside a new wrapper. The login form now has a dynamic CSRF token with a 60-second expiry.
This happens. It happens regularly. And the traditional response - wake up an engineer, have them inspect the new DOM, update the selectors, redeploy - is expensive and fragile. The next UI update will break it again.
Agent-based scraping solves this differently. Instead of brittle selectors, the agent uses a combination of Playwright's browser control and LLM reasoning to navigate the site as a human would. When the pricing element moves, the agent finds it by visual context and label text rather than CSS class. When pagination changes, the agent looks for "Next" by meaning rather than selector. It is slower and more expensive per page than a direct scraper - but it keeps working after UI changes.
This lesson covers building production-ready scraping agents.
When to Use Agent-Based Scraping
Not every scraping task needs an agent. Traditional scrapers are faster, cheaper, and simpler when the target is stable. Use agents selectively.
Use traditional scraping (requests + BeautifulSoup) for: static HTML pages, sites whose pages are thin wrappers over a documented API, RSS feeds, and sitemaps.
Use Playwright with CSS selectors for: JavaScript-rendered SPAs with stable DOM structure, login flows to known services where you control the auth credentials.
Use agent-based scraping for: sites that change frequently, multi-step workflows with conditional paths, sites with CAPTCHAs or aggressive anti-bot measures, legacy sites with unpredictable HTML, or any site where selector maintenance is becoming expensive.
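The three rules above can be read as a small decision function. The sketch below is only an illustration: `TargetProfile` and its fields are hypothetical names invented here, and real targets rarely reduce to four booleans.

```python
from dataclasses import dataclass


@dataclass
class TargetProfile:
    """Hypothetical description of a scraping target (all fields are assumptions)."""
    needs_javascript: bool          # content is rendered client-side
    layout_changes_often: bool      # selectors break frequently
    has_conditional_workflow: bool  # multi-step flow with branching paths
    requires_login: bool = False


def choose_approach(target: TargetProfile) -> str:
    """Return 'agent', 'playwright', or 'requests' following the rules above."""
    # Frequent layout changes or branching workflows justify the agent's cost.
    if target.layout_changes_often or target.has_conditional_workflow:
        return "agent"
    # Stable DOM, but client-side rendering or a login flow: a real browser
    # driven by plain CSS selectors is enough.
    if target.needs_javascript or target.requires_login:
        return "playwright"
    # Static, stable HTML: the cheapest tool wins.
    return "requests"
```

Checking the agent conditions first reflects the point of this section: selector maintenance cost, not rendering technology, is what tips the balance toward an agent.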
Handling JavaScript Rendering
JavaScript-rendered sites are the dominant reason to choose Playwright over requests for modern web scraping.
"""
Why JavaScript rendering matters for scraping.
"""
# This will get an empty product list from a React SPA:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://spa-ecommerce-site.com/laptops")
soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all(class_="product-card")
print(f"Found {len(products)} products") # Prints: Found 0 products
# Because React hasn't rendered yet - requests gets the HTML skeleton,
# not the fully rendered page.
# This correctly waits for JavaScript to render the content:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-ecommerce-site.com/laptops")
    # Wait for the specific element that proves rendering is complete
    page.wait_for_selector(".product-card", timeout=15000)
    # NOW we can get the rendered content
    products = page.query_selector_all(".product-card")
    print(f"Found {len(products)} products")  # Prints correct number
    browser.close()
Key Playwright patterns for JavaScript-heavy sites:
from playwright.sync_api import Page
def wait_for_render(page: Page, timeout: int = 15000) -> bool:
    """
    Wait for various signals that JS rendering is complete.
    Returns True if a signal was observed, False if only the fixed
    fallback wait ran (the page may still be incomplete).
    """
    try:
        # Strategy 1: Wait for network to be idle (no pending XHR/fetch)
        page.wait_for_load_state("networkidle", timeout=timeout)
        return True
    except Exception:
        pass
    try:
        # Strategy 2: Wait for a specific element that appears after render
        page.wait_for_selector("[data-loaded='true']", timeout=timeout // 2)
        return True
    except Exception:
        pass
    try:
        # Strategy 3: Wait for React/Vue hydration marker
        page.wait_for_function(
            "() => document.querySelector('[data-react-hydrated]') !== null",
            timeout=timeout // 2
        )
        return True
    except Exception:
        pass
    # Fallback: just wait a fixed time; no positive signal was seen
    page.wait_for_timeout(3000)
    return False


def extract_after_scroll(page: Page, item_selector: str) -> list:
    """Extract items that load progressively as you scroll (infinite scroll)."""
    all_items = set()
    prev_count = 0
    for _ in range(10):  # Max 10 scroll attempts
        # Get current items, keyed by href (or visible text if there is no href)
        items = page.query_selector_all(item_selector)
        current_keys = set()
        for item in items:
            key = item.get_attribute("href") or item.inner_text()
            current_keys.add(key)
        all_items.update(current_keys)
        if len(all_items) == prev_count:
            break  # No new items loaded
        prev_count = len(all_items)
        # Scroll to bottom
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # Wait for new items to load
    return list(all_items)
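The fixed two-second pause in the scroll loop above either wastes time or is too short on a slow connection. One alternative, sketched here, is to poll a condition until it holds or a deadline passes. `wait_until` is a hypothetical helper written for this lesson, not part of Playwright.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False  # gave up: condition never became true
        time.sleep(interval)
```

In the scroll loop, the fixed pause could then become something like `wait_until(lambda: len(page.query_selector_all(item_selector)) > prev_count, timeout=5)`, which returns as soon as new items appear.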
Login and Session Management
Many valuable data sources require authentication. Agent-based scraping handles login flows that would break traditional scrapers.
"""
session_manager.py
Robust session management for authenticated scraping.
Handles: form login, cookie persistence, session validation, 2FA detection.
"""
import json
import time
from pathlib import Path
from typing import Optional
from playwright.sync_api import sync_playwright, Page, BrowserContext
class AuthenticatedSession:
    """
    Manages authenticated browser sessions with persistence.
    Avoids re-logging in on every scrape run.
    """

    def __init__(self, session_name: str,
                 session_dir: str = "/tmp/scrape_sessions"):
        self.session_name = session_name
        self.session_file = Path(session_dir) / f"{session_name}.json"
        self.session_file.parent.mkdir(parents=True, exist_ok=True)

    def save(self, context: BrowserContext) -> None:
        state = context.storage_state()
        self.session_file.write_text(json.dumps(state))
        print(f"Session saved: {self.session_file}")

    def load_context(self, browser, viewport=None):
        """Create a context with saved session, or fresh if none exists."""
        kwargs = {"viewport": viewport or {"width": 1280, "height": 720}}
        if self.session_file.exists():
            kwargs["storage_state"] = str(self.session_file)
            print(f"Loading saved session: {self.session_name}")
        else:
            print("No saved session, starting fresh")
        return browser.new_context(**kwargs)

    def is_valid(self) -> bool:
        """Check if session file exists and is recent."""
        if not self.session_file.exists():
            return False
        # Sessions older than 12 hours are likely expired
        age = time.time() - self.session_file.stat().st_mtime
        return age < 43200  # 12 hours
def perform_login(page: Page, username: str, password: str,
                  login_url: str) -> bool:
    """
    Attempt to log into a site.
    Returns True if login appears successful.
    """
    page.goto(login_url, wait_until="networkidle")
    # Common form selectors (try each until one works)
    username_selectors = [
        "input[type=email]",
        "input[type=text][name*=user]",
        "input[name=email]",
        "input[name=username]",
        "input[id*=email]",
        "input[id*=user]",
        "#username",
        "#email",
    ]
    password_selectors = [
        "input[type=password]",
        "input[name=password]",
        "input[id*=password]",
        "#password",
    ]
    submit_selectors = [
        "button[type=submit]",
        "input[type=submit]",
        "button:has-text('Sign in')",
        "button:has-text('Log in')",
        "button:has-text('Login')",
        ".login-button",
        "#login-btn",
    ]
    # Fill username
    for sel in username_selectors:
        try:
            page.fill(sel, username, timeout=3000)
            break
        except Exception:
            continue
    # Fill password
    for sel in password_selectors:
        try:
            page.fill(sel, password, timeout=3000)
            break
        except Exception:
            continue
    # Submit
    for sel in submit_selectors:
        try:
            page.click(sel, timeout=3000)
            page.wait_for_load_state("networkidle", timeout=15000)
            break
        except Exception:
            continue
    # Verify login success: first rule out common failure messages
    failure_indicators = [
        "Invalid credentials",
        "Login failed",
        "Incorrect password",
        "We couldn't find",
        "Please try again",
    ]
    page_text = page.inner_text("body")
    for indicator in failure_indicators:
        if indicator.lower() in page_text.lower():
            print(f"Login failed: found '{indicator}' on page")
            return False
    # Check for successful login indicators
    success_indicators = [
        page.url != login_url,  # Redirected away from login page
        page.query_selector(".dashboard") is not None,
        page.query_selector("[data-testid='user-menu']") is not None,
    ]
    return any(success_indicators)
def detect_2fa(page: Page) -> Optional[str]:
    """
    Detect if a 2FA challenge is present after login.
    Returns the type of 2FA or None.
    """
    two_fa_indicators = [
        ("sms", ["Enter the code", "SMS code", "text message"]),
        ("totp", ["Authenticator app", "TOTP", "6-digit code"]),
        ("email", ["Check your email", "email code", "confirmation link"]),
        ("captcha", ["reCAPTCHA", "hCaptcha", "verify you're human"]),
    ]
    page_text = page.inner_text("body").lower()
    for fa_type, indicators in two_fa_indicators:
        for indicator in indicators:
            if indicator.lower() in page_text:
                return fa_type
    return None
Pagination Strategies
Pagination comes in several forms. A robust scraping agent handles all of them.
"""
pagination_handler.py
Handles multiple pagination patterns:
1. Next/Previous button pagination
2. Page number links (1, 2, 3...)
3. Infinite scroll (content loads as you scroll)
4. Cursor-based pagination (API-like, common in newer apps)
5. Load more button
"""
import time
from typing import Generator, Callable
from playwright.sync_api import Page
class PaginationStrategy:
    """Base class for pagination strategies."""

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        raise NotImplementedError


class NextButtonPagination(PaginationStrategy):
    """Handles sites with a clickable 'Next' button."""

    NEXT_SELECTORS = [
        "a[aria-label='Next page']",
        "a[aria-label='Next']",
        "button[aria-label='Next']",
        ".next-page:not(.disabled)",
        "a.page-link[rel='next']",
        "[data-testid='pagination-next']",
    ]
    # Also try text matching
    NEXT_TEXT = ["Next", "Next Page", "›", "»", ">"]

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        page_num = 1
        while True:
            print(f"Scraping page {page_num}...")
            data = extract_fn(page)
            yield page_num, data
            if not self._click_next(page):
                print("No next page found, stopping")
                break
            page.wait_for_load_state("networkidle", timeout=15000)
            page_num += 1

    def _click_next(self, page: Page) -> bool:
        """Try to click the Next button. Returns True if successful."""
        for sel in self.NEXT_SELECTORS:
            try:
                btn = page.locator(sel).first
                if btn.is_visible(timeout=2000) and btn.is_enabled(timeout=1000):
                    btn.click()
                    time.sleep(0.5)
                    return True
            except Exception:
                continue
        for text in self.NEXT_TEXT:
            try:
                btn = page.locator(f"text={text}").first
                if btn.is_visible(timeout=2000):
                    btn.click()
                    time.sleep(0.5)
                    return True
            except Exception:
                continue
        return False
class InfiniteScrollPagination(PaginationStrategy):
    """Handles infinite scroll - content loads as user scrolls down."""

    def __init__(self, max_scrolls: int = 20, scroll_pause: float = 2.0):
        self.max_scrolls = max_scrolls
        self.scroll_pause = scroll_pause

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        prev_item_count = 0
        scroll_num = 0
        while scroll_num < self.max_scrolls:
            # Extract current items
            data = extract_fn(page)
            current_count = len(data) if isinstance(data, list) else 1
            yield scroll_num, data
            if current_count == prev_item_count:
                print("No new items loaded after scroll, stopping")
                break
            prev_item_count = current_count
            # Scroll to bottom
            prev_height = page.evaluate("document.body.scrollHeight")
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(self.scroll_pause)
            new_height = page.evaluate("document.body.scrollHeight")
            if new_height == prev_height:
                print("Page height unchanged after scroll, stopping")
                break
            scroll_num += 1
class LoadMorePagination(PaginationStrategy):
    """Handles pages with a 'Load More' button."""

    LOAD_MORE_SELECTORS = [
        "button:has-text('Load more')",
        "button:has-text('Load More')",
        "button:has-text('Show more')",
        "a:has-text('Load more')",
        "[data-testid='load-more']",
    ]

    def get_pages(self, page: Page, extract_fn: Callable) -> Generator:
        load_count = 0
        while True:
            data = extract_fn(page)
            yield load_count, data
            # Try to find and click Load More
            clicked = False
            for sel in self.LOAD_MORE_SELECTORS:
                try:
                    btn = page.locator(sel).first
                    if btn.is_visible(timeout=3000):
                        btn.click()
                        page.wait_for_load_state("networkidle", timeout=10000)
                        clicked = True
                        load_count += 1
                        break
                except Exception:
                    continue
            if not clicked:
                break
def detect_pagination_type(page: Page) -> str:
    """
    Detect which pagination pattern a page uses.
    Returns: 'next_button', 'infinite_scroll', 'load_more', 'numbered', 'unknown'
    """
    page_text = page.inner_text("body").lower()
    # Check for Load More button
    load_more_indicators = ["load more", "show more", "view more results"]
    for indicator in load_more_indicators:
        if indicator in page_text:
            return "load_more"
    # Check for Next button
    next_button_selectors = ["a[rel='next']", "[aria-label*='next']", ".next-page"]
    for sel in next_button_selectors:
        try:
            if page.locator(sel).first.is_visible(timeout=1000):
                return "next_button"
        except Exception:
            pass
    # Check for numbered pagination
    numbered_selectors = [".pagination", ".page-numbers", "nav[aria-label='pagination']"]
    for sel in numbered_selectors:
        try:
            if page.locator(sel).first.is_visible(timeout=1000):
                return "numbered"
        except Exception:
            pass
    # Check for scroll-based loading. getEventListeners() exists only in the
    # DevTools console, not in page scripts, so probe instead: scroll to the
    # bottom once and see whether the document grows. Note that this probe
    # may itself trigger one batch of content to load.
    prev_height = page.evaluate("document.body.scrollHeight")
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    page.wait_for_timeout(2000)
    new_height = page.evaluate("document.body.scrollHeight")
    page.evaluate("window.scrollTo(0, 0)")  # Restore scroll position
    if new_height > prev_height:
        return "infinite_scroll"
    return "unknown"
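Cursor-based pagination appears in the list of patterns at the top of this module but has no handler. A minimal sketch of the loop follows. It assumes a hypothetical `fetch_page(cursor)` callable that returns a dict with `items` and `next_cursor` keys; real APIs name these fields differently, so treat the shape as a placeholder.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_pages(fetch_page: Callable[[Optional[str]], dict],
                      max_pages: int = 50) -> Iterator[list]:
    """Yield each page's items, following the cursor until it runs out."""
    cursor: Optional[str] = None
    seen_cursors = set()
    for _ in range(max_pages):
        response = fetch_page(cursor)
        yield response.get("items", [])
        cursor = response.get("next_cursor")
        # Stop on the last page, or if the server repeats a cursor (loop guard).
        if not cursor or cursor in seen_cursors:
            break
        seen_cursors.add(cursor)
```

In a browser-driven scraper, `fetch_page` would typically wrap whatever request the page itself makes for the next batch. The `max_pages` cap and the repeated-cursor check are there so a misbehaving endpoint cannot keep the loop running forever.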
Complete Scraping Agent Implementation
The following module puts the pieces together into a complete scraping agent with auth, pagination, extraction, and error handling.
"""
scraping_agent.py
Production-ready web scraping agent using:
- Anthropic Claude for navigation reasoning
- Playwright for browser control
- Pydantic for structured data validation
- Automatic retry and error recovery
"""
import anthropic
import json
import time
import re
from pathlib import Path
from typing import Optional
from pydantic import BaseModel, Field, field_validator
from playwright.sync_api import sync_playwright, Page, TimeoutError as PWTimeout
# --- Data Models ---
class ProductListing(BaseModel):
    """A scraped product listing with validation."""
    name: str = Field(min_length=1, max_length=500)
    price: float = Field(gt=0)
    currency: str = "USD"
    url: Optional[str] = None
    image_url: Optional[str] = None
    rating: Optional[float] = Field(default=None, ge=0, le=5)
    review_count: Optional[int] = Field(default=None, ge=0)
    availability: str = "unknown"
    seller: Optional[str] = None
    sku: Optional[str] = None
    category: Optional[str] = None
    scraped_at: float = Field(default_factory=time.time)

    @field_validator("price", mode="before")
    @classmethod
    def parse_price(cls, v):
        if isinstance(v, str):
            # Remove currency symbols and commas
            clean = re.sub(r"[^\d.]", "", v)
            return float(clean) if clean else 0.0
        return v


class ScrapingResult(BaseModel):
    """Complete scraping run result."""
    products: list[ProductListing]
    page_count: int
    total_scraped: int
    failed_pages: int
    duration_seconds: float
    url: str
    query: Optional[str] = None
# --- LLM-Based Extractor ---
class LLMExtractor:
    """
    Uses Claude to extract structured data from page HTML/screenshots.
    More resilient to layout changes than CSS selectors.
    """

    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def extract_products(self, page_content: str,
                         url: str) -> list[ProductListing]:
        """
        Extract product listings from page content using Claude.
        page_content: HTML source or text content of the page.
        """
        # Truncate to avoid token limits. This keeps only the first 15,000
        # characters, so products further down a long page are not seen.
        content = page_content[:15000]
        prompt = f"""Extract all product listings from this page content.
URL: {url}
For each product, extract:
- name: full product name
- price: numeric price value
- currency: currency code (USD, EUR, GBP, etc.)
- url: product page URL (absolute if possible)
- rating: numeric rating (0-5) if available
- review_count: number of reviews if available
- availability: "in_stock", "out_of_stock", or "unknown"
- seller: seller name if available
Return a JSON array of products:
[
  {{
    "name": "Product Name",
    "price": 99.99,
    "currency": "USD",
    "url": "https://...",
    "rating": 4.5,
    "review_count": 1234,
    "availability": "in_stock",
    "seller": "Seller Name"
  }}
]
If no products found, return an empty array: []
Return ONLY the JSON array, no other text.
Page content:
{content}"""
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}]
        )
        response_text = response.content[0].text.strip()
        # Extract JSON from response
        json_match = re.search(r'\[[\s\S]*\]', response_text)
        if not json_match:
            return []
        try:
            raw_products = json.loads(json_match.group())
            validated = []
            for p in raw_products:
                try:
                    product = ProductListing(**p)
                    validated.append(product)
                except Exception as e:
                    print(f"  Skipping invalid product: {e}")
            return validated
        except json.JSONDecodeError as e:
            print(f"  JSON parse error: {e}")
            return []

    def decide_next_action(self, page_text: str, task: str,
                           current_url: str) -> dict:
        """
        Ask Claude what to do next given current page state and task.
        Returns: {"action": "click"|"navigate"|"scroll"|"done"|"error",
                  "target": selector or URL, "reason": str}
        """
        prompt = f"""You are navigating a website to complete this task: {task}
Current URL: {current_url}
Current page content (first 3000 chars):
{page_text[:3000]}
What should you do next? Choose one action:
- click: click on an element (provide CSS selector or visible text as target)
- navigate: go to a URL (provide full URL as target)
- scroll: scroll down to load more content
- done: task is complete (results have been extracted)
- error: task cannot be completed (explain in reason)
Respond in JSON:
{{
  "action": "click",
  "target": "selector or URL or 'down' for scroll",
  "reason": "brief explanation"
}}"""
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )
        try:
            json_match = re.search(r'\{[\s\S]*\}', response.content[0].text)
            if json_match:
                return json.loads(json_match.group())
        except Exception:
            pass
        return {"action": "error", "target": None, "reason": "Could not parse decision"}
# --- Main Scraping Agent ---
class ScrapingAgent:
    """
    Full scraping agent with:
    - Auth handling
    - Adaptive navigation
    - Pagination
    - Structured extraction
    - Error recovery
    - Rate limiting
    """

    def __init__(self, api_key: str, politeness_delay: float = 2.0):
        self.api_key = api_key
        self.client = anthropic.Anthropic(api_key=api_key)
        self.extractor = LLMExtractor(api_key=api_key)
        self.politeness_delay = politeness_delay

    def _apply_rate_limit(self):
        """Respect rate limits with a human-like delay."""
        import random
        base = self.politeness_delay
        jitter = random.uniform(0, base * 0.5)
        time.sleep(base + jitter)

    def _get_page_content(self, page: Page) -> str:
        """Get cleaned page text content (not raw HTML)."""
        try:
            # Get text content (much smaller than raw HTML).
            # Note: this removes script/style nodes from the live page.
            content = page.evaluate("""
                () => {
                    // Remove scripts and styles
                    const scripts = document.querySelectorAll('script, style');
                    scripts.forEach(s => s.remove());
                    return document.body.innerText || document.body.textContent;
                }
            """)
            return content or ""
        except Exception:
            return ""

    def _get_page_html(self, page: Page) -> str:
        """Get the full page HTML (LLMExtractor truncates it before prompting)."""
        try:
            return page.content()
        except Exception:
            return ""
    def scrape(
        self,
        start_url: str,
        search_query: Optional[str] = None,
        credentials: Optional[dict] = None,
        max_pages: int = 10,
        output_file: Optional[str] = None,
    ) -> ScrapingResult:
        """
        Main scraping entry point.
        Args:
            start_url: Where to start scraping
            search_query: If provided, search for this query first
            credentials: {'username': ..., 'password': ..., 'login_url': ...}
            max_pages: Maximum pages to scrape
            output_file: If provided, save results to this JSON file
        """
        start_time = time.time()
        all_products = []
        page_count = 0
        failed_pages = 0
        with sync_playwright() as playwright:
            browser = playwright.chromium.launch(
                headless=True,
                args=["--no-sandbox", "--disable-blink-features=AutomationControlled"]
            )
            context = browser.new_context(
                viewport={"width": 1280, "height": 720},
                user_agent=(
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/121.0.0.0 Safari/537.36"
                ),
                locale="en-US",
            )
            # Override webdriver detection
            context.add_init_script("""
                Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            """)
            page = context.new_page()
            # Step 1: Login if credentials provided
            if credentials:
                success = self._handle_login(page, credentials)
                if not success:
                    print("Warning: Login may have failed, continuing...")
            # Step 2: Navigate to start URL
            try:
                print(f"Navigating to: {start_url}")
                page.goto(start_url, wait_until="domcontentloaded", timeout=30000)
                page.wait_for_timeout(2000)
            except PWTimeout:
                return ScrapingResult(
                    products=[], page_count=0, total_scraped=0,
                    failed_pages=1, duration_seconds=time.time() - start_time,
                    url=start_url, query=search_query
                )
            # Step 3: Search if query provided
            if search_query:
                self._handle_search(page, search_query)
            # Step 4: Scrape pages with pagination
            while page_count < max_pages:
                page_count += 1
                current_url = page.url
                print(f"\nScraping page {page_count}: {current_url[:80]}")
                # Apply politeness delay
                if page_count > 1:
                    self._apply_rate_limit()
                # Wait for content to render
                page.wait_for_timeout(1500)
                # Extract products from current page
                try:
                    page_html = self._get_page_html(page)
                    products = self.extractor.extract_products(page_html, current_url)
                    print(f"  Extracted {len(products)} products")
                    all_products.extend(products)
                except Exception as e:
                    print(f"  Extraction error: {e}")
                    failed_pages += 1
                # Try to go to next page
                if not self._go_to_next_page(page):
                    print("  No more pages")
                    break
            browser.close()
        # Build result
        result = ScrapingResult(
            products=all_products,
            page_count=page_count,
            total_scraped=len(all_products),
            failed_pages=failed_pages,
            duration_seconds=time.time() - start_time,
            url=start_url,
            query=search_query,
        )
        # Save to file if requested
        if output_file:
            Path(output_file).write_text(
                result.model_dump_json(indent=2)
            )
            print(f"\nResults saved to: {output_file}")
        print(f"\nDone: {len(all_products)} products from {page_count} pages "
              f"in {result.duration_seconds:.1f}s")
        return result
    def _handle_login(self, page: Page, credentials: dict) -> bool:
        """Handle login flow."""
        login_url = credentials.get("login_url", "")
        if login_url:
            page.goto(login_url, wait_until="networkidle", timeout=20000)
        # Wait a moment for page to settle
        page.wait_for_timeout(1000)
        # Try to fill login form
        username_filled = False
        for sel in ["input[type=email]", "input[type=text]", "input[name*=user]", "#username"]:
            try:
                page.fill(sel, credentials["username"], timeout=3000)
                username_filled = True
                break
            except Exception:
                continue
        if not username_filled:
            print("Could not find username field")
            return False
        for sel in ["input[type=password]", "input[name=password]", "#password"]:
            try:
                page.fill(sel, credentials["password"], timeout=3000)
                break
            except Exception:
                continue
        # Submit
        for sel in ["button[type=submit]", "input[type=submit]",
                    "button:has-text('Sign in')", "button:has-text('Log in')"]:
            try:
                page.click(sel, timeout=3000)
                page.wait_for_load_state("networkidle", timeout=15000)
                return True
            except Exception:
                continue
        return False

    def _handle_search(self, page: Page, query: str) -> bool:
        """Enter a search query."""
        search_selectors = [
            "input[type=search]",
            "input[name=q]",
            "input[name=search]",
            "input[placeholder*='search' i]",
            "input[aria-label*='search' i]",
            "#search",
            ".search-input",
        ]
        for sel in search_selectors:
            try:
                page.fill(sel, query, timeout=3000)
                page.press(sel, "Enter")
                page.wait_for_load_state("networkidle", timeout=15000)
                print(f"Search submitted: '{query}'")
                return True
            except Exception:
                continue
        print(f"Warning: Could not find search field for query: '{query}'")
        return False
    def _go_to_next_page(self, page: Page) -> bool:
        """Try to navigate to the next page."""
        from urllib.parse import urljoin

        # Try common next page patterns
        next_selectors = [
            "a[rel='next']",
            "[aria-label='Next page']",
            "[aria-label='Next']",
            "a:has-text('Next')",
            "button:has-text('Next')",
            ".pagination-next:not(.disabled)",
            ".next:not(.disabled)",
        ]
        for sel in next_selectors:
            try:
                btn = page.locator(sel).first
                if btn.is_visible(timeout=2000) and btn.is_enabled(timeout=1000):
                    href = btn.get_attribute("href")
                    # Follow real links directly. Relative hrefs are resolved
                    # against the current URL because page.goto() needs an
                    # absolute URL; "#" and "javascript:" links are clicked.
                    if href and not href.startswith(("#", "javascript:")):
                        page.goto(urljoin(page.url, href),
                                  wait_until="domcontentloaded", timeout=20000)
                    else:
                        btn.click()
                        page.wait_for_load_state("domcontentloaded", timeout=20000)
                    return True
            except Exception:
                continue
        return False
# --- Example Usage ---
if __name__ == "__main__":
    import os
    agent = ScrapingAgent(
        api_key=os.environ["ANTHROPIC_API_KEY"],
        politeness_delay=2.5  # 2.5 second base delay between pages
    )
    # Scrape laptops from a test e-commerce site
    result = agent.scrape(
        start_url="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops",
        max_pages=3,
        output_file="/tmp/laptop_prices.json",
    )
    print(f"\nExtracted {result.total_scraped} products:")
    for p in result.products[:5]:  # Show first 5
        print(f"  {p.name[:60]:<60} ${p.price:.2f}")
    if len(result.products) > 5:
        print(f"  ... and {len(result.products) - 5} more")
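One fragile spot in the extractor above is the greedy regular expression used to pull JSON out of the model's reply. If the reply contains any other square brackets, the match spans from the first `[` to the last `]` and fails to parse. The sketch below shows one possible alternative using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_first_json_array` is made up for this example and is not part of the agent.

```python
import json
from typing import Optional


def extract_first_json_array(text: str) -> Optional[list]:
    """Return the first JSON array that parses cleanly from `text`, else None."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        start = text.find("[", start + 1)
    return None
```

For a reply such as `Sure [see note]: [{"name": "A", "price": 1.5}]`, the greedy regex would capture an unparseable span, while this helper skips the first bracket and returns the product list.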
:::warning Respect robots.txt
Before any scraping project, check robots.txt:
import urllib.robotparser
def can_scrape(base_url: str, path: str,
               user_agent: str = "*") -> bool:
    """Check robots.txt before scraping."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, f"{base_url}{path}")

# Usage
if not can_scrape("https://example.com", "/products"):
    print("robots.txt disallows this path")
else:
    # Proceed with scraping
    pass
Disregarding robots.txt may violate the site's Terms of Service and could lead to IP bans, legal action (in some jurisdictions), or permanent loss of access to the data source.
:::
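Some sites also publish a `Crawl-delay` directive in robots.txt. It is not part of the core standard, but Python's `urllib.robotparser` exposes it through `crawl_delay()`. The sketch below parses rules from a string so it runs offline. The sample rules and the helper names `load_rules` and `politeness_delay` are assumptions for illustration.

```python
import urllib.robotparser

# Sample rules for illustration only; a real run would fetch the site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def politeness_delay(rp: urllib.robotparser.RobotFileParser,
                     user_agent: str = "*", default: float = 2.0) -> float:
    """Use the site's Crawl-delay when it is longer than our default."""
    declared = rp.crawl_delay(user_agent)
    if declared is None:
        return default
    return max(default, float(declared))
```

The result could be passed as the `politeness_delay` argument of the agent, so that the site's stated preference wins whenever it is stricter than the configured default.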
:::danger Rate Limiting and IP Bans
Aggressive scraping without rate limiting will trigger IP bans. Production guidelines:
- Minimum 1–3 seconds between page requests
- Add random jitter to delays (±50%) to avoid fingerprinting by exact timing
- Respect Retry-After headers if you receive 429 (Too Many Requests)
- Rotate proxies if operating at scale (residential proxies for anti-bot, datacenter for open sites)
- Never scrape at full speed during business hours if the target site is a small business
An IP ban from a critical data source can cripple a business workflow. Treat it as seriously as a production database connection.
:::
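The jitter and Retry-After guidelines can be combined into a single delay calculation. The following is a sketch under assumptions: `parse_retry_after` and `next_delay` are hypothetical helpers, and the header value is assumed to arrive as a plain string taken from the HTTP response.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header (seconds or HTTP-date) to seconds."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: ignore it
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(base: float, retry_after: Optional[str] = None,
               jitter: float = 0.5,
               rng: Optional[random.Random] = None) -> float:
    """Base delay with +/- `jitter` randomization, never below Retry-After."""
    rng = rng or random.Random()
    jittered = base * rng.uniform(1 - jitter, 1 + jitter)
    server_wait = parse_retry_after(retry_after)
    if server_wait is None:
        return jittered
    return max(jittered, server_wait)
```

With `base=2.0` the delay falls between 1 and 3 seconds, and a 429 response carrying `Retry-After: 30` pushes it to 30 seconds. Taking the maximum means the server's instruction always overrides the local schedule.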
Interview Questions and Answers
Q: When should you use an agent for web scraping instead of a traditional scraper with CSS selectors?
A: Use agents when: (1) the site relies on heavy JavaScript rendering and its CSS selectors keep breaking after React/Vue re-renders, (2) the site requires login and session management that breaks traditional cookie handling, (3) the site changes its layout frequently, making selector maintenance expensive, (4) the scraping workflow is conditional (different paths for different product categories), or (5) the site employs aggressive anti-bot measures that require adaptive, human-like behavior. Use traditional scrapers for static HTML, stable DOM structures, or high-volume extractions where LLM costs would be prohibitive.
Q: How do you handle session expiry in a long-running scraping agent?
A: Implement session validation before each scraping run: check for authenticated-only elements (user menu, account icon) after loading the saved session. If the session appears expired, re-run the login flow. Use Playwright's storage_state() to save cookies after successful login, and reload with storage_state=path on the next run. For very long scraping runs, implement periodic session checks: after N pages, request a page that requires authentication and verify the response looks correct (not a login redirect).
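The periodic check described in this answer might look like the sketch below. `SessionGuard` is a hypothetical helper. The `is_logged_in` and `login` callables stand in for whatever the scraper uses, for example a check for a user-menu element and a call to the login flow.

```python
from typing import Callable


class SessionGuard:
    """Re-validates the session every `check_every` pages (hypothetical helper)."""

    def __init__(self, is_logged_in: Callable[[], bool],
                 login: Callable[[], bool], check_every: int = 25):
        self.is_logged_in = is_logged_in
        self.login = login
        self.check_every = check_every
        self.pages_since_check = 0

    def before_page(self) -> None:
        """Call before each page fetch; raises if the session cannot be restored."""
        self.pages_since_check += 1
        if self.pages_since_check < self.check_every:
            return
        self.pages_since_check = 0
        if self.is_logged_in():
            return
        # Session looks expired: log in again and confirm that it worked.
        if not self.login() or not self.is_logged_in():
            raise RuntimeError("Session expired and re-login failed")
```

Raising on a failed re-login stops the run early, which avoids silently scraping logged-out pages.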
Q: Describe your approach to handling different pagination patterns in a scraping agent.
A: Use a multi-strategy detection and execution approach: (1) check for <a rel="next"> (most reliable indicator), (2) look for common Next button selectors by CSS class or ARIA label, (3) try text matching for "Next", "›", etc., (4) detect infinite scroll by comparing document.body.scrollHeight before and after scrolling, (5) detect Load More buttons. For the agent, after extracting each page, try each strategy in order and stop when any succeeds. If none succeed, stop and report completion. Use LLM reasoning as a fallback when CSS-selector strategies all fail.
Q: How do you validate scraped data quality in a production scraping pipeline?
A: Use Pydantic models for structural validation (type checking, value ranges, required fields). Beyond Pydantic: (1) validate numeric fields against expected ranges (prices within reasonable bounds, ratings between 0 and 5), (2) check URL validity for product URLs, (3) compare extracted count against expected count (if pagination says "1,234 results" but you extracted 50, investigate), (4) spot-check a random sample of records against manual verification, (5) track extraction rate over time - a sudden drop in products-per-page indicates a scraper break. Alert on deviations greater than 20% from historical average.
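The extraction-rate check in point (5) reduces to a small comparison against the historical mean. A minimal sketch follows. The function name and the choice of a plain arithmetic mean are assumptions, and a production pipeline might prefer a rolling median.

```python
from statistics import mean
from typing import Sequence


def extraction_rate_alert(history: Sequence[float], current: float,
                          threshold: float = 0.20) -> bool:
    """True if `current` deviates from the historical mean by more than `threshold`."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > threshold
```

If recent runs averaged 25 products per page, a run that yields 12 triggers the alert, while one that yields 23 does not.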
Q: What are the legal and ethical constraints on web scraping, and how do they affect architectural decisions?
A: Key constraints: (1) robots.txt - check before scraping, respect Disallow directives; (2) Terms of Service - many sites explicitly prohibit automated access; (3) GDPR/CCPA - personal data of EU and California residents is subject to specific restrictions on collection and storage; (4) Copyright - scraped content may be copyrighted; transforming it for analysis is often treated more leniently than republishing it, but the rules vary by jurisdiction; (5) Rate limiting - aggressive scraping can amount to a denial-of-service attack, especially against smaller sites. Architectural implications: always check robots.txt programmatically before scraping any path; implement configurable rate limiting; avoid storing personal data beyond what's needed for the task; document the legal basis for each scraping operation; prefer official data exports or APIs when offered.
