Module 03: Computer Use Agents

What This Module Is About

In October 2024, Anthropic released something that felt genuinely different: Claude 3.5 Sonnet could take a screenshot of a computer screen, understand what it saw, and then click buttons, type text, and navigate interfaces - just like a human operator.

This is computer use. And it changes what automation can do.

Until computer use, automation required APIs. If a service did not expose an API, or if you needed to interact with a legacy desktop app, or if the interface changed frequently and broke your brittle CSS selectors - you were stuck. You needed a human in the loop.

Computer use agents remove that requirement. Any interface a human can see and interact with, an agent can now interact with. The implications are enormous.

Module Map

Lesson Table

#	Lesson	Key Concepts	What You Build
01	Computer Use Architecture	Screenshot-to-action loop, Anthropic tools, coordinate systems	Working computer use agent with Docker sandbox
02	Browser Agents	Playwright, DOM vs vision, session management, anti-bot	Browser agent for e-commerce data extraction
03	GUI Automation with Vision	Desktop GUIs, coordinate grounding, OCR, PyAutoGUI	Vision-based desktop automation agent
04	Web Scraping Agents	Dynamic rendering, auth handling, pagination, extraction	Full scraping agent with login + pagination
05	Safety and Sandboxing	Threat model, Docker sandbox, action confirmation, logging	Sandboxed agent with approval gates
06	Benchmarks	WebArena, OSWorld, ScreenSpot, SOTA results, evaluation	Custom test suite for your agent

Key Concepts You Will Learn

The Perception-Action Loop Computer use agents operate in a cycle: take a screenshot, analyze the visual state, decide what action to take, execute that action, and repeat. This is the same loop a human uses at a computer - just automated.

Three Abstraction Levels Computer use sits at the intersection of three technologies: vision models that understand screenshots, action executors that send inputs to the OS, and LLM reasoning that connects perception to action.

The Anthropic Computer Use API Three tools: computer (screenshot, click, type, scroll), text_editor (read/write files), and bash (run commands). Together they give an agent complete control over a Linux desktop.

Browser Agents vs Desktop Agents Browser agents (Playwright + LLM) are the most practical form today. They handle 90% of real-world computer use tasks: forms, dashboards, data extraction, login flows. Desktop agents go further - interacting with native apps.

Safety Is Not Optional An agent that can click anything and type anything is dangerous. Prompt injection via malicious screen content is a real attack vector. Every production computer use deployment needs sandboxing, action confirmation, and logging.

What You Will Build

By the end of this module you will have a working browser agent that:

Accepts a natural language task (e.g., "Find the cheapest laptop on [site] under $800 with at least 16GB RAM")
Launches a sandboxed browser
Navigates the site, searches, filters, paginates
Extracts structured data (Pydantic model)
Returns structured results with confidence scores
Logs every action for debugging and audit

The agent handles login walls, pagination, timeouts, and unexpected popups - all without brittle CSS selectors.

Prerequisites

Completed Module 01 (Agent Foundations) and Module 02 (Tools and Function Calling)
Python 3.11+
Docker installed (for the sandbox lessons)
Anthropic API key with computer use access
pip install anthropic playwright pydantic for the coding exercises

Why Computer Use Matters

The long arc of software is toward natural interfaces. First we had command lines. Then GUIs. Then web. Then mobile. Each transition created new automation gaps - the GUI was hard to automate, then the web was hard to automate.

Computer use agents close the final gap: anything a human can see, an agent can now operate. Legacy systems, internal tools with no API, complex multi-step web workflows - all become automatable.

This is not a future capability. It is available today, in production, at Anthropic. Understanding how it works - and how to use it safely - is a core competency for AI engineers in 2024 and beyond.

What This Module Is About​

Module Map​

Lesson Table​

Key Concepts You Will Learn​

What You Will Build​

Prerequisites​

Why Computer Use Matters​