01 - Agent Risk Taxonomy
Eight categories of agent risk, the confused deputy problem, severity matrices, and a Python risk assessment module.
How agents break complex goals into ordered, dependency-tracked subtasks. Hierarchical decomposition, DAG representation, dynamic replanning, and a full Python implementation.
Least privilege, reversibility preference, scope confirmation, and a Python minimal-footprint agent wrapper.
Zero-shot, chain-of-thought, Tree of Thoughts, ReWOO, and MCTS-guided planning. When LLM plans fail and how to recover. Full Python implementation of Tree of Thoughts.
How to save agent state mid-run, resume after failures, design idempotent actions, and build production-grade checkpoint systems with SQLite and S3.
Indirect prompt injection attacks, real-world examples, detection and defense strategies, and a Python injection defense system.
Pre- and post-action guardrails, composable validators, denylist enforcement, rate limiting, and a complete Python guardrail pipeline.
How agents detect ambiguous instructions, decide when to ask vs. proceed, design targeted clarification questions, and avoid the overly cautious anti-pattern.
When and how agents pause for human judgment. Action classification, async approval workflows, Slack-based HITL, and resuming after interruption.
How to evaluate multi-step agent trajectories. Task completion, path quality, error recovery, efficiency, and LLM-as-judge. Benchmarks and trajectory scorers.
How agents pass information: message formats, schemas, synchronous vs. asynchronous communication, routing, error propagation, and tracing through multi-agent systems.
Precise technical definitions for chatbots, workflows, and AI agents - with decision criteria, cost/reliability tradeoffs, and code examples of all three for the same task.
Build production-grade AI agents - from MCP and tool use to multi-agent systems and long-horizon task completion.
How coding agents read, navigate, and surgically modify existing codebases: edit strategies, minimal diffs, regression prevention, and multi-file coordination.
The 5 core patterns from Anthropic's research - prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer - with full Python implementations.
Microsoft AutoGen v0.4 - event-driven multi-agent runtime, AgentChat teams, code execution, and production patterns for conversational AI systems.
Microsoft AutoGen v0.4: async conversational multi-agent systems, actor model architecture, group chat patterns, and MagenticOne.
Understanding computer use agent benchmarks - WebArena, OSWorld, ScreenSpot, Mind2Web. Current SOTA results, what the numbers mean, and how to evaluate your own agent.
Building practical browser agents using Playwright and LLMs - DOM manipulation, visual navigation, session management, anti-bot handling, and complete Python implementation.
Hands-on guide to building a production-quality MCP filesystem server in Python using the official MCP SDK - complete with 4 tools, resources, MCP Inspector testing, and Claude Desktop integration.
Build a complete, functional coding agent from scratch in Python. Architecture decisions, repo maps, context management, system prompts, safety, and the full 500-line agent.
Why evaluating agentic systems is fundamentally harder than evaluating static models - the multi-path problem, compound errors, latent failures, and how to build an evaluation mindset.
How Anthropic's Computer Use API works - the screenshot-action loop, the three tools, coordinate systems, and building a working computer use agent with Docker.
CrewAI v0.80+: role-based multi-agent systems with Crew, Agent, Task, Process, and Flow - the most production-friendly multi-agent framework.
CrewAI in production - agents, tasks, crews, memory systems, Flows, and deep-dive patterns for role-based multi-agent pipelines.
How to build agents whose memory survives restarts - architecture, storage backends, session restoration, and privacy-aware memory pruning for production systems.
How LLMs critiquing each other improves quality: verifier/critic patterns, multi-agent debate, ensemble approaches, and convergence detection.
Implement agent episodic memory using vector databases: storing, retrieving, consolidating, and forgetting past experiences at scale.
Cognitive science meets AI engineering: working, episodic, semantic, and procedural memory implemented in production agent systems.
Comprehensive comparison of LangGraph, CrewAI, AutoGen, LlamaIndex, and raw API across 12 production dimensions - with decision flowchart and real case studies.
GAIA tests general-purpose agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. Learn the task structure, scoring, SOTA analysis, and how to build GAIA-style evaluations.
Vision-based GUI automation for desktop applications - coordinate grounding, UI element detection, OCR integration, state tracking, and building a desktop automation agent.
Deep dive into coding agent architecture: how agents navigate codebases, plan edits, execute changes, and iterate using test feedback.
When and how to run human evaluation for agentic systems - annotator selection, rubric design, inter-annotator agreement, crowdsourcing quality control, and closing the feedback loop.
Design human oversight that is meaningful, not performative - risk-based interruption, async approval queues, audit trails, and graduated autonomy.
Managing the context window as working memory: token budgeting, sliding windows, summarization, and the lost-in-the-middle problem.
LangGraph: stateful graph-based multi-agent systems with checkpointing, human-in-the-loop, streaming, and the supervisor pattern - the most powerful and flexible agent framework.
Graph-based stateful agent orchestration with LangGraph - StateGraph, typed state, nodes, conditional edges, checkpointing, and human-in-the-loop.
LlamaIndex's document-centric agent framework - VectorStoreIndex, QueryEngine, FunctionCallingAgent, and the Workflow event-driven orchestration model.
Using LLMs to evaluate other agents' trajectories and outputs at scale - rubric design, pairwise comparison, bias mitigation, calibration, and escalation logic.
Deep dive into MCP's client-server architecture - Host, Client, and Server roles; stdio and HTTP+SSE transport layers; JSON-RPC 2.0 message format; initialization handshake; capability negotiation; and full lifecycle.
The growing MCP ecosystem - official Anthropic servers, community landscape, MCP registries, evaluating third-party servers, IDE integrations, and patterns for building ecosystem vs. team-specific servers.
Security model of the Model Context Protocol - attack surfaces including tool poisoning, resource injection, and confused deputy attacks, plus permission scoping, transport security, and a production security checklist.
Deep dive into MCP's three primitives - Tools (callable functions), Resources (readable data), and Prompts (reusable templates) - with complete Python implementations of each.
Deep architectural comparison of MCP and function calling - where each operates, when to use each, the decision matrix, hybrid patterns, and how to migrate from function calling to MCP.
How to keep agents functional across days-long tasks by compressing memory intelligently - preserving what matters, discarding what does not.
Master the foundational concepts of AI agents - what they are, how they reason, how they act, and when to use them.
How AI agents see, understand, and interact with graphical interfaces - browsers, desktops, and GUIs - using vision models and action executors.
Coding agents are the most commercially successful form of agentic AI. Learn how GitHub Copilot, Cursor, Devin, and Claude Code work under the hood.
How agents decompose complex multi-step tasks, plan across long horizons, recover from failures, and know when to ask for help.
How agents store, retrieve, and manage knowledge across interactions - working memory, episodic memory, semantic memory, procedural memory, and cross-session persistence.
Orchestration, communication, parallelism, and real frameworks - from first principles to production multi-agent systems.
Evaluation is the most underrated problem in agentic AI. Without it, you cannot improve, catch regressions, or build trust. This module covers trajectory scoring, benchmarks, LLM-as-judge, human evaluation, and production monitoring.
Risk taxonomy, minimal footprint, prompt injection defense, guardrails, human oversight, sandboxing, and responsible deployment.
LangChain, LangGraph, LlamaIndex, CrewAI, AutoGen, raw API - an honest comparison with production lessons.
A module map of the Model Context Protocol - from core concepts through architecture, primitives, building servers, security, ecosystem, and comparison with function calling.
OpenAI's experimental multi-agent framework: agents, handoffs, context variables, and the triage pattern. What it gets right and wrong.
The most reliable multi-agent pattern: one orchestrator plans, subagents execute. Deep dive into task decomposition, assignment strategies, and production-grade implementation.
Running agents concurrently with asyncio, worker pools, DAG-based scheduling, rate limiting, and cost/speed tradeoffs in parallel multi-agent systems.
How agents store and reuse successful action sequences: skill formation, retrieval, composition, and refinement from execution feedback.
Monitoring agents in production - task completion metrics, distributed tracing, anomaly detection, alerting, and the production improvement flywheel.
12 hard-won lessons from deploying agentic systems at scale - each with a war story, a principle, and a code pattern you can use today.
Building production agents with just the Anthropic SDK - the agentic loop, tool handling, context management, cost tracking, and a complete 200-line implementation.
Safety principles, EU AI Act compliance, accountability chains, bias, privacy, red-teaming, and building a safety review process for autonomous agent systems.
Safety architecture for computer use agents - threat models, prompt injection, Docker sandboxing, action confirmation gates, logging, and anomaly detection.
Contain the blast radius of any agent failure - process isolation, Docker security hardening, network policy, E2B cloud sandboxes, and escape vector prevention.
Structured world knowledge for agents: building and querying knowledge graphs with entity extraction, relationship traversal, and hybrid vector+graph retrieval.
How to evaluate coding agents: SWE-bench, SWE-bench Verified, SOTA numbers, failure modes, and building your own evaluation harness.
SWE-bench Verified is the gold standard for evaluating coding agents on real GitHub issues. Learn the evaluation methodology, Docker harness, failure mode taxonomy, and how to interpret benchmark scores.
The most powerful technique for coding agents: use test output as the ground truth feedback signal. TDD loops, pytest integration, output parsing, and backtracking.
Master the Observe-Think-Act loop that drives every AI agent - from the detailed mechanics of each phase to error handling, backtracking, and token management.
Master the ReAct (Reasoning + Acting) pattern - the 2022 breakthrough that grounds LLM reasoning in real observations and helps reduce hallucination in agents.
Master how AI agents call tools - from JSON schema definitions to parallel execution, error handling, and the tool design principles that make agents reliable.
Complete coding agent tool set: file operations, bash execution, search, git integration, LSP queries - full implementations with safety and error handling.
Building LLM tool use systems in Python - function calling, tool schemas, execution loops, error handling, and multi-step agent patterns.
Evaluating the full action sequence, not just the final output - trajectory metrics, automatic scoring, and comparing agent versions.
Agent-based web scraping - handling dynamic JavaScript rendering, login flows, multi-page pagination, structured data extraction, and anti-detection techniques.
Understand precisely what an AI agent is - the definition, the 5 key properties, the taxonomy, and why LLMs finally made agents practical.
The Model Context Protocol - announced by Anthropic in November 2024 - solves the N×M integration problem by giving AI systems a standard way to connect to any tool or data source.
The framework vs raw API decision for agents - what abstractions cost, what they provide, and a decision tree based on your actual requirements.
A decision framework for when autonomous agents are appropriate vs. when simpler approaches are better - covering cost of agency, task classification, anti-patterns, and ROI analysis.
The fundamental case for multi-agent: parallelization, specialization, and verification - and the honest cost of coordination overhead.