83 docs tagged with "agentic-ai"

01 - Agent Risk Taxonomy

Eight categories of agent risk, the confused deputy problem, severity matrices, and a Python risk assessment module.

01 - Task Decomposition

How agents break complex goals into ordered, dependency-tracked subtasks. Hierarchical decomposition, DAG representation, dynamic replanning, and full Python implementation.

02 - Minimal Footprint Principle

Least privilege, reversibility preference, scope confirmation, and a Python minimal-footprint agent wrapper.

02 - Planning with LLMs

Zero-shot, chain-of-thought, Tree of Thoughts, ReWOO, and MCTS-guided planning. When LLM plans fail and how to recover. Full Python implementation of Tree of Thoughts.

03 - Checkpointing and Recovery

How to save agent state mid-run, resume after failures, design idempotent actions, and build production-grade checkpoint systems with SQLite and S3.

03 - Prompt Injection in Agents

Indirect prompt injection attacks, real-world examples, detection and defense strategies, and a Python injection defense system.

04 - Guardrails and Action Validation

Pre- and post-action guardrails, composable validators, denylist enforcement, rate limiting, and a complete Python guardrail pipeline.

04 - Handling Ambiguity and Clarification

How agents detect ambiguous instructions, decide when to ask vs. proceed, design targeted clarification questions, and avoid the overly cautious anti-pattern.

05 - Interruption and Human-in-the-Loop

When and how agents pause for human judgment. Action classification, async approval workflows, Slack-based HITL, and resuming after interruption.

06 - Evaluation of Long-Horizon Tasks

How to evaluate multi-step agent trajectories. Task completion, path quality, error recovery, efficiency, and LLM-as-judge. Benchmarks and trajectory scorers.

Agent Communication Protocols

How agents pass information: message formats, schemas, synchronous vs async, routing, error propagation, and tracing through multi-agent systems.

Agent vs Chatbot vs Workflow

Precise technical definitions for chatbots, workflows, and AI agents - with decision criteria, cost/reliability tradeoffs, and code examples of all three for the same task.

Agentic AI - Engineering Track

Build production-grade AI agents - from MCP and tool use to multi-agent systems and long-horizon task completion.

Agentic Code Editing

How coding agents read, navigate, and surgically modify existing codebases: edit strategies, minimal diffs, regression prevention, and multi-file coordination.

Agentic Design Patterns

The 5 core patterns from Anthropic's research - prompt chaining, routing, parallelization, orchestrator-subagents, and evaluator-optimizer - with full Python implementations.

AutoGen Conversational Agents

Microsoft AutoGen v0.4 - event-driven multi-agent runtime, AgentChat teams, code execution, and production patterns for conversational AI systems.

AutoGen Deep Dive

Microsoft AutoGen v0.4: async conversational multi-agent systems, actor model architecture, group chat patterns, and Magentic-One.

Benchmarks: WebArena and OSWorld

Understanding computer use agent benchmarks - WebArena, OSWorld, ScreenSpot, Mind2Web. Current SOTA results, what the numbers mean, and how to evaluate your own agent.

Browser Agents

Building practical browser agents using Playwright and LLMs - DOM manipulation, visual navigation, session management, anti-bot handling, and complete Python implementation.

Building an MCP Server

Hands-on guide to building a production-quality MCP filesystem server in Python using the official MCP SDK - complete with 4 tools, resources, MCP Inspector testing, and Claude Desktop integration.

Building Your Own Coding Agent

Build a complete, functional coding agent from scratch in Python. Architecture decisions, repo maps, context management, system prompts, safety, and the full 500-line agent.

Challenges of Evaluating Agents

Why evaluating agentic systems is fundamentally harder than evaluating static models - the multi-path problem, compound errors, latent failures, and how to build an evaluation mindset.

Computer Use Architecture

How Anthropic's Computer Use API works - the screenshot-action loop, the three tools, coordinate systems, and building a working computer use agent with Docker.

CrewAI

CrewAI v0.80+: role-based multi-agent systems with Crew, Agent, Task, Process, and Flow - the most production-friendly multi-agent framework.

CrewAI Multi-Agent Systems

CrewAI in production - agents, tasks, crews, memory systems, Flows, and deep-dive patterns for role-based multi-agent pipelines.

Cross-Session Persistence

How to build agents whose memory survives restarts - architecture, storage backends, session restoration, and privacy-aware memory pruning for production systems.

Debate and Critique Patterns

How having LLMs critique each other improves quality: verifier/critic patterns, multi-agent debate, ensemble approaches, and convergence detection.

Episodic Memory with Vector Store

Implement agent episodic memory using vector databases: storing, retrieving, consolidating, and forgetting past experiences at scale.

Four Types of Agent Memory

Cognitive science meets AI engineering: working, episodic, semantic, and procedural memory implemented in production agent systems.

Framework Comparison

Comprehensive comparison of LangGraph, CrewAI, AutoGen, LlamaIndex, and raw API across 12 production dimensions - with decision flowchart and real case studies.

GAIA tests general-purpose agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. Learn the task structure, scoring, SOTA analysis, and how to build GAIA-style evaluations.

GUI Automation with Vision

Vision-based GUI automation for desktop applications - coordinate grounding, UI element detection, OCR integration, state tracking, and building a desktop automation agent.

How Coding Agents Work

Deep dive into coding agent architecture: how agents navigate codebases, plan edits, execute changes, and iterate using test feedback.

Human Evaluation for Agents

When and how to run human evaluation for agentic systems - annotator selection, rubric design, inter-annotator agreement, crowdsourcing quality control, and closing the feedback loop.

Human Oversight Mechanisms

Design human oversight that is meaningful, not performative - risk-based interruption, async approval queues, audit trails, and graduated autonomy.

In-Context Working Memory

Managing the context window as working memory: token budgeting, sliding windows, summarization, and the lost-in-the-middle problem.

LangChain Architecture - REPLACED

replaced

LangGraph

LangGraph: stateful graph-based multi-agent systems with checkpointing, human-in-the-loop, streaming, and the supervisor pattern - the most powerful and flexible agent framework.

LangGraph for Stateful Agents

Graph-based stateful agent orchestration with LangGraph - StateGraph, typed state, nodes, conditional edges, checkpointing, and human-in-the-loop.

LlamaIndex Architecture

LlamaIndex's document-centric agent framework - VectorStoreIndex, QueryEngine, FunctionCallingAgent, and the Workflow event-driven orchestration model.

LLM as Agent Judge

Using LLMs to evaluate other agents' trajectories and outputs at scale - rubric design, pairwise comparison, bias mitigation, calibration, and escalation logic.

MCP Architecture - Client-Server

Deep dive into MCP's client-server architecture - Host, Client, and Server roles; stdio and HTTP+SSE transport layers; JSON-RPC 2.0 message format; initialization handshake; capability negotiation; and full lifecycle.

MCP Ecosystem and Servers

The growing MCP ecosystem - official Anthropic servers, community landscape, MCP registries, evaluating third-party servers, IDE integrations, and patterns for building ecosystem vs. team-specific servers.

MCP Security and Permissions

Security model of the Model Context Protocol - attack surfaces including tool poisoning, resource injection, and confused deputy attacks, plus permission scoping, transport security, and a production security checklist.

MCP Tools, Resources, and Prompts

Deep dive into MCP's three primitives - Tools (callable functions), Resources (readable data), and Prompts (reusable templates) - with complete Python implementations of each.

MCP vs Function Calling

Deep architectural comparison of MCP and function calling - where each operates, when to use each, the decision matrix, hybrid patterns, and how to migrate from function calling to MCP.

Memory Compression and Summarization

How to keep agents functional across days-long tasks by compressing memory intelligently - preserving what matters, discarding what does not.

Module 01: Agentic Foundations

Master the foundational concepts of AI agents - what they are, how they reason, how they act, and when to use them.

Module 03: Computer Use Agents

How AI agents see, understand, and interact with graphical interfaces - browsers, desktops, and GUIs - using vision models and action executors.

Module 04: Coding Agents

Coding agents are the most commercially successful form of agentic AI. Learn how GitHub Copilot, Cursor, Devin, and Claude Code work under the hood.

Module 05: Long-Horizon Planning

How agents decompose complex multi-step tasks, plan across long horizons, recover from failures, and know when to ask for help.

Module 06: Agent Memory

How agents store, retrieve, and manage knowledge across interactions - working memory, episodic memory, semantic memory, procedural memory, and cross-session persistence.

Module 07: Multi-Agent Systems

Orchestration, communication, parallelism, and real frameworks - from first principles to production multi-agent systems.

Module 08: Agent Evaluation

Evaluation is the most underrated problem in agentic AI. Without it, you cannot improve, catch regressions, or build trust. This module covers trajectory scoring, benchmarks, LLM-as-judge, human evaluation, and production monitoring.

Module 09: Agent Safety

Risk taxonomy, minimal footprint, prompt injection defense, guardrails, human oversight, sandboxing, and responsible deployment.

Module 10: Agent Frameworks

LangChain, LangGraph, LlamaIndex, CrewAI, AutoGen, raw API - an honest comparison with production lessons.

Module 02: Model Context Protocol

A module map of the Model Context Protocol - from core concepts through architecture, primitives, building servers, security, ecosystem, and comparison with function calling.

OpenAI Swarm

OpenAI's experimental multi-agent framework: agents, handoffs, context variables, and the triage pattern. What it gets right and wrong.

Orchestrator-Subagent Pattern

The most reliable multi-agent pattern: one orchestrator plans, subagents execute. Deep dive into task decomposition, assignment strategies, and production-grade implementation.

Parallel Agent Execution

Running agents concurrently with asyncio, worker pools, DAG-based scheduling, rate limiting, and cost/speed tradeoffs in parallel multi-agent systems.

Procedural Memory and Learned Skills

How agents store and reuse successful action sequences: skill formation, retrieval, composition, and refinement from execution feedback.

Production Agent Monitoring

Monitoring agents in production - task completion metrics, distributed tracing, anomaly detection, alerting, and the production improvement flywheel.

Production Lessons

12 hard-won lessons from deploying agentic systems at scale - each with a war story, a principle, and a code pattern you can use today.

Raw API Agent Patterns

Building production agents with just the Anthropic SDK - the agentic loop, tool handling, context management, cost tracking, and a complete 200-line implementation.

Responsible Agentic AI

Safety principles, EU AI Act compliance, accountability chains, bias, privacy, red-teaming, and building a safety review process for autonomous agent systems.

Safety and Sandboxing

Safety architecture for computer use agents - threat models, prompt injection, Docker sandboxing, action confirmation gates, logging, and anomaly detection.

Sandboxing Agent Environments

Contain the blast radius of any agent failure - process isolation, Docker security hardening, network policy, E2B cloud sandboxes, and escape vector prevention.

Semantic Memory and Knowledge Graphs

Structured world knowledge for agents: building and querying knowledge graphs with entity extraction, relationship traversal, and hybrid vector+graph retrieval.

SWE-bench and Evaluation

How to evaluate coding agents: SWE-bench, SWE-bench Verified, SOTA numbers, failure modes, and building your own evaluation harness.

SWE-bench Verified

SWE-bench Verified is the gold standard for evaluating coding agents on real GitHub issues. Learn the evaluation methodology, Docker harness, failure mode taxonomy, and how to interpret benchmark scores.

Test-Driven Agent Loops

The most powerful technique for coding agents: use test output as the ground truth feedback signal. TDD loops, pytest integration, output parsing, and backtracking.

The Agent Loop: Observe, Think, Act

Master the Observe-Think-Act loop that drives every AI agent - from the detailed mechanics of each phase to error handling, backtracking, and token management.

The ReAct Pattern

Master the ReAct (Reasoning + Acting) pattern - the 2022 breakthrough that grounds LLM reasoning in real observations and reduces hallucination in agents.

Tool Use and Function Calling

Master how AI agents call tools - from JSON schema definitions to parallel execution, error handling, and the tool design principles that make agents reliable.

Tool Use for Coding

Complete coding agent tool set: file operations, bash execution, search, git integration, LSP queries - full implementations with safety and error handling.

Tool Use from Python

Building LLM tool use systems in Python - function calling, tool schemas, execution loops, error handling, and multi-step agent patterns.

Trajectory Evaluation

Evaluating the full action sequence, not just the final output - trajectory metrics, automatic scoring, and comparing agent versions.

Web Scraping Agents

Agent-based web scraping - handling dynamic JavaScript rendering, login flows, multi-page pagination, structured data extraction, and anti-detection techniques.

What are AI Agents?

Understand precisely what an AI agent is - the definition, the 5 key properties, the taxonomy, and why LLMs finally made agents practical.

What is MCP?

The Model Context Protocol - announced by Anthropic in November 2024 - solves the N×M integration problem by giving AI systems a standard way to connect to any tool or data source.

When to Use a Framework

The framework vs raw API decision for agents - what abstractions cost, what they provide, and a decision tree based on your actual requirements.

When to Use Agents

A decision framework for when autonomous agents are appropriate vs. when simpler approaches are better - covering cost of agency, task classification, anti-patterns, and ROI analysis.

Why Multi-Agent Systems?

The fundamental case for multi-agent: parallelization, specialization, and verification - and the honest cost of coordination overhead.