83 docs tagged with "agentic-ai"

01 - Agent Risk Taxonomy

Eight categories of agent risk, the confused deputy problem, severity matrices, and a Python risk assessment module.

01 - Task Decomposition

How agents break complex goals into ordered, dependency-tracked subtasks. Hierarchical decomposition, DAG representation, dynamic replanning, and full Python implementation.

02 - Planning with LLMs

Zero-shot, chain-of-thought, Tree of Thoughts, ReWOO, and MCTS-guided planning. When LLM plans fail and how to recover. Full Python implementation of Tree of Thoughts.

03 - Checkpointing and Recovery

How to save agent state mid-run, resume after failures, design idempotent actions, and build production-grade checkpoint systems with SQLite and S3.

03 - Prompt Injection in Agents

Indirect prompt injection attacks, real-world examples, detection and defense strategies, and a Python injection defense system.

06 - Evaluation of Long-Horizon Tasks

How to evaluate multi-step agent trajectories. Task completion, path quality, error recovery, efficiency, and LLM-as-judge. Benchmarks and trajectory scorers.

Agent Communication Protocols

How agents pass information: message formats, schemas, synchronous vs async, routing, error propagation, and tracing through multi-agent systems.

Agent vs Chatbot vs Workflow

Precise technical definitions for chatbots, workflows, and AI agents - with decision criteria, cost/reliability tradeoffs, and code examples of all three for the same task.

Agentic Code Editing

How coding agents read, navigate, and surgically modify existing codebases: edit strategies, minimal diffs, regression prevention, and multi-file coordination.

Agentic Design Patterns

The 5 core patterns from Anthropic's research - prompt chaining, routing, parallelization, orchestrator-subagents, and evaluator-optimizer - with full Python implementations.

AutoGen Conversational Agents

Microsoft AutoGen v0.4 - event-driven multi-agent runtime, AgentChat teams, code execution, and production patterns for conversational AI systems.

AutoGen Deep Dive

Microsoft AutoGen v0.4: async conversational multi-agent systems, actor model architecture, group chat patterns, and Magentic-One.

Benchmarks: WebArena and OSWorld

Understanding computer use agent benchmarks - WebArena, OSWorld, ScreenSpot, Mind2Web. Current SOTA results, what the numbers mean, and how to evaluate your own agent.

Browser Agents

Building practical browser agents using Playwright and LLMs - DOM manipulation, visual navigation, session management, anti-bot handling, and complete Python implementation.

Building an MCP Server

Hands-on guide to building a production-quality MCP filesystem server in Python using the official MCP SDK - complete with 4 tools, resources, MCP Inspector testing, and Claude Desktop integration.

Building Your Own Coding Agent

Build a complete, functional coding agent from scratch in Python. Architecture decisions, repo maps, context management, system prompts, safety, and the full 500-line agent.

Challenges of Evaluating Agents

Why evaluating agentic systems is fundamentally harder than evaluating static models - the multi-path problem, compound errors, latent failures, and how to build an evaluation mindset.

Computer Use Architecture

How Anthropic's Computer Use API works - the screenshot-action loop, the three tools, coordinate systems, and building a working computer use agent with Docker.

CrewAI

CrewAI v0.80+: role-based multi-agent systems with Crew, Agent, Task, Process, and Flow - the most production-friendly multi-agent framework.

CrewAI Multi-Agent Systems

CrewAI in production - agents, tasks, crews, memory systems, Flows, and deep-dive patterns for role-based multi-agent pipelines.

Cross-Session Persistence

How to build agents whose memory survives restarts - architecture, storage backends, session restoration, and privacy-aware memory pruning for production systems.

Debate and Critique Patterns

How LLMs critiquing each other improves quality: verifier/critic patterns, multi-agent debate, ensemble approaches, and convergence detection.

Episodic Memory with Vector Store

Implement agent episodic memory using vector databases: storing, retrieving, consolidating, and forgetting past experiences at scale.

Four Types of Agent Memory

Cognitive science meets AI engineering: working, episodic, semantic, and procedural memory implemented in production agent systems.

Framework Comparison

Comprehensive comparison of LangGraph, CrewAI, AutoGen, LlamaIndex, and raw API across 12 production dimensions - with decision flowchart and real case studies.

GAIA Benchmark

GAIA tests general-purpose agents on real-world tasks requiring web search, file reading, code execution, and multi-step reasoning. Learn the task structure, scoring, SOTA analysis, and how to build GAIA-style evaluations.

GUI Automation with Vision

Vision-based GUI automation for desktop applications - coordinate grounding, UI element detection, OCR integration, state tracking, and building a desktop automation agent.

How Coding Agents Work

Deep dive into coding agent architecture: how agents navigate codebases, plan edits, execute changes, and iterate using test feedback.

Human Evaluation for Agents

When and how to run human evaluation for agentic systems - annotator selection, rubric design, inter-annotator agreement, crowdsourcing quality control, and closing the feedback loop.

Human Oversight Mechanisms

Design human oversight that is meaningful, not performative - risk-based interruption, async approval queues, audit trails, and graduated autonomy.

In-Context Working Memory

Managing the context window as working memory: token budgeting, sliding windows, summarization, and the lost-in-the-middle problem.

LangGraph

LangGraph: stateful graph-based multi-agent systems with checkpointing, human-in-the-loop, streaming, and the supervisor pattern - the most powerful and flexible agent framework.

LangGraph for Stateful Agents

Graph-based stateful agent orchestration with LangGraph - StateGraph, typed state, nodes, conditional edges, checkpointing, and human-in-the-loop.

LlamaIndex Architecture

LlamaIndex's document-centric agent framework - VectorStoreIndex, QueryEngine, FunctionCallingAgent, and the Workflow event-driven orchestration model.

LLM as Agent Judge

Using LLMs to evaluate other agents' trajectories and outputs at scale - rubric design, pairwise comparison, bias mitigation, calibration, and escalation logic.

MCP Architecture - Client-Server

Deep dive into MCP's client-server architecture - Host, Client, and Server roles; stdio and HTTP+SSE transport layers; JSON-RPC 2.0 message format; initialization handshake; capability negotiation; and full lifecycle.

MCP Ecosystem and Servers

The growing MCP ecosystem - official Anthropic servers, community landscape, MCP registries, evaluating third-party servers, IDE integrations, and patterns for building ecosystem vs. team-specific servers.

MCP Security and Permissions

Security model of the Model Context Protocol - attack surfaces including tool poisoning, resource injection, and confused deputy attacks, plus permission scoping, transport security, and a production security checklist.

MCP Tools, Resources, and Prompts

Deep dive into MCP's three primitives - Tools (callable functions), Resources (readable data), and Prompts (reusable templates) - with complete Python implementations of each.

MCP vs Function Calling

Deep architectural comparison of MCP and function calling - where each operates, when to use each, the decision matrix, hybrid patterns, and how to migrate from function calling to MCP.

Module 03: Computer Use Agents

How AI agents see, understand, and interact with graphical interfaces - browsers, desktops, and GUIs - using vision models and action executors.

Module 04: Coding Agents

Coding agents are the most commercially successful form of agentic AI. Learn how GitHub Copilot, Cursor, Devin, and Claude Code work under the hood.

Module 06: Agent Memory

How agents store, retrieve, and manage knowledge across interactions - working memory, episodic memory, semantic memory, procedural memory, and cross-session persistence.

Module 07: Multi-Agent Systems

Orchestration, communication, parallelism, and real frameworks - from first principles to production multi-agent systems.

Module 08: Agent Evaluation

Evaluation is the most underrated problem in agentic AI. Without it, you cannot improve, catch regressions, or build trust. This module covers trajectory scoring, benchmarks, LLM-as-judge, human evaluation, and production monitoring.

Module 09: Agent Safety

Risk taxonomy, minimal footprint, prompt injection defense, guardrails, human oversight, sandboxing, and responsible deployment.

Module 2: Model Context Protocol

A module map of the Model Context Protocol - from core concepts through architecture, primitives, building servers, security, ecosystem, and comparison with function calling.

OpenAI Swarm

OpenAI's experimental multi-agent framework: agents, handoffs, context variables, and the triage pattern. What it gets right and wrong.

Orchestrator-Subagent Pattern

The most reliable multi-agent pattern: one orchestrator plans, subagents execute. Deep dive into task decomposition, assignment strategies, and production-grade implementation.

Parallel Agent Execution

Running agents concurrently with asyncio, worker pools, DAG-based scheduling, rate limiting, and cost/speed tradeoffs in parallel multi-agent systems.

Production Agent Monitoring

Monitoring agents in production - task completion metrics, distributed tracing, anomaly detection, alerting, and the production improvement flywheel.

Production Lessons

12 hard-won lessons from deploying agentic systems at scale - each with a war story, a principle, and a code pattern you can use today.

Raw API Agent Patterns

Building production agents with just the Anthropic SDK - the agentic loop, tool handling, context management, cost tracking, and a complete 200-line implementation.

Responsible Agentic AI

Safety principles, EU AI Act compliance, accountability chains, bias, privacy, red-teaming, and building a safety review process for autonomous agent systems.

Safety and Sandboxing

Safety architecture for computer use agents - threat models, prompt injection, Docker sandboxing, action confirmation gates, logging, and anomaly detection.

Sandboxing Agent Environments

Contain the blast radius of any agent failure - process isolation, Docker security hardening, network policy, E2B cloud sandboxes, and escape vector prevention.

Semantic Memory and Knowledge Graphs

Structured world knowledge for agents: building and querying knowledge graphs with entity extraction, relationship traversal, and hybrid vector+graph retrieval.

SWE-bench and Evaluation

How to evaluate coding agents: SWE-bench, SWE-bench Verified, SOTA numbers, failure modes, and building your own evaluation harness.

SWE-bench Verified

SWE-bench Verified is the gold standard for evaluating coding agents on real GitHub issues. Learn the evaluation methodology, Docker harness, failure mode taxonomy, and how to interpret benchmark scores.

Test-Driven Agent Loops

The most powerful technique for coding agents: use test output as the ground truth feedback signal. TDD loops, pytest integration, output parsing, and backtracking.

The Agent Loop: Observe, Think, Act

Master the Observe-Think-Act loop that drives every AI agent - from the detailed mechanics of each phase to error handling, backtracking, and token management.

The ReAct Pattern

Master the ReAct (Reasoning + Acting) pattern - the 2022 breakthrough that grounds LLM reasoning in real observations and reduces hallucination in agents.

Tool Use and Function Calling

Master how AI agents call tools - from JSON schema definitions to parallel execution, error handling, and the tool design principles that make agents reliable.

Tool Use for Coding

Complete coding agent tool set: file operations, bash execution, search, git integration, LSP queries - full implementations with safety and error handling.

Tool Use from Python

Building LLM tool use systems in Python - function calling, tool schemas, execution loops, error handling, and multi-step agent patterns.

Trajectory Evaluation

Evaluating the full action sequence, not just the final output - trajectory metrics, automatic scoring, and comparing agent versions.

Web Scraping Agents

Agent-based web scraping - handling dynamic JavaScript rendering, login flows, multi-page pagination, structured data extraction, and anti-detection techniques.

What are AI Agents?

Understand precisely what an AI agent is - the definition, the 5 key properties, the taxonomy, and why LLMs finally made agents practical.

What is MCP?

The Model Context Protocol - announced by Anthropic in November 2024 - solves the N×M integration problem by giving AI systems a standard way to connect to any tool or data source.

When to Use a Framework

The framework vs raw API decision for agents - what abstractions cost, what they provide, and a decision tree based on your actual requirements.

When to Use Agents

A decision framework for when autonomous agents are appropriate vs. when simpler approaches are better - covering cost of agency, task classification, anti-patterns, and ROI analysis.

Why Multi-Agent Systems?

The fundamental case for multi-agent: parallelization, specialization, and verification - and the honest cost of coordination overhead.