
Module 08 Projects - Concurrency

These projects are engineering specifications, not tutorials. Each spec defines exactly what the finished system must do - it is your job to figure out how to build it. Read every requirement carefully before writing a single line of code.

By the end of this module you will have built two independent, production-flavored concurrent systems: one that demonstrates thread-based and asyncio-based concurrency by crawling many URLs simultaneously, and one that demonstrates async API design by aggregating live external data with caching, rate limiting, and fault tolerance.

Project Summary

| # | Project | Concurrency Model | Key Skills | Difficulty |
|---|---|---|---|---|
| 01 | Concurrent Web Scraper | ThreadPoolExecutor + asyncio + aiohttp | Semaphores, retry with exponential backoff, timeout per request, domain rate limiting, structured output | Intermediate |
| 02 | Async Data Aggregation API | asyncio + FastAPI + asyncpg | asyncio.gather, Semaphore, wait_for, in-memory TTL cache, circuit breaker, background refresh, health endpoint | Intermediate–Advanced |

What These Projects Test

Project 01 - Concurrent Web Scraper

A command-line scraper that fetches and parses a configurable list of URLs in parallel. You will implement it twice: once with ThreadPoolExecutor (threading model) and once with asyncio and aiohttp (cooperative concurrency model). Comparing the two implementations side-by-side is the core learning goal.
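To make the threading half concrete, here is a minimal sketch of a bounded thread pool that turns every outcome, success or failure, into a structured record. The `ScrapeResult` fields, `scrape_one`, `scrape_all`, and the offline `fake_fetch` stand-in are illustrative assumptions, not part of the spec; a real implementation would call an HTTP client with an explicit timeout instead.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ScrapeResult:
    # Field names are an assumption; the spec only names the dataclass.
    url: str
    ok: bool
    status: Optional[int] = None
    title: Optional[str] = None
    error: Optional[str] = None


def scrape_one(url: str, fetch: Callable[[str], Tuple[int, str]]) -> ScrapeResult:
    """Fetch one URL and convert any failure into a result instead of raising."""
    try:
        status, title = fetch(url)
        return ScrapeResult(url=url, ok=status < 400, status=status, title=title)
    except Exception as exc:  # isolate failures so sibling URLs still complete
        return ScrapeResult(url=url, ok=False, error=f"{type(exc).__name__}: {exc}")


def scrape_all(urls: List[str], fetch: Callable[[str], Tuple[int, str]],
               max_workers: int = 4) -> List[ScrapeResult]:
    """Run scrape_one over all URLs with at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, not completion order.
        return list(pool.map(lambda u: scrape_one(u, fetch), urls))


def fake_fetch(url: str) -> Tuple[int, str]:
    """Offline stand-in for a real HTTP request (hypothetical)."""
    if "down" in url:
        raise ConnectionError("connection refused")
    return 200, f"Title of {url}"


if __name__ == "__main__":
    results = scrape_all(["https://a.example/", "https://down.example/"], fake_fetch)
    print(json.dumps([asdict(r) for r in results], indent=2))
```

Because each worker catches its own exceptions, one dead host shows up as a record with `ok: false` rather than aborting the whole run.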

Skills assessed:

  • Controlling concurrency with a semaphore to avoid overwhelming servers
  • Retry with exponential backoff - how to handle transient failures without hammering a struggling server
  • Per-request timeouts - ensuring one slow URL cannot hold up the entire scrape
  • Domain-level rate limiting - respecting politeness by spacing requests to the same domain
  • Graceful handling of connection errors, HTTP errors, and malformed HTML without crashing the scraper
  • Producing structured, machine-readable output (JSON or CSV) rather than raw print statements
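As one possible shape for the retry skill listed above, the sketch below computes capped exponential delays and retries only on error types treated as transient. The names `backoff_delays` and `retry`, the default values, and the choice of "full jitter" are assumptions for illustration; the injectable `sleep` parameter exists so the logic can be unit-tested without waiting.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ... capped at cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def retry(
    func: Callable[[], T],
    retries: int = 3,
    base: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    jitter: bool = True,
) -> T:
    """Call func(); on a retryable error wait and try again, up to `retries` retries."""
    delays = backoff_delays(retries, base=base)
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            delay = delays[attempt]
            if jitter:
                # Randomizing the wait keeps many clients from retrying in lockstep.
                delay = random.uniform(0, delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

Errors outside `retry_on` (for example a parsing bug) propagate immediately, which is usually what you want: retrying a non-transient failure only adds load.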

Project 02 - Async Data Aggregation API

A FastAPI service that calls three or more external APIs concurrently and returns an aggregated response. You will add an in-memory cache with TTL, a background task that refreshes the cache proactively, a circuit breaker that detects failing upstreams, and a health check endpoint that exposes the status of every upstream.

Skills assessed:

  • asyncio.gather for concurrent outbound calls with per-task error isolation (return_exceptions=True)
  • asyncio.Semaphore to cap concurrent external connections
  • asyncio.wait_for to enforce per-call timeouts so slow upstreams cannot stall your handler
  • In-memory TTL cache - serving cached data within its TTL while background tasks refresh it before it goes stale
  • Circuit breaker - stopping calls to a repeatedly failing upstream and recovering automatically
  • FastAPI BackgroundTasks for cache refresh without blocking the response
  • Lifespan events (asynccontextmanager) for creating and tearing down shared resources (DB pool, HTTP client)
  • Health endpoint that shows per-upstream circuit breaker state and DB connectivity
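The circuit breaker is the least familiar item in this list, so a small sketch of its state machine may help. This is one common three-state design (closed, open, half-open), not a required one; the class name, thresholds, and injectable `clock` are assumptions. As a simplification, the half-open state here admits any number of trial calls until one of them reports back.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal breaker: closed -> open -> half_open -> closed (or back to open)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # cool-down elapsed: trial calls are allowed
        return "open"

    def allow_request(self) -> bool:
        """Callers check this before contacting the upstream."""
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cool-down
```

A fetcher would call `allow_request()` first, skip the upstream when it returns `False`, and report the outcome with `record_success()` or `record_failure()`. The `state` property is also a natural value to expose per upstream in the health endpoint.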

How to Approach Each Project

1. Read the entire spec before writing any code.
2. Identify which concurrency primitive solves each sub-problem.
3. Build the smallest working version first - one URL, one upstream - before adding concurrency.
4. Add one layer at a time: timeout → retry → semaphore → structured output.
5. Test failure scenarios explicitly: kill a server, introduce latency, return 500 errors.
6. Run the acceptance criteria as a checklist - every item must pass.
7. Attempt at least one extension challenge.

Ground Rules

  • All external calls must have explicit timeouts. A call without a timeout is a hidden reliability bug.
  • Failures in one task must not crash sibling tasks. Use return_exceptions=True with gather, or try/except inside each worker.
  • Concurrency must be bounded. A semaphore or a configured max_workers is required - unbounded concurrency is a denial-of-service risk against the targets you are calling.
  • Structured output is required. Both projects produce machine-readable output (JSON or a well-defined dict structure), not raw print statements or ad-hoc strings.
  • No asyncio.sleep(0) as a substitute for real async operations. Use actual async I/O libraries (aiohttp, httpx, asyncpg) for network and database operations.
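The first three rules compose naturally in asyncio. The sketch below bounds concurrency with `asyncio.Semaphore`, gives every call its own deadline with `asyncio.wait_for`, and isolates failures with `return_exceptions=True`. The `fake_fetch` coroutine is a hypothetical offline stand-in so the example runs anywhere; in the projects it would be a real aiohttp or httpx request. The upstream names and timings are likewise made up.

```python
import asyncio
from typing import Dict


async def fake_fetch(name: str, delay: float, fail: bool = False) -> dict:
    """Stand-in for a real network call (hypothetical; replace with real async I/O)."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} refused the connection")
    return {"source": name, "ok": True}


async def bounded_fetch(sem: asyncio.Semaphore, name: str, delay: float,
                        fail: bool, timeout: float) -> dict:
    """One call: wait for a semaphore slot, then enforce a per-call timeout."""
    async with sem:
        return await asyncio.wait_for(fake_fetch(name, delay, fail), timeout=timeout)


async def aggregate(timeout: float = 0.2, max_concurrent: int = 2) -> Dict[str, dict]:
    sem = asyncio.Semaphore(max_concurrent)
    specs = [("fast", 0.01, False), ("broken", 0.01, True), ("slow", 1.0, False)]
    results = await asyncio.gather(
        *(bounded_fetch(sem, n, d, f, timeout) for n, d, f in specs),
        return_exceptions=True,  # exceptions come back as values, in input order
    )
    report: Dict[str, dict] = {}
    for (name, _, _), result in zip(specs, results):
        if isinstance(result, BaseException):
            report[name] = {"ok": False, "error": type(result).__name__}
        else:
            report[name] = result
    return report


if __name__ == "__main__":
    print(asyncio.run(aggregate()))
```

The timeout is applied inside the semaphore on purpose: time spent queuing for a slot does not count against the call's deadline. Moving `wait_for` outside the `async with` would bound total latency instead, which is an equally defensible choice.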

Project 01

scraper/
├── scraper_threads.py # ThreadPoolExecutor implementation
├── scraper_async.py # asyncio + aiohttp implementation
├── parser.py # shared HTML parsing logic (sync)
├── models.py # ScrapeResult dataclass
├── retry.py # exponential backoff utility
├── rate_limiter.py # per-domain delay tracker
├── output.py # JSON / CSV writers
└── tests/
    ├── test_retry.py
    └── test_parser.py
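For `rate_limiter.py`, one possible design is a tracker that hands out time slots per domain and tells the caller how long to wait. The class and method names, the `min_interval` default, and the injectable `clock` are assumptions for illustration. The lock makes the same object usable from the threaded scraper.

```python
import threading
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Tracks the earliest time each domain may be requested again."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._next_allowed: Dict[str, float] = {}
        self._lock = threading.Lock()

    def reserve(self, url: str) -> float:
        """Reserve the next slot for the URL's domain; return seconds to wait first."""
        domain = urlsplit(url).netloc.lower()
        with self._lock:
            now = self._clock()
            # Start at the later of "now" and the domain's next free slot.
            start = max(now, self._next_allowed.get(domain, now))
            self._next_allowed[domain] = start + self.min_interval
            return start - now
```

The threaded scraper would call `time.sleep(limiter.reserve(url))` before each request, and the async one `await asyncio.sleep(limiter.reserve(url))`. Because `reserve` only does arithmetic and never sleeps itself, it is safe to call from the event loop.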

Project 02

aggregator/
├── main.py # FastAPI app, lifespan, routes
├── fetcher.py # per-upstream fetch logic, circuit breakers
├── cache.py # in-memory TTL cache
├── background.py # cache refresh background task
├── models.py # Pydantic request/response models
├── config.py # upstream URLs, timeout config, semaphore sizes
└── tests/
    ├── test_fetcher.py
    ├── test_cache.py
    └── test_routes.py
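As a starting point for `cache.py`, here is a minimal TTL cache sketch. The class name, the `allow_stale` flag, and the injectable `clock` are assumptions rather than spec requirements. Note one simplification: `get` returns `None` on a miss, so this version cannot distinguish a miss from a cached `None` value.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire `ttl` seconds after being set."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self.ttl, value)

    def get(self, key: str, allow_stale: bool = False) -> Optional[Any]:
        """Return the value if present and unexpired (or expired with allow_stale)."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() < expires_at or allow_stale:
            return value
        return None
```

The `allow_stale` path is useful as a fallback when an upstream's circuit is open: serving an expired value, clearly marked as such in the response, is often better than serving nothing.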

Concurrency Model Comparison

Understanding when to choose threading vs asyncio is the meta-lesson of this module:

| Dimension | ThreadPoolExecutor | asyncio + async I/O |
|---|---|---|
| Concurrency unit | OS thread | Coroutine |
| Memory per unit | Up to ~8 MB of reserved stack (typical Linux default, mostly virtual memory) | ~few KB |
| Max practical concurrency | ~50–200 threads | ~10,000+ coroutines |
| GIL impact | Limits CPU parallelism | Irrelevant (single thread) |
| Good for | Wrapping legacy sync libraries | High-concurrency I/O-bound work |
| Blocking call in worker | Blocks only that thread | Blocks the ENTIRE event loop |
| Learning curve | Lower | Higher |

Both projects require you to choose the right model and justify the choice. The scraper project has you implement both so you can compare them empirically.
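The "Blocking call in worker" row is the one that bites most often in practice, and it is easy to observe. In the sketch below, a heartbeat task counts how many times it gets to run while a blocking function executes, first directly on the event loop and then offloaded with `asyncio.to_thread` (Python 3.9+). The function names and timings are illustrative assumptions; `time.sleep` stands in for any synchronous call, such as a sync HTTP client or a slow parser.

```python
import asyncio
import time
from typing import List


def blocking_work(seconds: float) -> str:
    """Any synchronous, blocking call; time.sleep is only a stand-in."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: List[int], stop: asyncio.Event, interval: float = 0.01) -> None:
    """Append a tick every `interval` seconds for as long as the loop lets us run."""
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(interval)


async def count_ticks(offload: bool) -> int:
    ticks: List[int] = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    if offload:
        # Runs in a worker thread; the event loop stays free for the heartbeat.
        await asyncio.to_thread(blocking_work, 0.2)
    else:
        # Runs on the event loop thread; no other task can be scheduled meanwhile.
        blocking_work(0.2)
    stop.set()
    await hb
    return len(ticks)


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(count_ticks(offload=False)))
    print("ticks while offloaded:        ", asyncio.run(count_ticks(offload=True)))
```

In the direct case the heartbeat never gets a turn until the blocking call has finished, while the offloaded case lets it tick throughout. This is why the shared, synchronous `parser.py` is worth a second look in the asyncio scraper if parsing ever becomes slow.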

© 2026 EngineersOfAI. All rights reserved.