Module 08 Projects - Concurrency
These projects are engineering specifications, not tutorials. Each spec defines exactly what the finished system must do - it is your job to figure out how to build it. Read every requirement carefully before writing a single line of code.
By the end of this module you will have built two independent, production-flavored concurrent systems: one that demonstrates threading-based and asyncio-based concurrency by crawling many URLs simultaneously, and one that demonstrates async API design by aggregating live external data with caching, rate limiting, and fault tolerance.
Project Summary
| # | Project | Concurrency Model | Key Skills | Difficulty |
|---|---|---|---|---|
| 01 | Concurrent Web Scraper | ThreadPoolExecutor + asyncio + aiohttp | Semaphores, retry with exponential backoff, timeout per request, domain rate limiting, structured output | Intermediate |
| 02 | Async Data Aggregation API | asyncio + FastAPI + asyncpg | asyncio.gather, Semaphore, wait_for, in-memory TTL cache, circuit breaker, background refresh, health endpoint | Intermediate–Advanced |
What These Projects Test
Project 01 - Concurrent Web Scraper
A command-line scraper that fetches and parses a configurable list of URLs in parallel. You will implement it twice: once with ThreadPoolExecutor (threading model) and once with asyncio and aiohttp (cooperative concurrency model). Comparing the two implementations side-by-side is the core learning goal.
Skills assessed:
- Controlling concurrency with a semaphore to avoid overwhelming servers
- Retry with exponential backoff - how to handle transient failures without hammering a struggling server
- Per-request timeouts - ensuring one slow URL cannot hold up the entire scrape
- Domain-level rate limiting - respecting politeness by spacing requests to the same domain
- Graceful handling of connection errors, HTTP errors, and malformed HTML without crashing the scraper
- Producing structured, machine-readable output (JSON or CSV) rather than raw print statements
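As a starting point for the retry requirement, here is a minimal sketch of exponential backoff with jitter. The function name `retry_with_backoff`, its parameters, and the choice of retryable exceptions are illustrative assumptions, not part of the spec; the `sleep` argument is injectable only so the helper can be tested without real waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,          # hypothetical default
    base_delay: float = 0.5,        # hypothetical default, in seconds
    max_delay: float = 8.0,         # cap so delays do not grow without bound
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... (capped).
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random amount up to the bound so that many
            # clients retrying at once do not hit the server in lockstep.
            sleep(random.uniform(0, upper))
    raise AssertionError("unreachable")
```

Only transient errors are retried here; anything outside `retry_on` (for example a parsing bug) propagates immediately, which is usually what you want.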
Project 02 - Async Data Aggregation API
A FastAPI service that calls three or more external APIs concurrently and returns an aggregated response. You will add an in-memory cache with TTL, a background task that refreshes the cache proactively, a circuit breaker that detects failing upstreams, and a health check endpoint that exposes the status of every upstream.
Skills assessed:
- `asyncio.gather` for concurrent outbound calls with per-task error isolation (`return_exceptions=True`)
- `asyncio.Semaphore` to cap concurrent external connections
- `asyncio.wait_for` to enforce per-call timeouts so slow upstreams cannot stall your handler
- In-memory TTL cache - serving slightly stale but still-valid data while background tasks refresh it
- Circuit breaker - stopping calls to a repeatedly failing upstream and recovering automatically
- FastAPI `BackgroundTasks` for cache refresh without blocking the response
- Lifespan events (`asynccontextmanager`) for creating and tearing down shared resources (DB pool, HTTP client)
- Health endpoint that shows per-upstream circuit breaker state and DB connectivity
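The first three skills above tend to be used together. The sketch below shows one way to combine them; `fetch_all`, the shape of the `upstreams` mapping, and the result dictionaries are assumptions made for illustration, and the upstream callables stand in for real HTTP calls.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def fetch_all(
    upstreams: Dict[str, Callable[[], Awaitable[Any]]],
    max_concurrent: int = 2,   # hypothetical cap on in-flight calls
    timeout: float = 1.0,      # hypothetical per-call timeout, in seconds
) -> Dict[str, Dict[str, Any]]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(call: Callable[[], Awaitable[Any]]) -> Any:
        async with semaphore:
            # The timeout starts only once a slot is held, so time spent
            # queueing for the semaphore does not count against the call.
            return await asyncio.wait_for(call(), timeout=timeout)

    names = list(upstreams)
    # return_exceptions=True turns each failure into a returned value,
    # so one failing upstream cannot cancel or crash its siblings.
    results = await asyncio.gather(
        *(guarded(upstreams[name]) for name in names),
        return_exceptions=True,
    )

    aggregated: Dict[str, Dict[str, Any]] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            aggregated[name] = {"ok": False, "error": type(result).__name__}
        else:
            aggregated[name] = {"ok": True, "data": result}
    return aggregated
```

A slow upstream shows up as an entry with `"error": "TimeoutError"` while the fast ones still return their data, which is the isolation behaviour the spec asks for.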
How to Approach Each Project
1. Read the entire spec before writing any code.
2. Identify which concurrency primitive solves each sub-problem.
3. Build the smallest working version first - one URL, one upstream - before adding concurrency.
4. Add one layer at a time: timeout → retry → semaphore → structured output.
5. Test failure scenarios explicitly: kill a server, introduce latency, return 500 errors.
6. Run the acceptance criteria as a checklist - every item must pass.
7. Attempt at least one extension challenge.
Ground Rules
- All external calls must have explicit timeouts. A call without a timeout is a hidden reliability bug.
- Failures in one task must not crash sibling tasks. Use `return_exceptions=True` with `gather`, or `try/except` inside each worker.
- Concurrency must be bounded. A semaphore or a configured `max_workers` is required - unbounded concurrency is a denial-of-service risk against the targets you are calling.
- Structured output is required. Both projects produce machine-readable output (JSON or a well-defined dict structure), not raw print statements or ad-hoc strings.
- No `asyncio.sleep(0)` as a substitute for real async operations. Use actual async I/O libraries (`aiohttp`, `httpx`, `asyncpg`) for network and database operations.
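For the threading model, the error-isolation, bounded-concurrency, and structured-output rules can be sketched in a few lines. `run_bounded` and the result keys are hypothetical names chosen for this example; `work` stands in for a real fetch function, which would also need its own explicit timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_bounded(
    items: List[str],
    work: Callable[[str], Any],
    max_workers: int = 4,  # hypothetical default; this is the concurrency bound
) -> List[Dict[str, Any]]:
    def safe(item: str) -> Dict[str, Any]:
        # try/except lives inside the worker, so an exception becomes a
        # structured result instead of propagating out of the pool.
        try:
            return {"item": item, "ok": True, "value": work(item)}
        except Exception as exc:
            return {"item": item, "ok": False, "error": f"{type(exc).__name__}: {exc}"}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(safe, items))
```

Every input produces exactly one result dict, so the caller can serialize the whole list without special-casing failures.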
Folder Structure (Recommended)
Project 01
scraper/
├── scraper_threads.py # ThreadPoolExecutor implementation
├── scraper_async.py # asyncio + aiohttp implementation
├── parser.py # shared HTML parsing logic (sync)
├── models.py # ScrapeResult dataclass
├── retry.py # exponential backoff utility
├── rate_limiter.py # per-domain delay tracker
├── output.py # JSON / CSV writers
└── tests/
├── test_retry.py
└── test_parser.py
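The per-domain delay tracker in rate_limiter.py could start from something like the sketch below, written for the threaded scraper. The class name `DomainRateLimiter`, its `wait` method, and the injectable `clock` and `sleep` arguments are assumptions for illustration; the spec itself decides the required interval.

```python
import threading
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Spaces requests to the same domain at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,  # hypothetical default
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Dict[str, float] = {}
        self._lock = threading.Lock()

    def wait(self, url: str) -> float:
        """Block until this URL's domain may be hit; return the delay applied."""
        domain = urlparse(url).netloc
        with self._lock:
            now = self._clock()
            # Reserve the next free slot for this domain while holding the lock,
            # so two threads can never claim the same slot.
            slot = max(now, self._next_allowed.get(domain, now))
            self._next_allowed[domain] = slot + self._min_interval
        delay = slot - now
        if delay > 0:
            # Sleep outside the lock so other domains are not held up.
            self._sleep(delay)
        return delay
```

Each worker would call `limiter.wait(url)` immediately before issuing its request. An asyncio version would follow the same idea with `asyncio.Lock` and `asyncio.sleep`.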
Project 02
aggregator/
├── main.py # FastAPI app, lifespan, routes
├── fetcher.py # per-upstream fetch logic, circuit breakers
├── cache.py # in-memory TTL cache
├── background.py # cache refresh background task
├── models.py # Pydantic request/response models
├── config.py # upstream URLs, timeout config, semaphore sizes
└── tests/
├── test_fetcher.py
├── test_cache.py
└── test_routes.py
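For the circuit breakers that live in fetcher.py, a minimal state machine might look like the following. This is a sketch under stated assumptions: the class and method names, the thresholds, and the simplified half-open behaviour (requests are allowed again once the timeout elapses, rather than a single trial request) are all illustrative choices, not requirements from the spec.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,    # hypothetical: failures before opening
        reset_timeout: float = 30.0,   # hypothetical: seconds before retrying
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # cool-down elapsed: allow a probe
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # (Re)open the circuit; a failed probe restarts the cool-down.
            self._opened_at = self._clock()
```

The fetcher would check `allow_request()` before each upstream call and report the outcome with `record_success()` or `record_failure()`; the `state` property is what a health endpoint could expose per upstream.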
Concurrency Model Comparison
Understanding when to choose threading vs asyncio is the meta-lesson of this module:
| Dimension | ThreadPoolExecutor | asyncio + async I/O |
|---|---|---|
| Concurrency unit | OS thread | Coroutine |
| Memory per unit | Up to ~8 MB reserved stack (typical Linux default, mostly virtual memory) | ~few KB |
| Max practical concurrency | ~50–200 threads | ~10,000+ coroutines |
| GIL impact | Limits CPU parallelism | Irrelevant (single thread) |
| Good for | Wrapping legacy sync libraries | High-concurrency I/O-bound work |
| Blocking call in worker | Blocks only that thread | Blocks the ENTIRE event loop |
| Learning curve | Lower | Higher |
Both projects require you to choose the right model and justify the choice. The scraper project has you implement both so you can compare them empirically.
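The "Blocking call in worker" row is the one that most often bites in practice, so it is worth seeing once. The sketch below counts how often a heartbeat coroutine gets to run while a 0.2 second blocking call is in progress; the helper names are hypothetical, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def count_ticks_during(run_blocking: Callable[[], Awaitable[None]]) -> int:
    """Return how many heartbeat ticks happen while run_blocking() executes."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await run_blocking()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)  # let the cancel settle
    return ticks


async def blocked() -> None:
    # A synchronous sleep inside a coroutine never yields to the event loop,
    # so no other coroutine can run until it returns.
    time.sleep(0.2)


async def offloaded() -> None:
    # The same blocking call, pushed onto a worker thread; the loop stays free.
    await asyncio.to_thread(time.sleep, 0.2)


if __name__ == "__main__":
    print("ticks while blocked:  ", asyncio.run(count_ticks_during(blocked)))
    print("ticks while offloaded:", asyncio.run(count_ticks_during(offloaded)))
```

In the blocked case the heartbeat never gets a chance to tick, while the offloaded case ticks throughout. The same reasoning applies to any synchronous library call placed directly in an async handler.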
