The GIL Explained - What It Is, What It Isn't, and How to Work Around It
Reading time: ~30 minutes | Level: Intermediate → Engineering
Before reading further, predict the output:
import threading

counter = 0

def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # ?
Show Answer
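The output is not guaranteed to be 2,000,000. On many CPython versions it is a non-deterministic number below 2,000,000 - something like 1,387,241 or 1,823,904. Recent releases (3.10 and later) check for pending thread switches at fewer points, so this exact loop may happen to print 2,000,000 on your machine, but nothing in the language or the interpreter guarantees it.
Many engineers expect the GIL to protect this. The GIL does prevent two threads from executing Python bytecodes simultaneously. But counter += 1 is not one bytecode - it compiles to four: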
LOAD_GLOBAL counter # read counter's current value
LOAD_CONST 1 # push 1 onto the stack
BINARY_OP += # add them (in-place add)
STORE_GLOBAL counter # write the result back
Nothing makes these four bytecodes execute as a unit: a thread switch can occur partway through the sequence. Thread 1 can read counter = 500, then lose the GIL to Thread 2, which also reads counter = 500, increments to 501, and writes it back. Thread 1 then resumes with its stale value of 500, increments to 501, and overwrites Thread 2's update. One increment is silently lost.
The GIL is not a substitute for application-level locking.
This is one of the most misunderstood aspects of Python. Engineers build multithreaded services expecting the GIL to protect shared state, only to discover race conditions in production under load. Understanding the GIL at bytecode depth - what it guards, what it does not, when it releases, and how to achieve real parallelism - is essential for writing correct concurrent Python.
What You Will Learn
- What the GIL is: a mutex protecting CPython's internal state
- Why it exists: CPython's reference counting is not thread-safe without it
- What the GIL does NOT protect: your application-level data structures
- How counter += 1 compiles to 4 bytecodes and why that matters
- The switch interval: sys.getswitchinterval() and how to tune it
- Why I/O releases the GIL and why threading works for I/O-bound tasks
- CPU-bound vs I/O-bound: the GIL is irrelevant for one and harmful for the other
- multiprocessing: separate processes, separate GILs, real CPU parallelism
- C extensions (NumPy, pandas, Pillow) that release the GIL for true parallelism
- Python 3.13 free-threaded mode: current status and tradeoffs
Prerequisites
- Lesson 02: Bytecode Inspection - you need to understand that Python code compiles to bytecodes
- Lesson 03: Disassembly with dis - reading bytecode output
- Familiarity with threading.Thread basics
Part 1 - What the GIL Is
The GIL Defined
The Global Interpreter Lock (GIL) is a mutex - a mutual exclusion lock - that CPython acquires before executing any Python bytecode and releases under specific conditions. Only one thread can hold the GIL at a time. Only the thread holding the GIL can execute Python bytecodes.
The GIL is not a Python language feature. It is an implementation detail of CPython - the reference interpreter written in C. Other Python implementations take different approaches: Jython and IronPython have no GIL, while PyPy keeps a GIL of its own (an experimental STM branch explored removing it).
Why the GIL Exists
CPython's memory management is built on reference counting. Every Python object carries a reference count (ob_refcnt). When you assign a variable, the count increments. When the variable goes out of scope, it decrements. When the count reaches zero, the object is freed.
Reference counting requires reads and writes to ob_refcnt on every object access. Without a global lock, two threads modifying the same object's reference count simultaneously would corrupt it - leading to use-after-free bugs, double frees, and memory corruption at the C level.
# Every one of these operations touches ob_refcnt internally
a = some_object # ob_refcnt += 1
b = a # ob_refcnt += 1
del a # ob_refcnt -= 1; if 0: free memory
result = func(b) # ob_refcnt += 1 (passing b increments it)
The GIL ensures these refcount operations are serialized. It also protects CPython's memory allocator, the bytecode execution loop, and internal data structures like dictionaries and lists from concurrent modification at the C level.
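The reference-count traffic described above can be observed from Python with sys.getrefcount(). The following is a minimal sketch, assuming a standard CPython build; the absolute numbers depend on the interpreter, so only the differences between readings are compared.

```python
import sys

obj = object()                 # a fresh object that nothing else references
base = sys.getrefcount(obj)    # includes a temporary reference held by the call itself

alias = obj                    # binding a second name adds one reference
after_alias = sys.getrefcount(obj)

container = [obj, obj]         # each list slot holds its own reference
after_list = sys.getrefcount(obj)

del alias                      # unbinding a name removes one reference
container.clear()              # emptying the list removes two more
after_cleanup = sys.getrefcount(obj)

# Differences between readings: +1 for the alias, +2 for the list, back to baseline
print(after_alias - base, after_list - after_alias, after_cleanup - base)  # 1 2 0
```

Every one of these adjustments is a read-modify-write on ob_refcnt, which is the state the GIL serializes.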
What the GIL Protects
The GIL protects:
- CPython's internal reference counts - the ob_refcnt field on every PyObject
- CPython's memory allocator - pymalloc is not thread-safe without external serialization
- CPython's internal data structures - the bytecode interpreter loop, import machinery, sys.modules
- Certain Python built-in operations - list .append() is thread-safe because it's a single C-level operation that happens to be atomic under the GIL
What the GIL Does NOT Protect
The GIL does not protect:
- Your application-level data structures - dictionaries, lists, counters, flags you write in Python
- Multi-step Python operations - any operation that compiles to more than one bytecode
- Logic that spans multiple Python statements - check-then-act patterns, read-modify-write
This is the core misunderstanding. Developers see "GIL" and assume thread safety. They are wrong for anything beyond single-bytecode operations.
Part 2 - The GIL Release Points
The Check Interval
The GIL is not held indefinitely. CPython releases it periodically to give other threads a chance to run. The interval is controlled by sys.getswitchinterval():
import sys
print(sys.getswitchinterval()) # 0.005 - default 5 milliseconds
# You can change it (rarely a good idea in production)
sys.setswitchinterval(0.001) # 1ms - more frequent switching
sys.setswitchinterval(0.1) # 100ms - less frequent switching
By default, a thread that wants the GIL waits up to 5ms. If the GIL has not been released by then, the waiting thread sets a flag asking the running thread to give it up, and the running thread releases the GIL at its next check point, allowing the other thread to acquire it and execute.
Before Python 3.2, the check interval was measured in bytecodes (every 100 bytecodes). Python 3.2 switched to time-based intervals. The time-based approach is more predictable and reduces contention on I/O-heavy workloads.
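If you do experiment with the switch interval, it is a process-wide setting, so it is worth restoring afterwards. The sketch below wraps the change in a context manager; switch_interval is a hypothetical helper name, not part of the standard library.

```python
import sys
from contextlib import contextmanager

@contextmanager
def switch_interval(seconds):
    """Temporarily change the switch interval, restoring the old value on exit."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        sys.setswitchinterval(previous)

original = sys.getswitchinterval()
with switch_interval(0.001):
    inside = sys.getswitchinterval()   # 1ms while the block is active
restored = sys.getswitchinterval()     # back to the original value

print(original, inside, restored)
```

Scoping the change this way keeps an experiment in one code path from silently altering scheduling for the rest of the process.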
I/O Operations Release the GIL
The most important GIL release point is I/O. Any time a Python thread performs a blocking I/O operation - reading from a file, making a network request, waiting on a socket - it releases the GIL before the system call and reacquires it after:
import threading
import urllib.request

# Both threads release the GIL during the HTTP request
# They execute truly concurrently at the OS level
def fetch(url):
    with urllib.request.urlopen(url) as response:  # GIL released here
        return response.read()

t1 = threading.Thread(target=fetch, args=("https://httpbin.org/delay/1",))
t2 = threading.Thread(target=fetch, args=("https://httpbin.org/delay/1",))
t1.start(); t2.start()
t1.join(); t2.join()
# Takes ~1 second, not ~2 seconds - true concurrent I/O
During the urlopen() call, Thread 1 releases the GIL and blocks in the kernel waiting for the network. Thread 2 acquires the GIL and starts its own request. Both requests are in-flight simultaneously at the OS level. This is why threading works well for I/O-bound tasks.
The GIL and time.sleep()
time.sleep() also releases the GIL:
import threading, time

def worker(n):
    time.sleep(1)  # GIL released during sleep - other threads run
    print(f"Worker {n} done")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
# All 5 workers sleep concurrently - total time ~1s, not ~5s
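To check the "~1s, not ~5s" claim rather than take it on faith, the same pattern can be timed. This is a small sketch using ThreadPoolExecutor with a shorter sleep; blocking_call is a hypothetical stand-in for any blocking operation that releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_call(n):
    time.sleep(0.2)   # stands in for a blocking read; the GIL is released while sleeping
    return n

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    outputs = list(pool.map(blocking_call, range(5)))
elapsed = time.perf_counter() - start

# Five 0.2s waits overlap, so the total is close to 0.2s rather than 1.0s
print(outputs, f"{elapsed:.2f}s")
```

The exact elapsed time depends on the machine, but it should land far closer to one sleep than to five.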
Part 3 - Why counter += 1 Loses Updates
Bytecode-Level Analysis
Let's disassemble the increment function to see exactly what bytecodes run:
import dis
def increment():
    global counter
    counter += 1
dis.dis(increment)
Output (Python 3.12; the leading RESUME instruction and byte offsets are omitted for readability):
  3  LOAD_GLOBAL     0 (counter)
     LOAD_CONST      1 (1)
     BINARY_OP      13 (+=)
     STORE_GLOBAL    0 (counter)
     RETURN_CONST    0 (None)
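Four bytecodes (plus the return) execute sequentially, and a thread switch can land between the load and the store. One losing interleaving looks like this:
- Thread 1: LOAD_GLOBAL counter (reads 500), then loses the GIL
- Thread 2: LOAD_GLOBAL counter (reads 500), BINARY_OP (computes 501), STORE_GLOBAL (writes 501)
- Thread 1 resumes: BINARY_OP (computes 501 from its stale 500), STORE_GLOBAL (writes 501)
Thread 1 read 500, lost the GIL, Thread 2 read the same 500, both computed 501, and both wrote 501. One increment was lost. This can happen whenever a thread switch falls between LOAD_GLOBAL and STORE_GLOBAL.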
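Because the natural race depends on timing, it can be hard to reproduce on demand. The sketch below forces the losing interleaving deterministically: a threading.Barrier makes both threads finish their read before either one writes. The function name racy_increment is illustrative only.

```python
import threading

counter = 500
barrier = threading.Barrier(2)

def racy_increment():
    global counter
    local = counter        # step 1: read the shared value
    barrier.wait()         # both threads have now read 500; neither has written yet
    counter = local + 1    # step 2: write back stale value + 1

threads = [threading.Thread(target=racy_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 501, not 502: one increment was lost
```

The barrier is only a teaching device. In real code the same gap between read and write exists without it; it is just hit less predictably.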
The Fix: threading.Lock
To make the counter thread-safe, wrap the read-modify-write in a threading.Lock:
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(1_000_000):
        with lock:
            counter += 1  # only one thread executes this at a time

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # always 2,000,000
Or use threading.local() for per-thread state, or redesign to avoid shared mutable state entirely.
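As one illustration of the "avoid shared mutable state" option, each thread can accumulate into a private variable and publish its result exactly once, with the totals combined after join(). This is a sketch under the assumption that the work splits cleanly per thread; count_chunk and partials are hypothetical names.

```python
import threading

def count_chunk(n, results, index):
    local_total = 0               # private to this thread, so no lock is needed
    for _ in range(n):
        local_total += 1
    results[index] = local_total  # each thread writes only to its own slot

partials = [0, 0]
threads = [
    threading.Thread(target=count_chunk, args=(100_000, partials, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(partials)             # combine only after all threads have finished
print(total)  # 200000
```

This avoids taking a lock on every iteration, which also removes most of the lock overhead of the previous version.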
:::danger The GIL Does NOT Protect Your Application-Level Data Structures
A threading.Lock is required for any shared mutable state your Python code reads and writes across threads. The GIL only protects CPython internals. counter += 1, dict[key] = value after a check, any multi-step operation - these are all race conditions without a Lock.
# This looks safe but is NOT - two threads can both pass the check
# before either writes, leading to duplicate processing
if key not in results:  # LOAD_GLOBAL, CONTAINS_OP...
    results[key] = compute(key)  # STORE_SUBSCR

# Safe version
with lock:
    if key not in results:
        results[key] = compute(key)
:::
Part 4 - CPU-Bound vs I/O-Bound
The Fundamental Split
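The GIL's impact depends entirely on what your program spends time doing: CPU-bound Python code is serialized by the GIL, while I/O-bound code spends most of its time waiting with the GIL released.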
CPU-Bound: Threading Makes It Worse
import threading, time

def cpu_work(n):
    """Pure CPU - no I/O, no sleep."""
    result = 0
    for i in range(n):
        result += i * i
    return result

# Sequential: each call gets full CPU
start = time.perf_counter()
cpu_work(10_000_000)
cpu_work(10_000_000)
sequential_time = time.perf_counter() - start

# Threaded: GIL forces serialization + adds overhead
start = time.perf_counter()
t1 = threading.Thread(target=cpu_work, args=(10_000_000,))
t2 = threading.Thread(target=cpu_work, args=(10_000_000,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.3f}s")
print(f"Threaded: {threaded_time:.3f}s")
# Illustrative output (timings vary by machine and Python version):
# Sequential: 1.234s
# Threaded: 1.487s ← SLOWER due to GIL contention overhead
Two threads competing for the GIL on CPU-bound work are often slower than sequential execution - each GIL handoff has overhead, and threads waste cycles waiting.
I/O-Bound: Threading Shines
import threading, time, urllib.request

URLs = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]
results = []

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as r:
        results.append(r.status)

# Sequential: each request waits for the previous
start = time.perf_counter()
for url in URLs:
    fetch(url)
print(f"Sequential: {time.perf_counter() - start:.1f}s")  # ~4.0s

# Threaded: all requests in-flight simultaneously
results.clear()
start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(url,)) for url in URLs]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.perf_counter() - start:.1f}s")  # ~1.0s
:::warning Threading Does NOT Make CPU-Bound Python Code Faster
Adding threads to CPU-bound Python code is one of the most common performance mistakes. Due to the GIL, only one thread executes Python bytecodes at a time. Multiple CPU-bound threads compete for the GIL, slow each other down with lock contention, and often run slower than sequential code. Always profile before threading CPU-bound work.
:::
:::note asyncio Is Not About Parallelism
asyncio is cooperative concurrency on a single thread. There is no parallelism, no GIL contention, and no thread overhead. The event loop runs coroutines one at a time, switching between them at await points. asyncio is excellent for I/O-bound tasks with many concurrent connections (thousands of HTTP requests, WebSocket connections, database queries). It is the wrong tool for CPU-bound work.
import asyncio
import aiohttp  # third-party package

urls = ["https://httpbin.org/delay/1"] * 4

async def fetch(session, url):
    async with session.get(url) as response:  # cooperative yield here
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)  # all run concurrently

results = asyncio.run(main())
:::
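The single-thread claim is easy to verify with the standard library alone. The sketch below replaces real network calls with asyncio.sleep() and records which OS thread each coroutine runs on; fake_io is a hypothetical stand-in for an I/O-bound coroutine.

```python
import asyncio
import threading
import time

thread_ids = set()

async def fake_io(name, delay):
    thread_ids.add(threading.get_ident())  # record the OS thread running this coroutine
    await asyncio.sleep(delay)             # yields to the event loop while "waiting"
    return name

async def main():
    return await asyncio.gather(
        fake_io("a", 0.2), fake_io("b", 0.2), fake_io("c", 0.2)
    )

start = time.perf_counter()
names = asyncio.run(main())
elapsed = time.perf_counter() - start

# Three 0.2s waits overlap, yet every coroutine ran on the same thread
print(names, len(thread_ids), f"{elapsed:.2f}s")
```

The waits overlap because the coroutines take turns at await points, not because anything runs in parallel.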
Part 5 - Working Around the GIL
Multiprocessing: Separate Processes, Separate GILs
The canonical solution for CPU-bound parallelism in Python is multiprocessing. Each process has its own Python interpreter, its own GIL, and its own memory space. True CPU parallelism is achieved:
from multiprocessing import Pool
import time

def cpu_work(n):
    result = 0
    for i in range(n):
        result += i * i
    return result

if __name__ == "__main__":
    # Using Pool.map: distribute work across CPU cores
    start = time.perf_counter()
    with Pool(processes=4) as pool:
        results = pool.map(cpu_work, [5_000_000] * 4)
    print(f"Multiprocessing: {time.perf_counter() - start:.3f}s")
    # Illustrative, on a 4-core machine: ~0.5s vs ~2.0s sequential
concurrent.futures.ProcessPoolExecutor provides a higher-level interface with the same parallelism:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time

def process_chunk(data):
    return sum(x * x for x in data)

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]  # split into 4 chunks

if __name__ == "__main__":
    # ProcessPoolExecutor: real CPU parallelism
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))

    # ThreadPoolExecutor: good for I/O, not CPU
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))
:::tip For CPU-Bound Parallelism: Use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor
These are the correct tools for CPU-bound work in Python. The overhead of process creation is real (often tens to hundreds of milliseconds, depending on the platform and start method), but it is a one-time cost amortized across the workload. For long-running batch jobs, the parallelism gain far exceeds the startup cost. Use ProcessPoolExecutor in new code - it has a cleaner API and integrates with asyncio through loop.run_in_executor().
:::
Shared Memory Between Processes
Separate processes cannot share Python objects directly. For communication, use multiprocessing primitives:
from multiprocessing import Process, Value, Array, Manager
import ctypes

# Value: a single shared value (uses shared memory - fast)
def increment_shared(counter, n):
    for _ in range(n):
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":  # required under the spawn start method
    counter = Value(ctypes.c_int, 0)
    p1 = Process(target=increment_shared, args=(counter, 100_000))
    p2 = Process(target=increment_shared, args=(counter, 100_000))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print(counter.value)  # 200,000 - correct with lock

    # Array: shared array (fast, uses shared memory)
    shared_array = Array(ctypes.c_double, [0.0] * 10)

    # Manager: arbitrary Python objects (slower - uses pickling + proxy objects)
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_list = manager.list()
Use Value and Array for performance-critical shared state (they use actual shared memory). Use Manager for convenience when sharing complex Python objects (dicts, lists) at the cost of pickling overhead.
C Extensions That Release the GIL
Many scientific Python libraries release the GIL during expensive computations, allowing true CPU parallelism in threads:
import numpy as np
import threading
import time

# NumPy releases the GIL during array operations
def numpy_work(size):
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return np.dot(a, b)  # BLAS routines release the GIL

# These two matrix multiplications can run in TRUE parallel
start = time.perf_counter()
t1 = threading.Thread(target=numpy_work, args=(500,))
t2 = threading.Thread(target=numpy_work, args=(500,))
t1.start(); t2.start()
t1.join(); t2.join()
parallel_time = time.perf_counter() - start

start = time.perf_counter()
numpy_work(500)
numpy_work(500)
sequential_time = time.perf_counter() - start

print(f"Parallel (threads + NumPy): {parallel_time:.3f}s")
print(f"Sequential: {sequential_time:.3f}s")
# With NumPy, threaded is often faster - the GIL is not held during BLAS calls
# (results vary: many BLAS builds already use several threads internally)
Libraries that release the GIL during heavy operations:
- NumPy - array math, linear algebra (BLAS/LAPACK routines)
- pandas - many operations delegate to NumPy
- Pillow - image encoding/decoding, pixel operations
- lxml - XML parsing
- cryptography - hash operations, encryption
- SQLite (sqlite3) - query execution releases the GIL
This is why data science workflows using threading with NumPy workloads can achieve genuine parallelism.
Part 6 - Python 3.13 Free-Threaded Mode
The No-GIL Build
Python 3.13 introduced an experimental free-threaded build (PEP 703) - a CPython build with the GIL disabled. It is an opt-in compile flag (--disable-gil), not the default.
import sys
# Check if running in free-threaded mode (sys._is_gil_enabled() was added in Python 3.13)
print(sys._is_gil_enabled()) # False in free-threaded build, True in standard CPython
Free-threaded Python uses per-object locks and biased reference counting (a technique that, like biased locking, gives the thread that owns an object a fast non-atomic path) to make reference counting thread-safe without a global lock.
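Since sys._is_gil_enabled() does not exist before Python 3.13, code that must run on several versions needs a guarded check. The following is a minimal sketch; gil_status is a hypothetical helper, and it assumes the Py_GIL_DISABLED build variable that free-threaded builds expose through sysconfig.

```python
import sys
import sysconfig

def gil_status():
    """Report whether this build supports free threading and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; it is 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Interpreters older than 3.13 lack the function and always run with a GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

status = gil_status()
print(status)
```

The two values can differ: a free-threaded build may still run with the GIL enabled, for example when it is re-enabled at runtime.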
Current Status (Python 3.13, 2024)
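- Available as an experimental opt-in (python3.13t binary in some distributions)
- Single-threaded code is noticeably slower: the CPython documentation cites roughly 40% overhead on the pyperformance suite for 3.13, largely because the specializing adaptive interpreter is disabled in this build; later releases are expected to narrow the gap
- Many C extensions must be updated to work correctly without the GIL
- NumPy, pandas, Cython have ongoing work to support free-threaded builds
- Not recommended for production use as of Python 3.13 - too many ecosystem compatibility issues
- The goal is full stability and ecosystem support over the next several releases (estimates such as Python 3.15-3.16 are projections, not commitments)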
What It Means for Your Code
When free-threaded mode becomes stable:
# Today: the GIL prevents these threads from running Python code in parallel
# With free-threaded: they WILL run in parallel - and expose all your race conditions
import threading

shared_dict = {}

def update_dict(key, value):
    # A single assignment stays internally consistent in both builds
    # (free-threaded CPython protects each dict with its own lock)
    shared_dict[key] = value

def bump(key):
    # Compound read-modify-write: already racy under the GIL,
    # and far more likely to lose updates without it
    shared_dict[key] = shared_dict.get(key, 0) + 1

# Write thread-safe code NOW - it's correct under the GIL and
# will remain correct in free-threaded Python
The lesson: write thread-safe code regardless of the GIL. Applications that relied on the GIL for implicit serialization will have race conditions in free-threaded Python.
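A compound dictionary update that stays correct with or without the GIL might look like the sketch below. HitCounter is a hypothetical class used only for illustration; the point is that the whole read-modify-write sits inside one lock.

```python
import threading

class HitCounter:
    """Per-key counter whose read-modify-write is guarded by a single lock."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def hit(self, key):
        with self._lock:  # read, add, and store happen as one unit
            self._counts[key] = self._counts.get(key, 0) + 1

    def snapshot(self):
        with self._lock:  # copy under the lock so callers get a consistent view
            return dict(self._counts)

hits = HitCounter()

def worker():
    for _ in range(10_000):
        hits.hit("page")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(hits.snapshot())  # {'page': 40000}
```

Keeping the lock private to the class means callers cannot forget to take it.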
Part 7 - A Production-Correct Concurrent Pattern
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from typing import Callable, TypeVar, Iterable
import time

T = TypeVar("T")
R = TypeVar("R")

def parallel_map(
    func: Callable[[T], R],
    items: Iterable[T],
    *,
    mode: str = "auto",
    max_workers: int | None = None,
) -> list[R]:
    """
    Apply func to each item in parallel, choosing the right executor.
    mode="io" → ThreadPoolExecutor (I/O-bound work)
    mode="cpu" → ProcessPoolExecutor (CPU-bound work)
    mode="auto" → ThreadPoolExecutor (safe default; use "cpu" explicitly)
    Production use: API fan-out, batch database queries (io),
    image processing, data parsing (cpu).
    """
    if mode in ("io", "auto"):
        executor_cls = ThreadPoolExecutor
    elif mode == "cpu":
        executor_cls = ProcessPoolExecutor
    else:
        raise ValueError(f"mode must be 'io', 'cpu', or 'auto', got {mode!r}")
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))

# I/O-bound: fetch user profiles from an API
def fetch_user(user_id: int) -> dict:
    time.sleep(0.1)  # simulates HTTP request latency
    return {"id": user_id, "name": f"User {user_id}"}

# CPU-bound: compress images, parse large JSON files
def heavy_compute(n: int) -> int:
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required for ProcessPoolExecutor; also keeps
    # the I/O demo from re-running when worker processes re-import this module
    users = parallel_map(fetch_user, range(20), mode="io", max_workers=10)
    print(f"Fetched {len(users)} users")
    results = parallel_map(heavy_compute, [500_000] * 8, mode="cpu")
    print(f"Computed {len(results)} results")
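One limitation of executor.map() is that iterating its results re-raises the first exception, so the outcomes of the other items are lost to the caller. When partial failure is expected, a variant built on submit() and as_completed() can collect results and errors separately. This is a sketch, not a replacement for the function above; map_collecting_errors and parse_number are hypothetical names, and items are assumed to be hashable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_number(text):
    return int(text)  # raises ValueError for non-numeric input

def map_collecting_errors(func, items, max_workers=4):
    """Run func over items in threads; return (results, errors) keyed by item."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(func, item): item for item in items}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # keep going; record the failure per item
                errors[item] = exc
    return results, errors

results, errors = map_collecting_errors(parse_number, ["1", "2", "x", "4"])
print(results == {"1": 1, "2": 2, "4": 4})  # True
print(list(errors))                          # ['x']
```

Results arrive in completion order, which is why they are keyed by item rather than returned as a list.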
Common Mistakes
Mistake 1 - Using Threads for CPU-Bound Work
# Wrong: threading CPU-bound work - often SLOWER than sequential
threads = [threading.Thread(target=cpu_heavy_function) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()

# Right: use multiprocessing for CPU-bound
with ProcessPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(cpu_heavy_function, work_items))
Mistake 2 - Relying on the GIL for Thread Safety
# Wrong: assumes GIL makes this safe - it does NOT
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1  # several bytecodes (LOAD_ATTR ... STORE_ATTR) - not atomic

# Right: use a Lock
class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1
Mistake 3 - Confusing asyncio With Parallelism
# Wrong mental model: asyncio runs things "at the same time"
async def wrong_usage():
    # These run CONCURRENTLY but not in PARALLEL
    # Only one is running at any given instant
    await asyncio.gather(cpu_heavy_coro(), cpu_heavy_coro())
    # asyncio does NOT help CPU-bound code

# Right: asyncio for I/O concurrency, multiprocessing for CPU parallelism
async def correct_usage():
    # I/O: asyncio shines
    await asyncio.gather(fetch_url(url1), fetch_url(url2), fetch_url(url3))
    # CPU: offload to process pool
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_heavy_function, data)
Mistake 4 - Forgetting if __name__ == "__main__" for Multiprocessing
# Wrong: under the spawn start method (the default on Windows and macOS), each child
# re-imports this module, reaches Pool() again, and fails with a RuntimeError
from multiprocessing import Pool

def work(n):
    return n * n

with Pool() as pool:  # unguarded: runs again in every spawned child
    results = pool.map(work, range(10))

# Right: guard with __name__ == "__main__"
if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(work, range(10))
Graded Practice Challenges
Level 1 - Predict the Output
Question 1: What does this print, and why?
import threading

results = []

def worker(n):
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))
Show Answer
Output: [0, 1, 4, 9, 16] (always, regardless of thread order)
list.append() is thread-safe in CPython because it is a single C-level operation executed while holding the GIL. All five appends complete without data corruption, and sorted() produces the deterministic output. The order of appends is non-deterministic (could be any order), but the final sorted list is always the same 5 values.
Question 2: What does this print?
import sys
print(sys.getswitchinterval())
Show Answer
Output: 0.005
The default switch interval is 5 milliseconds (0.005 seconds). Every 5ms, a running thread checks whether another thread is waiting for the GIL. If so, it releases the GIL, allowing the waiting thread to acquire it.
Question 3: Which executor should you use?
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
# Task A: Download 100 images from an API
# Task B: Resize and compress those 100 images using Pillow
# Task C: Hash the compressed images for deduplication (pure Python hashlib)
# Which executor for A, B, C?
Show Answer
- Task A (download): ThreadPoolExecutor - I/O-bound; GIL is released during network operations. Threading achieves true concurrency.
- Task B (resize with Pillow): ThreadPoolExecutor - Pillow releases the GIL during image operations. Threading achieves actual CPU parallelism.
- Task C (hashlib in Python): ProcessPoolExecutor - pure Python hashing does not release the GIL; threading would serialize. Use multiprocessing for true parallelism.
Note: hashlib's OpenSSL-backed routines (SHA-256, MD5) actually release the GIL for inputs larger than 2047 bytes, so threads can work for large payloads. For small inputs, the overhead may not matter. When in doubt, profile.
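To try the threaded variant of Task C, the sketch below hashes several large buffers in a thread pool and checks that the digests match a sequential run. The payloads are hypothetical 1 MiB buffers chosen to sit well above the size at which hashlib releases the GIL; whether threads are actually faster on your machine is something to measure.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

# Hypothetical payloads: eight distinct 1 MiB buffers
payloads = [bytes([i]) * (1024 * 1024) for i in range(8)]

sequential = [sha256_hex(p) for p in payloads]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(sha256_hex, payloads))  # map preserves input order

print(threaded == sequential)  # True: same digests either way
```

Each call works on its own buffer and returns a new string, so no locking is needed here.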
Question 4: What is wrong with this code?
import threading

seen = set()
lock = threading.Lock()

def process(item):
    if item not in seen:  # line A
        # ... do expensive work ...
        with lock:
            seen.add(item)  # line B
Show Answer
The check at line A and the add at line B are not atomic. Two threads can both pass the if item not in seen check before either adds to the set. Both then proceed to do "expensive work" for the same item - defeating the deduplication.
Fix: move the entire check-and-add inside the lock:
def process(item):
    with lock:
        if item in seen:
            return
        seen.add(item)
    # ... do expensive work outside the lock ...
This is the "check-then-act" race condition pattern - one of the most common concurrency bugs.
Question 5: True or False - threading with NumPy matrix multiplication achieves real CPU parallelism.
Show Answer
True. NumPy's matrix multiplication (np.dot, np.matmul) delegates to BLAS routines (OpenBLAS, MKL) which execute in C/Fortran and release the GIL. Two threads calling np.dot simultaneously can execute truly in parallel on separate CPU cores, even in standard CPython. The GIL is not a barrier for C extensions that explicitly release it.
Level 2 - Debug Challenge
Find and fix all issues:
import threading
from concurrent.futures import ThreadPoolExecutor

# Bug 1: shared state without locking
request_count = 0

def handle_request(request_id):
    global request_count
    request_count += 1  # not thread-safe
    return f"handled {request_id}"

# Bug 2: wrong executor type for CPU-bound work
def compress_image(image_data):
    # pure Python compression - CPU bound
    return bytes(b ^ 0xFF for b in image_data)

with ThreadPoolExecutor(max_workers=8) as executor:  # wrong executor
    compressed = list(executor.map(compress_image, [b"data"] * 100))

# Bug 3: missing __main__ guard
from multiprocessing import Pool

def cpu_task(n):
    return sum(i**2 for i in range(n))

with Pool(4) as pool:  # will crash on Windows/macOS
    results = pool.map(cpu_task, range(10))

# Bug 4: asyncio misused for CPU-bound work
import asyncio

async def process_all(items):
    tasks = [asyncio.create_task(cpu_coroutine(item)) for item in items]
    return await asyncio.gather(*tasks)
Show Solution
Bug 1 - Shared counter without a lock:
request_count = 0
request_lock = threading.Lock()

def handle_request(request_id):
    global request_count
    with request_lock:
        request_count += 1
    return f"handled {request_id}"
Bug 2 - ThreadPoolExecutor for CPU-bound work:
from concurrent.futures import ProcessPoolExecutor

# CPU-bound work needs ProcessPoolExecutor
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as executor:
        compressed = list(executor.map(compress_image, [b"data"] * 100))
Bug 3 - Missing __main__ guard:
if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(cpu_task, range(10))
Bug 4 - asyncio for CPU-bound work:
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def process_all(items):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Offload CPU-bound work to process pool, await completion
        tasks = [
            loop.run_in_executor(pool, cpu_function, item)
            for item in items
        ]
        return await asyncio.gather(*tasks)
Level 3 - Design Challenge
Design a WorkerPool class that:
- Accepts a mode parameter: "thread" or "process"
- Has a submit(func, *args) method that submits a task and returns a Future
- Has a map(func, items) method that distributes items across workers and returns results in order
- Has a shutdown() method
- Works as a context manager
- For mode="thread", uses ThreadPoolExecutor; for mode="process", uses ProcessPoolExecutor
- Exposes pool.stats() returning {"submitted": N, "completed": N, "failed": N}
Show Reference Solution
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, Future
from typing import Callable, Iterable, TypeVar
import threading

T = TypeVar("T")
R = TypeVar("R")

class WorkerPool:
    """
    Unified interface over ThreadPoolExecutor and ProcessPoolExecutor.
    mode="thread" → I/O-bound tasks (HTTP, database, file operations)
    mode="process" → CPU-bound tasks (image processing, data transformation)
    """

    def __init__(self, mode: str = "thread", max_workers: int | None = None):
        if mode not in ("thread", "process"):
            raise ValueError(f"mode must be 'thread' or 'process', got {mode!r}")
        self._mode = mode
        self._max_workers = max_workers
        self._executor = None
        # One lock guards the stats and the lazy executor creation:
        # submit() may be called from many threads, and done-callbacks
        # run on executor-owned threads
        self._lock = threading.Lock()
        self._submitted = 0
        self._completed = 0
        self._failed = 0

    def _get_executor(self):
        with self._lock:
            if self._executor is None:
                cls = ThreadPoolExecutor if self._mode == "thread" else ProcessPoolExecutor
                self._executor = cls(max_workers=self._max_workers)
            return self._executor

    def _record(self, future: Future) -> None:
        """Done-callback: runs in the parent process once the task finishes."""
        failed = future.cancelled() or future.exception() is not None
        with self._lock:
            if failed:
                self._failed += 1
            else:
                self._completed += 1

    def submit(self, func: Callable[..., R], *args) -> Future:
        """Submit a single task. Returns a Future."""
        with self._lock:
            self._submitted += 1
        # Submit func itself rather than a wrapping closure: closures cannot be
        # pickled, so a wrapper would break mode="process"
        future = self._get_executor().submit(func, *args)
        future.add_done_callback(self._record)
        return future

    def map(self, func: Callable[[T], R], items: Iterable[T]) -> list[R]:
        """Distribute items across workers. Returns results in input order."""
        futures = [self.submit(func, item) for item in items]
        # result() re-raises the first task exception, in input order
        return [future.result() for future in futures]

    def stats(self) -> dict:
        with self._lock:
            return {
                "submitted": self._submitted,
                "completed": self._completed,
                "failed": self._failed,
                "pending": self._submitted - self._completed - self._failed,
            }

    def shutdown(self, wait: bool = True) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=wait)
            self._executor = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.shutdown(wait=True)
        return False

# Usage
if __name__ == "__main__":
    import time

    def fetch(n):
        time.sleep(0.05)  # simulates I/O
        return n * n

    with WorkerPool(mode="thread", max_workers=10) as pool:
        results = pool.map(fetch, range(20))
    print(results[:5])  # [0, 1, 4, 9, 16]
    # Read stats after shutdown so every done-callback has run
    print(pool.stats())  # {'submitted': 20, 'completed': 20, 'failed': 0, 'pending': 0}
Design decisions:
- Done-callbacks (add_done_callback) track completion/failure in the parent process without wrapping the user's function, so tasks stay picklable in process mode
- Stats use a threading.Lock because submit() can be called from multiple threads simultaneously and callbacks run on executor threads
- map() collects all futures before iterating - this preserves input order
- shutdown() is idempotent - calling it multiple times is safe
Key Takeaways
- The GIL is a mutex in CPython that ensures only one thread executes Python bytecodes at a time - it protects CPython's internal reference counts and memory allocator
- The GIL does not make Python operations atomic; counter += 1 compiles to 4 bytecodes and is a race condition under threading
- A running thread is asked to give up the GIL after 5ms by default (sys.getswitchinterval()), and the GIL is released during blocking I/O - this is why threading is effective for I/O-bound tasks
- Threading CPU-bound Python code is not just unhelpful - it is often slower than sequential due to GIL contention overhead
- For CPU-bound parallelism: use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor - separate processes have separate GILs
- C extensions like NumPy and Pillow release the GIL during heavy operations - threading with NumPy achieves genuine CPU parallelism
- asyncio is cooperative concurrency on a single thread - it is not parallelism and does not help CPU-bound code
- threading.Lock is required for any shared mutable state your Python code reads and writes concurrently - the GIL is not a substitute
- Python 3.13 introduced an experimental free-threaded build (no GIL) - not production-ready yet, but signals the direction of the language
- Write thread-safe code regardless of the GIL - applications relying on GIL-as-implicit-lock will break in free-threaded Python
What's Next
Lesson 05 covers reference counting - CPython's primary memory management mechanism. You will learn how ob_refcnt works, why sys.getrefcount() always returns one more than you expect, how reference cycles defeat refcounting, and why del x does not immediately destroy an object.
