Master Python's Global Interpreter Lock at engineering depth - what the GIL protects, why counter += 1 is not atomic, the check interval, I/O vs CPU-bound threading, multiprocessing, C extensions that release the GIL, and Python 3.13 free-threaded mode.

The GIL Explained - What It Is, What It Isn't, and How to Work Around It

Reading time: ~30 minutes | Level: Intermediate → Engineering

Before reading further, predict the output:

import threading

counter = 0

def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()

print(counter)  # ?

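Answer

The output is not guaranteed to be 2,000,000. On CPython versions before 3.10 it is typically a non-deterministic number below that - something like 1,387,241 or 1,823,904. Recent CPython releases (3.10 onward) only honor switch requests at a few points in the eval loop, such as loop back-edges and function calls, so this particular loop often happens to print 2,000,000. That is an implementation detail, not a guarantee, and the race is still there.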

Most engineers expect the GIL to protect this. The GIL does prevent two threads from executing Python bytecodes simultaneously. But counter += 1 is not one bytecode - it compiles to four:

LOAD_GLOBAL   counter       # read counter's current value
LOAD_CONST    1             # push 1 onto the stack
BINARY_OP     +=            # add them (in-place add)
STORE_GLOBAL  counter       # write the result back

Nothing makes these four bytecodes execute as a unit, so a thread switch can land between the read and the write. Thread 1 can read counter = 500, then lose the GIL to Thread 2 which also reads counter = 500, increments to 501, and writes it back. Thread 1 then resumes with its stale value of 500, increments to 501, and overwrites Thread 2's update. One increment is silently lost.

The GIL is not a substitute for application-level locking.

This is one of the most misunderstood aspects of Python. Engineers build multithreaded services expecting the GIL to protect shared state, only to discover data races in production under load. Understanding the GIL at bytecode depth - what it guards, what it does not, when it releases, and how to achieve real parallelism - is essential for writing correct concurrent Python.

What You Will Learn

What the GIL is: a mutex protecting CPython's internal state
Why it exists: CPython's reference counting is not thread-safe without it
What the GIL does NOT protect: your application-level data structures
How counter += 1 desugars to 4 bytecodes and why that matters
The check interval: sys.getswitchinterval() and how to tune it
Why I/O releases the GIL and why threading works for I/O-bound tasks
CPU-bound vs I/O-bound: the GIL is irrelevant for one and harmful for the other
multiprocessing: separate processes, separate GILs, real CPU parallelism
C extensions (NumPy, pandas, Pillow) that release the GIL for true parallelism
Python 3.13 free-threaded mode: current status and tradeoffs

Prerequisites

Lesson 02: Bytecode Inspection - you need to understand that Python code compiles to bytecodes
Lesson 03: Disassembly with dis - reading bytecode output
Familiarity with threading.Thread basics

Part 1 - What the GIL Is

The GIL Defined

The Global Interpreter Lock (GIL) is a mutex - a mutual exclusion lock - that CPython acquires before executing any Python bytecode and releases under specific conditions. Only one thread can hold the GIL at a time. Only the thread holding the GIL can execute Python bytecodes.

The GIL is not a Python language feature. It is an implementation detail of CPython - the reference interpreter written in C. Other implementations take different approaches: Jython and IronPython have no GIL, while PyPy has its own GIL (PyPy-STM was an experimental attempt to remove it).

Why the GIL Exists

CPython's memory management is built on reference counting. Every Python object carries a reference count (ob_refcnt). When you assign a variable, the count increments. When the variable goes out of scope, it decrements. When the count reaches zero, the object is freed.

Reference counting requires reads and writes to ob_refcnt on every object access. Without a global lock, two threads modifying the same object's reference count simultaneously would corrupt it - leading to use-after-free bugs, double frees, and memory corruption at the C level.

# Every one of these operations touches ob_refcnt internally
a = some_object       # ob_refcnt += 1
b = a                 # ob_refcnt += 1
del a                 # ob_refcnt -= 1; if 0: free memory
result = func(b)      # ob_refcnt += 1 (passing b increments it)

The GIL ensures these refcount operations are serialized. It also protects CPython's memory allocator, the bytecode execution loop, and internal data structures like dictionaries and lists from concurrent modification at the C level.
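The reference-count bookkeeping described above can be observed from Python with sys.getrefcount. This is a minimal sketch assuming a standard CPython build; the helper name refcount_steps is illustrative, and because absolute counts vary between versions only the differences are reported:

```python
import sys

def refcount_steps():
    """Return refcount deltas as references to one object come and go."""
    obj = object()
    base = sys.getrefcount(obj)              # baseline count for this object
    alias = obj                              # a second name bound to the same object
    after_bind = sys.getrefcount(obj) - base
    del alias                                # drop the extra reference
    after_del = sys.getrefcount(obj) - base
    return after_bind, after_del

print(refcount_steps())
```

On a standard build the first delta should be 1 and the second 0: binding a name adds one reference and deleting it removes one. These are the updates the GIL serializes.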

What the GIL Protects

The GIL protects:

CPython's internal reference counts - the ob_refcnt field on every PyObject
CPython's memory allocator - pymalloc is not thread-safe without external serialization
CPython's internal data structures - the bytecode interpreter loop, import machinery, sys.modules
Certain Python built-in operations - list.append() is thread-safe because it's a single C-level operation that runs to completion while holding the GIL

What the GIL Does NOT Protect

The GIL does not protect:

Your application-level data structures - dictionaries, lists, counters, flags you write in Python
Multi-step Python operations - any operation that compiles to more than one bytecode
Logic that spans multiple Python statements - check-then-act patterns, read-modify-write

This is the core misunderstanding. Developers see "GIL" and assume thread safety. They are wrong for anything beyond single-bytecode operations.
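One way to check whether a statement is multi-step is to list the bytecode instructions it compiles to. The sketch below uses the standard dis module; the helper name opnames is an assumption for illustration, and the exact instruction list differs between CPython versions:

```python
import dis

def opnames(source):
    """Compile a statement and list the bytecode instruction names it produces."""
    code = compile(source, "<example>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]

ops = opnames("counter += 1")
print(ops)

# The read and the write are separate instructions, so in principle
# another thread can run between them.
print(ops.index("LOAD_NAME") < ops.index("STORE_NAME"))
```

Compiled at module level the statement uses LOAD_NAME and STORE_NAME; inside a function with a global declaration the same pattern appears as LOAD_GLOBAL and STORE_GLOBAL.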

Part 2 - The GIL Release Points

The Check Interval

The GIL is not held indefinitely. CPython releases it periodically to give other threads a chance to run. The interval is controlled by sys.getswitchinterval():

import sys

print(sys.getswitchinterval())   # 0.005 - default 5 milliseconds

# You can change it (rarely a good idea in production)
sys.setswitchinterval(0.001)     # 1ms - more frequent switching
sys.setswitchinterval(0.1)       # 100ms - less frequent switching

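Once another thread has waited for the GIL for one switch interval (5ms by default), it sets a flag asking the running thread to give it up. The running thread notices the request at its next check point in the eval loop and releases the GIL, allowing the waiting thread to acquire it and execute.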

Before Python 3.2, the check interval was measured in bytecodes (every 100 bytecodes). Python 3.2 switched to time-based intervals. The time-based approach is more predictable and reduces contention on I/O-heavy workloads.
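If you do experiment with the switch interval, it is safer to restore the previous value afterwards. A small sketch, assuming a hypothetical context manager named switch_interval built on the two sys functions shown above:

```python
import sys
from contextlib import contextmanager

@contextmanager
def switch_interval(seconds):
    """Temporarily change the GIL switch interval, restoring it on exit."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        sys.setswitchinterval(previous)   # restore even if the body raised

before = sys.getswitchinterval()
with switch_interval(0.001):
    inside = sys.getswitchinterval()      # about 0.001 while the block runs
after = sys.getswitchinterval()
print(before, inside, after)
```

Scoping the change this way keeps an experiment from leaking a process-wide setting into unrelated code.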

I/O Operations Release the GIL

The most important GIL release point is I/O. Any time a Python thread performs a blocking I/O operation - reading from a file, making a network request, waiting on a socket - it releases the GIL before the system call and reacquires it after:

import threading
import urllib.request

# Both threads release the GIL during the HTTP request
# They execute truly concurrently at the OS level
def fetch(url):
    with urllib.request.urlopen(url) as response:   # GIL released here
        return response.read()

t1 = threading.Thread(target=fetch, args=("https://httpbin.org/delay/1",))
t2 = threading.Thread(target=fetch, args=("https://httpbin.org/delay/1",))

t1.start(); t2.start()
t1.join(); t2.join()
# Takes ~1 second, not ~2 seconds - true concurrent I/O

During the urlopen() call, Thread 1 releases the GIL and blocks in the kernel waiting for the network. Thread 2 acquires the GIL and starts its own request. Both requests are in-flight simultaneously at the OS level. This is why threading works well for I/O-bound tasks.

The GIL and `time.sleep()`

time.sleep() also releases the GIL:

import threading, time

def worker(n):
    time.sleep(1)   # GIL released during sleep - other threads run
    print(f"Worker {n} done")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
# All 5 workers sleep concurrently - total time ~1s, not ~5s

Part 3 - Why `counter += 1` Loses Updates

Bytecode-Level Analysis

Let's disassemble the increment function to see exactly what bytecodes run:

import dis

def increment():
    global counter
    counter += 1

dis.dis(increment)

Output (Python 3.12, abridged - the leading RESUME instruction and byte offsets are omitted, and details vary by version):

  3           LOAD_GLOBAL    0 (counter)
              LOAD_CONST     1 (1)
              BINARY_OP     13 (+=)
              STORE_GLOBAL   0 (counter)
              RETURN_CONST   0 (None)

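Four bytecodes do the work (RETURN_CONST only ends the function), and nothing makes them execute as a unit. If a thread switch lands after the load and before the store, the interleaving looks like this:

Thread 1: LOAD_GLOBAL counter        (reads 500)
Thread 1: loses the GIL
Thread 2: LOAD_GLOBAL counter        (reads 500)
Thread 2: BINARY_OP, STORE_GLOBAL    (writes 501)
Thread 1: BINARY_OP, STORE_GLOBAL    (writes 501 from its stale 500)

Thread 1 read 500, lost the GIL, Thread 2 read the same 500, both computed 501, and both wrote 501. One increment was lost. This can happen whenever a switch occurs between LOAD_GLOBAL and STORE_GLOBAL.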
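Because the timing of a real switch is unpredictable, the lost update can be hard to reproduce on demand. The sketch below forces the same read-read-write-write interleaving with a threading.Barrier so the outcome is deterministic. The function name demo_lost_update and the dict-based counter are illustrative choices, not part of the lesson's code:

```python
import threading

def demo_lost_update():
    """Force the read-read-write-write interleaving and return the final value."""
    state = {"counter": 500}
    both_have_read = threading.Barrier(2)

    def racy_increment():
        stale = state["counter"]        # step 1: read (plays the role of LOAD_GLOBAL)
        both_have_read.wait()           # hold until the other thread has read too
        state["counter"] = stale + 1    # step 2: write back (plays the role of STORE_GLOBAL)

    threads = [threading.Thread(target=racy_increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]

print(demo_lost_update())  # 501: two increments ran, one was lost
```

Neither thread can pass the barrier until both have read 500, so both write 501 every time. The GIL is held correctly throughout, and the update is lost anyway.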

The Fix: `threading.Lock`

To make the counter thread-safe, wrap the read-modify-write in a threading.Lock:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(1_000_000):
        with lock:
            counter += 1   # only one thread executes this at a time

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join(); t2.join()

print(counter)   # always 2,000,000

Or use threading.local() for per-thread state, or redesign to avoid shared mutable state entirely.
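As one possible way to avoid shared mutable state, each thread can count in a local variable and hand back a subtotal that is combined in a single thread. This is a sketch under that assumption; count_locally and total_count are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

def count_locally(n):
    """Count in a local variable; nothing is shared while the thread runs."""
    local_total = 0
    for _ in range(n):
        local_total += 1
    return local_total

def total_count(workers, per_worker):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task returns its own subtotal; the sum happens in one thread.
        return sum(pool.map(count_locally, [per_worker] * workers))

print(total_count(2, 100_000))  # 200000
```

No lock is needed here because no two threads ever write to the same variable.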

:::danger The GIL Does NOT Protect Your Application-Level Data Structures
A threading.Lock is required for any shared mutable state your Python code reads and writes across threads. The GIL only protects CPython internals. counter += 1, dict[key] = value after a check, any multi-step operation - these are all race conditions without a Lock.

# This looks safe but is NOT - two threads can both pass the check
# before either writes, leading to duplicate processing
if key not in results:          # the check (CONTAINS_OP) ...
    results[key] = compute(key) # ... and the write (STORE_SUBSCR) are separate steps

# Safe version
with lock:
    if key not in results:
        results[key] = compute(key)

:::

Part 4 - CPU-Bound vs I/O-Bound

The Fundamental Split

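The GIL's impact depends entirely on what your program spends time doing: CPU-bound threads spend their time holding (or waiting for) the GIL, while I/O-bound threads spend most of theirs blocked in calls that have released it.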

CPU-Bound: Threading Makes It Worse

import threading, time

def cpu_work(n):
    """Pure CPU - no I/O, no sleep."""
    result = 0
    for i in range(n):
        result += i * i
    return result

# Sequential: each call gets full CPU
start = time.perf_counter()
cpu_work(10_000_000)
cpu_work(10_000_000)
sequential_time = time.perf_counter() - start

# Threaded: GIL forces serialization + adds overhead
start = time.perf_counter()
t1 = threading.Thread(target=cpu_work, args=(10_000_000,))
t2 = threading.Thread(target=cpu_work, args=(10_000_000,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.3f}s")
print(f"Threaded:   {threaded_time:.3f}s")
# Illustrative output (exact timings vary by machine and Python version):
# Sequential: 1.234s
# Threaded:   1.487s   ← no faster, often slower, due to GIL contention overhead

Two threads competing for the GIL on CPU-bound work gain nothing and are often slower than sequential execution - each GIL handoff has overhead, and threads waste cycles waiting.

I/O-Bound: Threading Shines

import threading, time, urllib.request

URLs = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

results = []

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as r:
        results.append(r.status)

# Sequential: each request waits for the previous
start = time.perf_counter()
for url in URLs:
    fetch(url)
print(f"Sequential: {time.perf_counter() - start:.1f}s")   # ~4.0s

# Threaded: all requests in-flight simultaneously
results.clear()
start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(url,)) for url in URLs]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded:   {time.perf_counter() - start:.1f}s")   # ~1.0s

:::warning Threading Does NOT Make CPU-Bound Python Code Faster
Adding threads to CPU-bound Python code is one of the most common performance mistakes. Due to the GIL, only one thread executes Python bytecodes at a time. Multiple CPU-bound threads compete for the GIL, slow each other down with lock contention, and often finish later than sequential code would. Always profile before threading CPU-bound work.
:::

:::note asyncio Is Not About Parallelism
asyncio is cooperative concurrency on a single thread. There is no parallelism, no GIL contention, and no thread overhead. The event loop runs coroutines one at a time, switching between them at await points. asyncio is excellent for I/O-bound tasks with many concurrent connections (thousands of HTTP requests, WebSocket connections, database queries). It is the wrong tool for CPU-bound work.

import asyncio
import aiohttp   # third-party: pip install aiohttp

urls = ["https://httpbin.org/delay/1"] * 4   # same test endpoint as the threading example

async def fetch(session, url):
    async with session.get(url) as response:   # cooperative yield here
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)   # all run concurrently

results = asyncio.run(main())

:::
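The single-thread claim can be checked with the standard library alone. In this sketch asyncio.sleep stands in for a network call, and fake_request and run_all are hypothetical names; it records which thread ran each coroutine and how long the whole batch took:

```python
import asyncio
import threading
import time

async def fake_request(i, delay=0.1):
    """Stand-in for an I/O call; asyncio.sleep yields control to the event loop."""
    await asyncio.sleep(delay)
    return i, threading.get_ident()

async def run_all(n=5):
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    thread_ids = {tid for _, tid in results}       # distinct threads that ran coroutines
    return [i for i, _ in results], len(thread_ids), elapsed

order, thread_count, elapsed = asyncio.run(run_all())
print(order, thread_count, f"{elapsed:.2f}s")
```

All five waits overlap, so the batch takes roughly one delay rather than five, and every coroutine reports the same thread id: concurrency without parallelism.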

Part 5 - Working Around the GIL

Multiprocessing: Separate Processes, Separate GILs

The canonical solution for CPU-bound parallelism in Python is multiprocessing. Each process has its own Python interpreter, its own GIL, and its own memory space. True CPU parallelism is achieved:

from multiprocessing import Pool
import time

def cpu_work(n):
    result = 0
    for i in range(n):
        result += i * i
    return result

if __name__ == "__main__":
    # Using Pool.map: distribute work across CPU cores
    start = time.perf_counter()
    with Pool(processes=4) as pool:
        results = pool.map(cpu_work, [5_000_000] * 4)
    print(f"Multiprocessing: {time.perf_counter() - start:.3f}s")
    # On a 4-core machine: ~0.5s vs ~2.0s sequential

concurrent.futures.ProcessPoolExecutor provides a higher-level interface with the same parallelism:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time

def process_chunk(data):
    return sum(x * x for x in data)

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]   # split into 4 chunks

if __name__ == "__main__":
    # ProcessPoolExecutor: real CPU parallelism
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))

    # ThreadPoolExecutor: good for I/O, not CPU
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))

:::tip For CPU-Bound Parallelism: Use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor
These are the correct tools for CPU-bound work in Python. The overhead of process creation is real (tens to hundreds of milliseconds, depending on platform and start method), but it is a one-time cost amortized across the workload. For long-running batch jobs, the parallelism gain far exceeds the startup cost. Use ProcessPoolExecutor in new code - it has a cleaner API and integrates with asyncio.
:::
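To amortize that startup cost, it usually helps to hand each worker one large chunk rather than many tiny tasks. A minimal sketch of that idea follows; split_range, sum_squares, and parallel_sum_squares are hypothetical helpers, and one chunk per worker is an assumption rather than a tuned setting:

```python
from concurrent.futures import ProcessPoolExecutor

def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs."""
    step, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + step + (1 if i < extra else 0)   # spread the remainder
        bounds.append((start, stop))
        start = stop
    return bounds

def sum_squares(bounds):
    """CPU-bound work on one chunk; module-level so it can be pickled."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))

def parallel_sum_squares(n, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, split_range(n, workers)))

if __name__ == "__main__":
    expected = sum(i * i for i in range(1_000_000))
    print(parallel_sum_squares(1_000_000) == expected)
```

Only a pair of integers crosses the process boundary per task, which keeps pickling overhead small compared with sending the data itself.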

Shared Memory Between Processes

Separate processes cannot share Python objects directly. For communication, use multiprocessing primitives:

from multiprocessing import Process, Value, Array, Manager
import ctypes

# Value: a single shared value (uses shared memory - fast)
def increment_shared(counter, n):
    for _ in range(n):
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":   # required: child processes may re-import this module
    counter = Value(ctypes.c_int, 0)
    p1 = Process(target=increment_shared, args=(counter, 100_000))
    p2 = Process(target=increment_shared, args=(counter, 100_000))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print(counter.value)   # 200,000 - correct with lock

    # Array: shared array (fast, uses shared memory)
    shared_array = Array(ctypes.c_double, [0.0] * 10)

    # Manager: arbitrary Python objects (slower - uses pickling + proxy objects)
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_list = manager.list()

Use Value and Array for performance-critical shared state (they use actual shared memory). Use Manager for convenience when sharing complex Python objects (dicts, lists) at the cost of pickling overhead.

C Extensions That Release the GIL

Many scientific Python libraries release the GIL during expensive computations, allowing true CPU parallelism in threads:

import numpy as np
import threading
import time

# NumPy releases the GIL during array operations
def numpy_work(size):
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    return np.dot(a, b)   # BLAS routines release the GIL

# These two matrix multiplications run in TRUE parallel
start = time.perf_counter()
t1 = threading.Thread(target=numpy_work, args=(500,))
t2 = threading.Thread(target=numpy_work, args=(500,))
t1.start(); t2.start()
t1.join(); t2.join()
parallel_time = time.perf_counter() - start

start = time.perf_counter()
numpy_work(500)
numpy_work(500)
sequential_time = time.perf_counter() - start

print(f"Parallel (threads + NumPy): {parallel_time:.3f}s")
print(f"Sequential:                 {sequential_time:.3f}s")
# With NumPy, threaded is often faster - BLAS calls release the GIL.
# The gap shrinks if your BLAS build already uses multiple cores per call.

Libraries that release the GIL during heavy operations:

NumPy - array math, linear algebra (BLAS/LAPACK routines)
pandas - many operations delegate to NumPy
Pillow - image encoding/decoding, pixel operations
lxml - XML parsing
cryptography - hash operations, encryption
SQLite (sqlite3) - query execution releases the GIL

This is why data science workflows using threading with NumPy workloads can achieve genuine parallelism.
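The same effect can be tried with the standard library alone. The hashlib documentation states that the GIL is released while hashing data larger than 2047 bytes, so the sketch below hashes several large buffers from a thread pool. The payloads are made up for illustration, and no particular speedup is claimed:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_hex(data):
    """Hash one buffer; hashlib can release the GIL for large inputs."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical payloads: four 1 MiB buffers with different fill bytes.
payloads = [bytes([i]) * (1 << 20) for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(sha256_hex, payloads))

sequential = [sha256_hex(p) for p in payloads]
print(threaded == sequential)  # True: same digests, and map preserves input order
```

Whether the threaded version is actually faster depends on input size and core count, so measure before relying on it.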

Part 6 - Python 3.13 Free-Threaded Mode

The No-GIL Build

Python 3.13 introduced an experimental free-threaded build (PEP 703) - a CPython build with the GIL disabled. It is an opt-in compile flag (--disable-gil), not the default.

import sys

# Check if running in free-threaded mode
# sys._is_gil_enabled() exists on Python 3.13+ only
print(sys._is_gil_enabled())   # False when the GIL is off in a free-threaded build, True otherwise

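Free-threaded Python replaces the global lock with finer-grained mechanisms described in PEP 703: per-object locks for built-in containers such as dict and list, and biased reference counting (conceptually similar to biased locking in JVMs), which keeps refcount updates cheap for the thread that owns an object while staying thread-safe for other threads.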
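Code that needs to run on both old and new interpreters can probe for free-threading defensively. This is a sketch, assuming the Py_GIL_DISABLED build variable and sys._is_gil_enabled() behave as documented for 3.13; gil_status is a hypothetical helper name:

```python
import sys
import sysconfig

def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 in free-threaded builds, and 0 or missing otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists only on 3.13+; older versions always have a GIL.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

print(gil_status())
```

The two values are reported separately because a free-threaded build can still run with the GIL switched back on.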

Current Status (Python 3.13, 2024)

Available as an experimental opt-in (python3.13t binary in some distributions)
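Single-threaded code runs slower: PEP 703 targets roughly 5-10% overhead, but the 3.13 build is further from that (the official free-threading HOWTO cites about 40% on the pyperformance suite) because the specializing adaptive interpreter is disabled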
Many C extensions must be updated to work correctly without the GIL
NumPy, pandas, Cython have ongoing work to support free-threaded builds
Not recommended for production use as of Python 3.13 - too many ecosystem compatibility issues
The goal is full stability and ecosystem support over the next few releases; no specific release is committed to making it the default

What It Means for Your Code

When free-threaded mode becomes stable:

# Today: GIL prevents this from actually running in parallel
# With free-threaded: this WILL run in parallel - and expose all your race conditions

import threading

shared_dict = {}

def update_dict(key, value):
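    # TODAY: a single dict store is atomic under the GIL
    # FREE-THREADED: one store stays safe (built-in containers are locked per object),
    # but multi-step logic around it (check-then-set, read-modify-write) needs a lock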
    shared_dict[key] = value

# Write thread-safe code NOW - it's correct under the GIL and
# will remain correct in free-threaded Python

The lesson: write thread-safe code regardless of the GIL. Applications that relied on the GIL for implicit serialization will have race conditions in free-threaded Python.

Part 7 - A Production-Correct Concurrent Pattern

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from typing import Callable, TypeVar, Iterable
import time

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(
    func: Callable[[T], R],
    items: Iterable[T],
    *,
    mode: str = "auto",
    max_workers: int | None = None,
) -> list[R]:
    """
    Apply func to each item in parallel, choosing the right executor.

    mode="io"       → ThreadPoolExecutor (I/O-bound work)
    mode="cpu"      → ProcessPoolExecutor (CPU-bound work)
    mode="auto"     → ThreadPoolExecutor (safe default; use "cpu" explicitly)

    Production use: API fan-out, batch database queries (io),
                    image processing, data parsing (cpu).
    """
    if mode in ("io", "auto"):
        executor_cls = ThreadPoolExecutor
    elif mode == "cpu":
        executor_cls = ProcessPoolExecutor
    else:
        raise ValueError(f"mode must be 'io', 'cpu', or 'auto', got {mode!r}")

    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


# I/O-bound: fetch user profiles from an API
def fetch_user(user_id: int) -> dict:
    time.sleep(0.1)   # simulates HTTP request latency
    return {"id": user_id, "name": f"User {user_id}"}

# CPU-bound: compress images, parse large JSON files
def heavy_compute(n: int) -> int:
    return sum(i * i for i in range(n))

if __name__ == "__main__":   # required for ProcessPoolExecutor; keeps children from re-running this
    users = parallel_map(fetch_user, range(20), mode="io", max_workers=10)
    print(f"Fetched {len(users)} users")

    results = parallel_map(heavy_compute, [500_000] * 8, mode="cpu")
    print(f"Computed {len(results)} results")

Common Mistakes

Mistake 1 - Using Threads for CPU-Bound Work

# Wrong: threading CPU-bound work - often SLOWER than sequential
threads = [threading.Thread(target=cpu_heavy_function) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()

# Right: use multiprocessing for CPU-bound
with ProcessPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(cpu_heavy_function, work_items))

Mistake 2 - Relying on the GIL for Thread Safety

# Wrong: assumes GIL makes this safe - it does NOT
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1   # separate load, add, and store bytecodes - not atomic

# Right: use a Lock
class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

Mistake 3 - Confusing asyncio With Parallelism

# Wrong mental model: asyncio runs things "at the same time"
async def wrong_usage():
    # These run CONCURRENTLY but not in PARALLEL
    # Only one is running at any given instant
    await asyncio.gather(cpu_heavy_coro(), cpu_heavy_coro())
    # asyncio does NOT help CPU-bound code

# Right: asyncio for I/O concurrency, multiprocessing for CPU parallelism
async def correct_usage():
    # I/O: asyncio shines
    await asyncio.gather(fetch_url(url1), fetch_url(url2), fetch_url(url3))

    # CPU: offload to process pool
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_heavy_function, data)

Mistake 4 - Forgetting `if __name__ == "__main__"` for Multiprocessing

# Wrong: with the spawn start method (default on Windows and macOS), each child
# re-imports this module and reaches Pool() again; CPython aborts with a RuntimeError
from multiprocessing import Pool

def work(n):
    return n * n

with Pool() as pool:   # ERROR on Windows/macOS - spawns new interpreter which
    results = pool.map(work, range(10))  # re-imports this module, re-runs Pool()

# Right: guard with __name__ == "__main__"
if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(work, range(10))
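Which behavior you see depends on the process start method in use. A small sketch for checking it follows; describe_start_method is a hypothetical helper, and the guard is worth keeping whichever method is reported:

```python
import multiprocessing as mp

def describe_start_method():
    """Return the start method in use and whether children re-import __main__."""
    method = mp.get_start_method()
    reimports_main = method in ("spawn", "forkserver")
    return method, reimports_main

print(describe_start_method())
```

With fork the child inherits the parent's memory, so the missing guard goes unnoticed until the code runs on a platform that spawns.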

Graded Practice Challenges

Level 1 - Predict the Output

Question 1: What does this print, and why?

import threading

results = []

def worker(n):
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()

print(sorted(results))

Answer

Output: [0, 1, 4, 9, 16] (always, regardless of thread order)

list.append() is thread-safe in CPython because it is a single C-level operation executed while holding the GIL. All five appends complete without data corruption, and sorted() produces the deterministic output. The order of appends is non-deterministic (could be any order), but the final sorted list is always the same 5 values.

Question 2: What does this print?

import sys
print(sys.getswitchinterval())

Answer

Output: 0.005

The default switch interval is 5 milliseconds (0.005 seconds). When another thread has been waiting for the GIL that long, it requests a handoff, and the running thread releases the GIL at its next check point so the waiting thread can acquire it.

Question 3: Which executor should you use?

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# Task A: Download 100 images from an API
# Task B: Resize and compress those 100 images using Pillow
# Task C: Hash the compressed images for deduplication (pure Python hashlib)

# Which executor for A, B, C?

Answer

Task A (download): ThreadPoolExecutor - I/O-bound; GIL is released during network operations. Threading achieves true concurrency.
Task B (resize with Pillow): ThreadPoolExecutor - Pillow releases the GIL during image operations. Threading achieves actual CPU parallelism.
Task C (hashing): it depends on where the hashing runs. A hash loop written in pure Python never releases the GIL, so threads would serialize and ProcessPoolExecutor is the right choice.

Note: the standard hashlib module is implemented in C and, per its documentation, releases the GIL while hashing data larger than 2047 bytes, so for image-sized inputs ThreadPoolExecutor can work as well. For small inputs, the overhead may not matter. When in doubt, profile.

Question 4: What is wrong with this code?

import threading

seen = set()
lock = threading.Lock()

def process(item):
    if item not in seen:      # line A
        # ... do expensive work ...
        with lock:
            seen.add(item)    # line B

Answer

The check at line A and the add at line B are not atomic. Two threads can both pass the if item not in seen check before either adds to the set. Both then proceed to do "expensive work" for the same item - defeating the deduplication.

Fix: move the entire check-and-add inside the lock:

def process(item):
    with lock:
        if item in seen:
            return
        seen.add(item)
    # ... do expensive work outside the lock ...

This is the "check-then-act" race condition pattern - one of the most common concurrency bugs.

Question 5: True or False - threading with NumPy matrix multiplication achieves real CPU parallelism.

Answer

True. NumPy's matrix multiplication (np.dot, np.matmul) delegates to BLAS routines (OpenBLAS, MKL) which execute in C/Fortran and release the GIL. Two threads calling np.dot simultaneously can execute truly in parallel on separate CPU cores, even in standard CPython. The GIL is not a barrier for C extensions that explicitly release it.

Level 2 - Debug Challenge

Find and fix all issues:

import threading
from concurrent.futures import ThreadPoolExecutor

# Bug 1: shared state without locking
request_count = 0

def handle_request(request_id):
    global request_count
    request_count += 1    # not thread-safe
    return f"handled {request_id}"

# Bug 2: wrong executor type for CPU-bound work
def compress_image(image_data):
    # pure Python compression - CPU bound
    return bytes(b ^ 0xFF for b in image_data)

with ThreadPoolExecutor(max_workers=8) as executor:  # wrong executor
    compressed = list(executor.map(compress_image, [b"data"] * 100))

# Bug 3: missing __main__ guard
from multiprocessing import Pool

def cpu_task(n):
    return sum(i**2 for i in range(n))

with Pool(4) as pool:   # will crash on Windows/macOS
    results = pool.map(cpu_task, range(10))

# Bug 4: asyncio misused for CPU-bound work
import asyncio

async def process_all(items):
    tasks = [asyncio.create_task(cpu_coroutine(item)) for item in items]
    return await asyncio.gather(*tasks)

Solution

Bug 1 - Shared counter without a lock:

request_count = 0
request_lock = threading.Lock()

def handle_request(request_id):
    global request_count
    with request_lock:
        request_count += 1
    return f"handled {request_id}"

Bug 2 - ThreadPoolExecutor for CPU-bound work:

from concurrent.futures import ProcessPoolExecutor

# CPU-bound work needs ProcessPoolExecutor
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as executor:
        compressed = list(executor.map(compress_image, [b"data"] * 100))

Bug 3 - Missing __main__ guard:

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(cpu_task, range(10))

Bug 4 - asyncio for CPU-bound work:

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def process_all(items):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Offload CPU-bound work to process pool, await completion
        tasks = [
            loop.run_in_executor(pool, cpu_function, item)
            for item in items
        ]
        return await asyncio.gather(*tasks)

Level 3 - Design Challenge

Design a WorkerPool class that:

Accepts a mode parameter: "thread" or "process"
Has a submit(func, *args) method that submits a task and returns a Future
Has a map(func, items) method that distributes items across workers and returns results in order
Has a shutdown() method
Works as a context manager
For mode="thread", uses ThreadPoolExecutor; for mode="process", uses ProcessPoolExecutor
Exposes pool.stats() returning {"submitted": N, "completed": N, "failed": N}

Reference Solution

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, Future
from typing import Callable, Iterable, TypeVar, Any
import threading

T = TypeVar("T")
R = TypeVar("R")


class WorkerPool:
    """
    Unified interface over ThreadPoolExecutor and ProcessPoolExecutor.

    mode="thread"  → I/O-bound tasks (HTTP, database, file operations)
    mode="process" → CPU-bound tasks (image processing, data transformation)
    """

    def __init__(self, mode: str = "thread", max_workers: int | None = None):
        if mode not in ("thread", "process"):
            raise ValueError(f"mode must be 'thread' or 'process', got {mode!r}")

        self._mode = mode
        self._max_workers = max_workers
        self._executor = None

        # Stats tracking - use a lock since submit() may be called from threads
        self._stats_lock = threading.Lock()
        self._submitted = 0
        self._completed = 0
        self._failed = 0

    def _get_executor(self):
        if self._executor is None:
            cls = ThreadPoolExecutor if self._mode == "thread" else ProcessPoolExecutor
            self._executor = cls(max_workers=self._max_workers)
        return self._executor

    def _wrap(self, func: Callable, *args) -> Callable:
        """Wrap func to track completion stats."""
        def tracked():
            try:
                result = func(*args)
                with self._stats_lock:
                    self._completed += 1
                return result
            except Exception:
                with self._stats_lock:
                    self._failed += 1
                raise
        return tracked

    def submit(self, func: Callable[..., R], *args) -> Future:
        """Submit a single task. Returns a Future."""
        with self._stats_lock:
            self._submitted += 1
        return self._get_executor().submit(self._wrap(func, *args))

    def map(self, func: Callable[[T], R], items: Iterable[T]) -> list[R]:
        """Distribute items across workers. Returns results in input order."""
        items = list(items)
        with self._stats_lock:
            self._submitted += len(items)

        futures = [
            self._get_executor().submit(self._wrap(func, item))
            for item in items
        ]

        # Collect in submission order. _wrap has already counted any failure,
        # so the first exception simply propagates to the caller.
        return [future.result() for future in futures]

    def stats(self) -> dict:
        with self._stats_lock:
            return {
                "submitted": self._submitted,
                "completed": self._completed,
                "failed": self._failed,
                "pending": self._submitted - self._completed - self._failed,
            }

    def shutdown(self, wait: bool = True) -> None:
        if self._executor is not None:
            self._executor.shutdown(wait=wait)
            self._executor = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.shutdown(wait=True)
        return False


# Usage
if __name__ == "__main__":
    import time

    def fetch(n):
        time.sleep(0.05)   # simulates I/O
        return n * n

    with WorkerPool(mode="thread", max_workers=10) as pool:
        results = pool.map(fetch, range(20))
        print(results[:5])       # [0, 1, 4, 9, 16]
        print(pool.stats())      # {'submitted': 20, 'completed': 20, 'failed': 0, 'pending': 0}

Design decisions:

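_wrap intercepts each task to track completion/failure stats without modifying the user's function; because it returns a closure, which cannot be pickled, this tracking works as written only with mode="thread"
Stats use a threading.Lock because worker threads update the counters and submit() can be called from multiple threads simultaneously
map() submits every item before waiting on any result, so tasks run concurrently; reading the futures in submission order preserves input order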
shutdown() is idempotent - calling it multiple times is safe
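
One way to make stats tracking independent of the executor type is to count outcomes from a done-callback rather than from a wrapper closure. The sketch below is an illustration, not part of the WorkerPool above: FutureStats, square, boom, and run_demo are hypothetical names, and the only library feature it relies on is Future.add_done_callback from concurrent.futures. Since the callback runs in the process that added it, only the user's function would ever need to be pickled.

```python
import threading
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class FutureStats:
    """Hypothetical helper: counts task outcomes from done-callbacks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.completed = 0
        self.failed = 0

    def track(self, future: Future) -> Future:
        # The callback is invoked in the process that added it, so nothing
        # here (closure, lock, self) is ever sent to a worker process.
        future.add_done_callback(self._record)
        return future

    def _record(self, future: Future) -> None:
        # Check cancelled() first: exception() raises on a cancelled future.
        failed = future.cancelled() or future.exception() is not None
        with self._lock:
            if failed:
                self.failed += 1
            else:
                self.completed += 1


def square(n: int) -> int:
    return n * n


def boom(n: int) -> int:
    raise ValueError(n)


def run_demo(executor: Executor) -> tuple[int, int]:
    stats = FutureStats()
    with executor:
        for n in range(5):
            stats.track(executor.submit(square, n))
        stats.track(executor.submit(boom, 0))
    # Leaving the with-block calls shutdown(wait=True), so every task and
    # its callback has finished before the counters are read.
    return stats.completed, stats.failed


if __name__ == "__main__":
    print(run_demo(ThreadPoolExecutor(max_workers=2)))
```

Passing a ProcessPoolExecutor to run_demo should behave the same way, provided the submitted functions are defined at module level. One trade-off: a callback can run slightly after future.result() returns, so counters read immediately after result() may lag by a task; reading them after shutdown avoids that.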

Key Takeaways

The GIL is a mutex in CPython that ensures only one thread executes Python bytecodes at a time - it protects CPython's internal reference counts and memory allocator
The GIL does not make Python operations atomic; counter += 1 compiles to 4 bytecodes and is a data race under threading
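A running thread is asked to drop the GIL after the switch interval (5 ms by default, see sys.getswitchinterval()) when another thread is waiting, and blocking I/O calls release it while they wait - this is why threading is effective for I/O-bound tasks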
Threading CPU-bound Python code is not just unhelpful - it is often slower than sequential due to GIL contention overhead
For CPU-bound parallelism: use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor - separate processes have separate GILs
C extensions like NumPy and Pillow release the GIL during many heavy operations - threads that spend their time inside such calls can achieve genuine CPU parallelism
asyncio is cooperative concurrency on a single thread - it is not parallelism and does not help CPU-bound code
threading.Lock is required for any shared mutable state your Python code reads and writes concurrently - the GIL is not a substitute
Python 3.13 introduced an experimental free-threaded build (no GIL) - not production-ready yet, but signals the direction of the language
Write thread-safe code regardless of the GIL - applications relying on GIL-as-implicit-lock will break in free-threaded Python

What's Next

Lesson 05 covers reference counting - CPython's primary memory management mechanism. You will learn how ob_refcnt works, why sys.getrefcount() always returns one more than you expect, how reference cycles defeat refcounting, and why del x does not immediately destroy an object.
