Practice: The GIL Explained

11 problems · 4 Easy · 4 Medium · 3 Hard · 45–60 min

Easy

#1 What Does the GIL Actually Protect? (Easy)
Tags: GIL, basics, thread-safety

Predict which statements about the GIL are True. Understanding what the GIL does and does not protect is the foundation of safe concurrent Python.

Python
# Statement evaluations — predict True or False for each

# Statement 1: The GIL ensures that CPython's internal reference counts
# are updated atomically, preventing corruption of the interpreter itself.
statement_1 = True

# Statement 2: Because of the GIL, only one thread can execute Python
# bytecodes at any given moment in a single CPython process.
statement_2 = True

# Statement 3: The GIL makes list.append() and dict[key] = value
# fully thread-safe for your application-level data — no additional
# locking is needed when multiple threads share a list or dict.
statement_3 = False

print(statement_1)
print(statement_2)
print(statement_3)
Solution
True
True
False

Breaking down each statement:

Statement 1 — True: The GIL is a mutex (mutual exclusion lock) on the entire CPython interpreter. Its primary historical motivation was protecting CPython's reference counting system. Every Python object has an ob_refcnt field in C. Without the GIL, two threads could simultaneously try to decrement the same object's reference count, producing a race condition that corrupts interpreter state and causes segfaults.

Statement 2 — True: Only one thread can hold the GIL at a time, and a thread must hold the GIL to execute Python bytecodes. Standard CPython therefore never executes Python bytecodes in parallel within one process. The exception is the optional free-threaded build introduced in Python 3.13, which runs without a GIL.

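Statement 3 — False: The GIL serializes bytecode execution, but it can be handed to another thread between bytecodes, so it guarantees nothing about an operation that spans more than one step. A single list.append() or dict[key] = value runs as one C-level operation and will not corrupt the container. Compound operations built from them (check-then-append, read-modify-write, iterating while another thread mutates) can still interleave with other threads and observe or produce inconsistent state. The claim that no additional locking is needed is therefore false.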

The practical rule: The GIL protects CPython's C-level interpreter state. It does NOT protect your Python-level objects. Always use threading.Lock, threading.RLock, or thread-safe data structures like queue.Queue when sharing mutable state across threads.
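As a minimal sketch of the practical rule, the example below guards a check-then-act sequence with threading.Lock. The Inventory class, the worker function, and the numbers are hypothetical and chosen only for illustration.

```python
import threading

class Inventory:
    """Hypothetical shared object whose check-then-act step is guarded by a lock."""

    def __init__(self, stock):
        self.stock = stock
        self._lock = threading.Lock()

    def take_one(self):
        # The check and the update must run as one unit; otherwise two
        # threads could both pass the check and drive stock below zero.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False

def worker(inventory, taken, slot):
    # Each thread writes only to its own slot, so `taken` needs no lock.
    taken[slot] = sum(inventory.take_one() for _ in range(50))

inventory = Inventory(stock=100)
taken = [0] * 4
threads = [threading.Thread(target=worker, args=(inventory, taken, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(taken), inventory.stock)  # 100 0
```

Four threads make 200 attempts against a stock of 100. With the lock in place exactly 100 attempts succeed and the stock never goes negative.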

Expected Output
True\nTrue\nFalse
Hints

Hint 1: The GIL protects CPython internals — specifically reference counting data structures.

Hint 2: The GIL does NOT protect your application-level Python objects like lists or dicts.

Hint 3: Only one thread can execute Python bytecodes at a time — but the GIL can release between any two bytecodes.


#2 sys.getswitchinterval() — The GIL Check Interval (Easy)
Tags: GIL, check-interval, sys.getswitchinterval

Inspect and tune the GIL switch interval. The switch interval controls how often the GIL is offered to other threads.

Python
import sys

# Check the default switch interval
interval = sys.getswitchinterval()
print(interval)

# Verify it's the standard default
print(interval == 0.005)

# Lower the interval to make the GIL switch more aggressively
# (threads get more preemption chances — can improve I/O throughput
#  but increases context-switch overhead)
sys.setswitchinterval(0.001)
new_interval = sys.getswitchinterval()
print(new_interval < interval)
Solution
0.005
True
True

How the switch interval works in Python 3.2+:

Before Python 3.2, the GIL used a "check every N bytecodes" model (sys.getcheckinterval(), default 100 bytecodes). This was problematic: a thread executing expensive bytecodes (like a long sort) might hold the GIL far longer than one executing cheap bytecodes. I/O-bound threads could starve.

Python 3.2 replaced it with a time-based model. When another thread is waiting for the GIL and the switch interval (5 milliseconds by default) elapses, a "drop request" is set. The running thread checks this flag and voluntarily releases the GIL at the next safe point in the bytecode loop, giving other threads a chance to acquire it.

When to tune:

  • Lower interval (e.g. 0.001): More frequent GIL switching. Better fairness for I/O-bound threads. Higher context-switch overhead per unit of CPU work.
  • Higher interval (e.g. 0.05): Fewer switches. Better throughput for CPU-bound single-thread logic. I/O threads may see higher latency.
  • Default (0.005) is almost always correct. Only tune after profiling.

Key point: Even with a lower switch interval, CPU-bound threads will still dominate because they hold the GIL continuously between check points. The switch interval primarily benefits I/O-bound threads that voluntarily release the GIL while waiting for I/O.
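If you do experiment with the interval, it helps to restore the previous value afterwards. The sketch below wraps sys.setswitchinterval() in a context manager; the helper name switch_interval is an assumption made for this example, not part of the standard library.

```python
import sys
from contextlib import contextmanager

@contextmanager
def switch_interval(seconds):
    """Temporarily set the GIL switch interval, restoring it on exit."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        # Runs even if the body raises, so the setting never leaks.
        sys.setswitchinterval(previous)

before = sys.getswitchinterval()
with switch_interval(0.001):
    inside = sys.getswitchinterval()
after = sys.getswitchinterval()

print(inside, after == before)
```

Scoping the change this way keeps a tuning experiment from silently affecting the rest of the program.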

Expected Output
0.005\nTrue\nTrue
Hints

Hint 1: sys.getswitchinterval() returns the current switch interval in seconds.

Hint 2: The default is 0.005 seconds (5 milliseconds) since Python 3.2.

Hint 3: sys.setswitchinterval() lets you change it — lower values mean more frequent GIL releases.


#3 Is counter += 1 Atomic Under the GIL? (Easy)
Tags: GIL, atomicity, dis, bytecode

Use the dis module to prove that counter += 1 is not atomic. Then state whether it is safe to use from multiple threads without a lock.

Python
import dis

counter = 0

def increment():
    global counter
    counter += 1

# Disassemble the function and look for the key bytecodes
bytecodes = []
for instr in dis.get_instructions(increment):
    bytecodes.append(instr.opname)

# Print each of the three critical operations if it is present.
# (BINARY_OP is the Python 3.11+ name; older versions emit INPLACE_ADD.)
for opname in ("LOAD_GLOBAL", "BINARY_OP", "STORE_GLOBAL"):
    if opname in bytecodes:
        print(opname)

# Is counter += 1 atomic under the GIL?
is_atomic = False
print(is_atomic)
Solution
LOAD_GLOBAL
BINARY_OP
STORE_GLOBAL
False

The full disassembly of counter += 1:

import dis

counter = 0

def increment():
    global counter
    counter += 1

dis.dis(increment)

Output (Python 3.11, simplified: offsets and line numbers are omitted, and extra entries such as RESUME vary by version):

LOAD_GLOBAL    0 (counter)
LOAD_CONST     1 (1)
BINARY_OP     13 (+=)
STORE_GLOBAL   0 (counter)
LOAD_CONST     0 (None)
RETURN_VALUE

Why this is not atomic:

The GIL can be dropped between any two bytecodes. Consider two threads executing increment():

Thread A: LOAD_GLOBAL → reads counter = 100
[GIL released — Thread B runs]
Thread B: LOAD_GLOBAL → reads counter = 100
Thread B: BINARY_OP → computes 101
Thread B: STORE_GLOBAL → writes counter = 101
[GIL released — Thread A resumes]
Thread A: BINARY_OP → computes 101 (using stale read of 100)
Thread A: STORE_GLOBAL → writes counter = 101 ← Thread B's increment is LOST

Both threads read 100, both compute 101, the counter ends at 101 instead of 102. One increment was silently lost.
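The interleaving above can be replayed deterministically without real threads. In this sketch each "thread" is a generator that pauses after every step, and a hand-written schedule plays the role of the GIL hand-offs. The names increment_steps, shared, and schedule are illustrative assumptions.

```python
shared = {"counter": 100}

def increment_steps(name, shared):
    """Perform counter += 1 in three steps, pausing after each one."""
    local = shared["counter"]          # corresponds to LOAD_GLOBAL
    yield f"{name} load {local}"
    local = local + 1                  # corresponds to BINARY_OP
    yield f"{name} add {local}"
    shared["counter"] = local          # corresponds to STORE_GLOBAL
    yield f"{name} store {local}"

a = increment_steps("A", shared)
b = increment_steps("B", shared)

# A loads, then B runs all three of its steps, then A finishes.
schedule = [a, b, b, b, a, a]
trace = [next(step) for step in schedule]

for line in trace:
    print(line)
print(shared["counter"])  # 101, not 102: one increment was lost
```

Because the schedule is fixed, the lost update happens on every run, which makes the failure mode easier to study than a real race.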

The fix: Use threading.Lock:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:  # only one thread at a time runs the load/add/store sequence
        counter += 1

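Alternatively, give each thread its own counter (for example in threading.local()) and combine the results at the end. Across processes, use multiprocessing.Value and perform the increment inside "with value.get_lock():", because += on a shared Value is still a read-modify-write.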
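One way to reduce lock traffic is to let each thread count privately and merge its result once. The sketch below uses a plain local variable for the per-thread count; the Tally class and count_events function are hypothetical names for this example.

```python
import threading

class Tally:
    """Shared total that is only ever updated under its lock."""

    def __init__(self):
        self.total = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self.total += amount

def count_events(tally, n):
    local_count = 0              # private to this thread, no lock needed
    for _ in range(n):
        local_count += 1
    tally.add(local_count)       # one short critical section per thread

tally = Tally()
threads = [threading.Thread(target=count_events, args=(tally, 10_000))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(tally.total)  # 80000
```

Each thread takes the lock once instead of 10,000 times, so contention stays low while the final total remains exact.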

Expected Output
LOAD_GLOBAL\nBINARY_OP\nSTORE_GLOBAL\nFalse
Hints

Hint 1: Use the dis module to disassemble a function and see its bytecodes.

Hint 2: counter += 1 compiles to at least 3 separate bytecodes: load, add, store.

Hint 3: The GIL can be released between any two bytecodes — so multi-bytecode operations are NOT atomic.


#4 I/O-Bound Threading — Does the GIL Release During I/O? (Easy)
Tags: GIL, IO-bound, threading, release

Verify that threading speeds up I/O-bound work. This confirms that the GIL releases during I/O operations, allowing real concurrency for network and file tasks.

Python
import threading
import time

SLEEP_DURATION = 0.1   # simulate I/O latency (e.g., a network call)
N_THREADS = 5

def io_task():
    time.sleep(SLEEP_DURATION)   # time.sleep() releases the GIL

# Sequential baseline
start = time.perf_counter()
for _ in range(N_THREADS):
    io_task()
sequential_time = time.perf_counter() - start

# Threaded: threads should run concurrently since GIL is released during sleep
start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - start

print(sequential_time > threaded_time * 2)   # threading is significantly faster
print(threaded_time < SLEEP_DURATION * 2)    # threaded time is close to one sleep duration
Solution
True
True

Why I/O-bound threading works despite the GIL:

When a thread calls a blocking I/O function — time.sleep(), socket.recv(), file.read(), requests.get() — CPython releases the GIL before making the underlying system call. The operating system then suspends the thread (waiting for I/O) and the GIL is free for other threads to acquire immediately.

The timeline with 5 threads sleeping for 0.1s each:

Thread 1: [ACQUIRE GIL] → [sleep → RELEASE GIL] → [waiting 0.1s] → [ACQUIRE GIL]
Thread 2: [ACQUIRE GIL] → [sleep → RELEASE GIL] → [waiting 0.1s]
Thread 3: [ACQUIRE GIL] → ...
Thread 4: [ACQUIRE GIL] → ...
Thread 5: [ACQUIRE GIL] → ...

All 5 threads start their I/O waits nearly simultaneously. Total time is approximately 0.1s plus thread coordination overhead — far less than 5 × 0.1s = 0.5s for sequential execution.
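To see the overlap directly rather than infer it from the total time, each thread can record when its wait started and ended. This is a sketch: the spans list and the 0.05 s sleep are arbitrary choices, and time.sleep() stands in for a real blocking call.

```python
import threading
import time

def io_task(index, spans):
    start = time.perf_counter()
    time.sleep(0.05)                      # stands in for a blocking I/O call
    spans[index] = (start, time.perf_counter())

spans = [None] * 5
threads = [threading.Thread(target=io_task, args=(i, spans)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

latest_start = max(start for start, _ in spans)
earliest_end = min(end for _, end in spans)

# True when every wait began before any wait finished, i.e. all overlapped.
print(latest_start < earliest_end)
```

On an idle machine all five threads typically start well within the 50 ms window, so the check prints True; on a heavily loaded machine thread start-up can be delayed.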

The rule of thumb:

  • I/O-bound (network calls, file reads, DB queries): threading or asyncio are effective.
  • CPU-bound (number crunching, data transformation): multiprocessing or C extensions that release the GIL.

time.sleep() in production: Always releases the GIL. Any Python standard-library I/O function (socket, file, subprocess) also releases the GIL before the system call.

Expected Output
True\nTrue
Hints

Hint 1: Python releases the GIL before making blocking system calls (read, write, socket operations).

Hint 2: This means I/O-bound threads can run truly concurrently — while one thread waits for I/O, another executes Python bytecodes.

Hint 3: For I/O-bound work, threading is effective even with the GIL.


Medium

#5 CPU-Bound Threading vs Multiprocessing Benchmark (Medium)
Tags: GIL, CPU-bound, multiprocessing, concurrent.futures

Benchmark sequential, threaded, and multiprocessing approaches for a CPU-bound task. Measure and explain the performance difference.

import time
import threading
from multiprocessing import Pool

def cpu_task(n):
    total = 0
    for i in range(n):
        total += i
    return total

# Implement run_sequential(), run_threaded(), run_multiprocess()
# then compare timings
Solution
import time
import threading
from multiprocessing import Pool

def cpu_task(n):
    total = 0
    for i in range(n):
        total += i
    return total

N = 5_000_000
WORKERS = 2

def run_sequential():
    start = time.perf_counter()
    cpu_task(N)
    cpu_task(N)
    return time.perf_counter() - start

def run_threaded():
    start = time.perf_counter()
    t1 = threading.Thread(target=cpu_task, args=(N,))
    t2 = threading.Thread(target=cpu_task, args=(N,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return time.perf_counter() - start

def run_multiprocess():
    start = time.perf_counter()
    with Pool(WORKERS) as pool:
        pool.map(cpu_task, [N, N])
    return time.perf_counter() - start

# The guard is required where worker processes are started with "spawn"
# (Windows, macOS): each worker re-imports this module on start-up.
if __name__ == "__main__":
    seq = run_sequential()
    thr = run_threaded()
    mp = run_multiprocess()

    print(f"Sequential: {seq:.3f}s")
    print(f"Threaded: {thr:.3f}s")
    print(f"Multiprocess: {mp:.3f}s")
    print(f"Threading helped: {thr < seq * 0.8}")
    print(f"Multiprocess faster: {mp < seq * 0.8}")

Typical output (on a 2-core machine):

Sequential: 0.420s
Threaded: 0.445s ← slightly SLOWER due to GIL contention
Multiprocess: 0.240s ← ~2x faster
Threading helped: False
Multiprocess faster: True

Why threading is not faster (and sometimes slower) for CPU-bound work:

With two CPU-bound threads:

  1. Thread 1 acquires the GIL and starts executing bytecodes.
  2. After 5ms (the switch interval), a drop request is set.
  3. Thread 1 drops the GIL. Thread 2 wakes up, acquires it, and runs.
  4. Both threads alternate on a single CPU core — no parallelism.
  5. Overhead from lock contention and OS context switches makes it slightly slower than sequential.

Why multiprocessing is faster:

Each process has its own CPython interpreter, its own GIL, and its own OS process. The OS scheduler can run them on separate CPU cores simultaneously. Two processes on two cores give approximately 2x throughput.

The tradeoff:

Approach | Best for | Overhead
threading | I/O-bound (network, DB, file) | Low (shared memory)
multiprocessing | CPU-bound (compute, transform) | High (process spawn + IPC)
concurrent.futures.ProcessPoolExecutor | CPU-bound (cleaner API) | Same as multiprocessing

import time
import threading
from multiprocessing import Pool

def cpu_task(n):
  """A CPU-bound task: sum the integers from 0 to n - 1."""
  total = 0
  for i in range(n):
      total += i
  return total

N = 5_000_000
WORKERS = 2

def run_sequential():
  """Run two cpu_tasks sequentially."""
  start = time.perf_counter()
  cpu_task(N)
  cpu_task(N)
  return time.perf_counter() - start

def run_threaded():
  """Run two cpu_tasks in parallel using threads."""
  start = time.perf_counter()
  t1 = threading.Thread(target=cpu_task, args=(N,))
  t2 = threading.Thread(target=cpu_task, args=(N,))
  t1.start(); t2.start()
  t1.join(); t2.join()
  return time.perf_counter() - start

def run_multiprocess():
  """Run two cpu_tasks in parallel using multiprocessing."""
  start = time.perf_counter()
  with Pool(WORKERS) as pool:
      pool.map(cpu_task, [N, N])
  return time.perf_counter() - start

seq   = run_sequential()
thr   = run_threaded()
mp    = run_multiprocess()

print(f"Sequential:      {seq:.3f}s")
print(f"Threaded:        {thr:.3f}s")
print(f"Multiprocess:    {mp:.3f}s")

# What relationship should hold for CPU-bound work?
print(f"Threading helped: {thr < seq * 0.8}")       # expect False
print(f"Multiprocess faster: {mp < seq * 0.8}")
Expected Output
Threading helped: False\nMultiprocess faster: True
Hints

Hint 1: For CPU-bound tasks, two threads compete for the GIL. Total CPU time is the same as sequential — threads take turns, they do not run in parallel.

Hint 2: Multiprocessing spawns separate processes, each with its own GIL. Two processes can run on two CPU cores simultaneously.

Hint 3: Threaded CPU-bound code can actually be SLOWER than sequential due to GIL contention and context-switch overhead.


#6 Fixing a Race Condition with threading.Lock (Medium)
Tags: GIL, race-condition, threading.Lock, atomicity

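Implement SafeCounter.increment() using a threading.Lock to eliminate the race condition present in UnsafeCounter. Then run a stress test to confirm correctness. Note that the unsafe result is nondeterministic: lost updates depend on where the interpreter happens to switch threads, so on some runs and some CPython versions UnsafeCounter may still reach the expected total. Only SafeCounter is guaranteed to.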

import threading

class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        pass # make this thread-safe
Solution
import threading

class UnsafeCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1

class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

def stress_test(counter_class, n_threads=50, increments_per_thread=10_000):
    counter = counter_class()
    threads = [
        threading.Thread(
            target=lambda: [counter.increment() for _ in range(increments_per_thread)]
        )
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

expected = 50 * 10_000

unsafe_result = stress_test(UnsafeCounter)
safe_result = stress_test(SafeCounter)

print(f"Expected: {expected}")
print(f"Unsafe result: {unsafe_result}")
print(f"Safe result: {safe_result}")
print(f"Unsafe is wrong: {unsafe_result != expected}")
print(f"Safe is correct: {safe_result == expected}")

How the lock makes increment() atomic:

Without lock (UNSAFE):
Thread A: LOAD self.value (reads 499)
[GIL switches to Thread B]
Thread B: LOAD self.value (reads 499) ← stale read!
Thread B: ADD 1 → 500
Thread B: STORE self.value = 500
[GIL switches to Thread A]
Thread A: ADD 1 → 500 ← Thread B's write is OVERWRITTEN
Thread A: STORE self.value = 500 ← should be 501

With lock (SAFE):
Thread A: acquire lock → LOAD → ADD → STORE → release lock
Thread B: [blocked on lock.acquire() until Thread A releases]
Thread B: acquire lock → LOAD → ADD → STORE → release lock

with self._lock: is equivalent to:

self._lock.acquire()
try:
    self.value += 1
finally:
    self._lock.release()

The with form is preferred because it releases the lock even if an exception is raised inside the block.
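A small check of that claim, as a sketch: the function risky_update is hypothetical and raises on purpose inside the critical section. Lock.locked() reports whether the lock is currently held.

```python
import threading

lock = threading.Lock()

def risky_update(lock):
    with lock:
        # Simulated failure while the lock is held.
        raise ValueError("failure inside the critical section")

try:
    risky_update(lock)
except ValueError:
    pass

print(lock.locked())  # False: the with statement released the lock
```

Had the function called lock.acquire() without a try/finally, the lock would still be held here and the next thread to request it would block forever.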

Alternative: threading.RLock (re-entrant lock) — allows the same thread to acquire the lock multiple times without deadlocking. Use when a locked method calls another method that also needs the same lock.
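The re-entrant case can be sketched as follows. The Account class is a hypothetical example in which one locked method calls another locked method on the same object.

```python
import threading

class Account:
    """Hypothetical class: deposit() is called directly and from deposit_many()."""

    def __init__(self):
        self.balance = 0
        self._lock = threading.RLock()

    def deposit(self, amount):
        with self._lock:
            self.balance += amount

    def deposit_many(self, amounts):
        with self._lock:                  # first acquisition by this thread
            for amount in amounts:
                self.deposit(amount)      # same thread acquires again: allowed

account = Account()
account.deposit_many([10, 20, 30])
print(account.balance)  # 60
```

With a plain threading.Lock in place of the RLock, the nested deposit() call would block on a lock its own thread already holds.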

import threading

# Version 1: broken — race condition
class UnsafeCounter:
  def __init__(self):
      self.value = 0

  def increment(self):
      self.value += 1

# Version 2: fixed — use a Lock
class SafeCounter:
  def __init__(self):
      self.value = 0
      self._lock = threading.Lock()

  def increment(self):
      pass  # implement with lock

def stress_test(counter_class, n_threads=50, increments_per_thread=10_000):
  """Run n_threads threads each calling increment() n times.
  Return the final counter value."""
  counter = counter_class()
  threads = [
      threading.Thread(target=lambda: [counter.increment() for _ in range(increments_per_thread)])
      for _ in range(n_threads)
  ]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  return counter.value

expected = 50 * 10_000   # 500_000

unsafe_result = stress_test(UnsafeCounter)
safe_result   = stress_test(SafeCounter)

print(f"Expected:      {expected}")
print(f"Unsafe result: {unsafe_result}")
print(f"Safe result:   {safe_result}")
print(f"Unsafe is wrong: {unsafe_result != expected}")
print(f"Safe is correct: {safe_result == expected}")
Expected Output
Expected:      500000\nSafe is correct: True\nUnsafe is wrong: True
Hints

Hint 1: In SafeCounter.increment(), acquire self._lock before reading and writing self.value.

Hint 2: Use a with statement: "with self._lock:" to ensure the lock is always released.

Hint 3: The lock makes the read-modify-write sequence atomic at the Python level.


#7 C Extension GIL Release — NumPy Parallelism (Medium)
Tags: GIL, C-extension, numpy, GIL-release

Demonstrate that C extensions like NumPy release the GIL, enabling real parallel execution across threads for CPU-bound array operations.

import threading
import time
import numpy as np

# NumPy calls Py_BEGIN_ALLOW_THREADS before its C inner loops
# This releases the GIL — other threads can run while NumPy computes

# Your task: benchmark sequential vs threaded numpy sums
# and confirm that threading provides a speedup
Solution
import threading
import time
import numpy as np

N = 10_000_000
arr1 = np.arange(N, dtype=np.float64)
arr2 = np.arange(N, dtype=np.float64)

def numpy_sum(arr):
    return np.sum(arr)

# Sequential
start = time.perf_counter()
numpy_sum(arr1)
numpy_sum(arr2)
seq_time = time.perf_counter() - start

# Threaded
results = {}
def run(key, arr):
    results[key] = numpy_sum(arr)

start = time.perf_counter()
t1 = threading.Thread(target=run, args=("a", arr1))
t2 = threading.Thread(target=run, args=("b", arr2))
t1.start(); t2.start()
t1.join(); t2.join()
threaded_time = time.perf_counter() - start

print(f"Sequential: {seq_time:.3f}s")
print(f"Threaded: {threaded_time:.3f}s")
print(f"NumPy threading speedup: {seq_time / threaded_time:.2f}x")
print(f"GIL released by C extension: {threaded_time < seq_time * 0.9}")

How C extensions release the GIL:

In CPython's C API, a C extension author wraps their compute loop with two macros:

// Before the compute loop:
Py_BEGIN_ALLOW_THREADS
// Heavy computation here — GIL is NOT held
// Pure C code that doesn't touch Python objects
for (i = 0; i < n; i++) {
    result += array[i];
}
Py_END_ALLOW_THREADS
// GIL is re-acquired here

These macros expand to PyEval_SaveThread() (releases the GIL, saves thread state) and PyEval_RestoreThread() (re-acquires the GIL).

Which libraries release the GIL:

  • NumPy: array operations, matrix multiplications
  • pandas: many Series/DataFrame operations
  • Pillow: image processing operations
  • SQLite3 (stdlib): database queries
  • hashlib: hashing operations
  • zlib/gzip: compression
  • socket: all blocking I/O

Which libraries do NOT release the GIL:

  • Any pure Python library
  • C functions loaded through ctypes.PyDLL (ctypes holds the GIL for the whole call, which is required when the function touches Python objects)

This is why CPU-bound data science workloads can use threading effectively — they are largely running NumPy C code with the GIL released.
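The same effect can be tried without third-party packages. The sketch below hashes several large buffers in a thread pool using hashlib, which the list above names as a GIL-releasing module; per the hashlib documentation this applies to sufficiently large inputs. The buffer size and worker count are arbitrary assumptions, and the example checks correctness only, not speed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(data):
    # The hashing runs in C; for large inputs the GIL is released meanwhile.
    return hashlib.sha256(data).hexdigest()

# Four distinct 4 MB buffers.
buffers = [bytes([i]) * 4_000_000 for i in range(4)]

sequential = [digest(b) for b in buffers]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))

print(threaded == sequential)  # True: threading does not change the results
```

Wrapping both variants in time.perf_counter() measurements, as in the NumPy benchmark, shows whether the threaded version gains anything on a given machine.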

import threading
import time

# Simulate what a C extension like NumPy does internally:
# pure Python sum (holds GIL the whole time)
def python_sum(n):
  return sum(range(n))

# A real C extension would release the GIL during computation.
# We can simulate this with ctypes.CDLL or just use numpy directly.
# For this exercise, use numpy to demonstrate parallel CPU work.
try:
  import numpy as np

  def numpy_sum(arr):
      """NumPy's sum releases the GIL during its C inner loop."""
      return np.sum(arr)

  N = 10_000_000
  arr1 = np.arange(N, dtype=np.float64)
  arr2 = np.arange(N, dtype=np.float64)

  # Sequential numpy sums
  start = time.perf_counter()
  r1 = numpy_sum(arr1)
  r2 = numpy_sum(arr2)
  seq_time = time.perf_counter() - start

  # Threaded numpy sums — should be faster because numpy releases the GIL
  results = {}
  def run(key, arr):
      results[key] = numpy_sum(arr)

  start = time.perf_counter()
  t1 = threading.Thread(target=run, args=("a", arr1))
  t2 = threading.Thread(target=run, args=("b", arr2))
  t1.start(); t2.start()
  t1.join(); t2.join()
  threaded_time = time.perf_counter() - start

  print(f"Sequential:  {seq_time:.3f}s")
  print(f"Threaded:    {threaded_time:.3f}s")
  print(f"NumPy threading speedup: {seq_time / threaded_time:.2f}x")
  print(f"GIL released by C extension: {threaded_time < seq_time * 0.9}")

except ImportError:
  print("numpy not installed — concept demonstration:")
  print("C extensions release the GIL by calling Py_BEGIN_ALLOW_THREADS")
  print("This allows multiple threads to run C code in parallel")
  print("GIL released by C extension: True")
Expected Output
GIL released by C extension: True
Hints

Hint 1: C extensions explicitly release the GIL using the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros.

Hint 2: NumPy releases the GIL during its inner loops, so threading can achieve real parallelism for array operations.

Hint 3: This is different from pure Python code — a NumPy sum running in two threads can truly run on two CPU cores.


#8 ThreadPoolExecutor vs ProcessPoolExecutor — Choosing the Right Tool (Medium)
Tags: concurrent.futures, ThreadPoolExecutor, ProcessPoolExecutor, GIL

Benchmark ThreadPoolExecutor and ProcessPoolExecutor for both I/O-bound and CPU-bound tasks. Confirm that the correct executor matches the workload type.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# I/O-bound: use ThreadPoolExecutor
# CPU-bound: use ProcessPoolExecutor
Solution
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fetch_url(url):
    time.sleep(0.05)
    return f"fetched {url}"

def compute_hash(n):
    h = hashlib.sha256()
    for i in range(n):
        h.update(str(i).encode())
    return h.hexdigest()

URLS = [f"https://example.com/{i}" for i in range(20)]
HASH_INPUTS = [50_000] * 4

# The guard is required where worker processes are started with "spawn"
# (Windows, macOS): each worker re-imports this module on start-up.
if __name__ == "__main__":
    # I/O benchmark
    start = time.perf_counter()
    for url in URLS:
        fetch_url(url)
    io_seq = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=10) as ex:
        list(ex.map(fetch_url, URLS))
    io_thread = time.perf_counter() - start

    # CPU benchmark
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(compute_hash, HASH_INPUTS))
    cpu_thread = time.perf_counter() - start

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(compute_hash, HASH_INPUTS))
    cpu_process = time.perf_counter() - start

    print(f"I/O - Sequential: {io_seq:.3f}s")
    print(f"I/O - ThreadPool: {io_thread:.3f}s")
    print(f"CPU - ThreadPool: {cpu_thread:.3f}s")
    print(f"CPU - ProcessPool: {cpu_process:.3f}s")
    print(f"I/O benefits from threading: {io_thread < io_seq * 0.5}")
    print(f"CPU benefits from ProcessPool: {cpu_process < cpu_thread * 0.8}")

Decision matrix:

Workload type | Best tool | Why
I/O-bound (network, DB, file) | ThreadPoolExecutor | GIL released during I/O; low thread overhead
CPU-bound (math, parsing) | ProcessPoolExecutor | Separate GILs per process; real CPU parallelism
Async I/O at scale | asyncio | One thread, no GIL overhead, millions of concurrent waits
C extension compute | ThreadPoolExecutor | Extension releases GIL; threads run in parallel

When ProcessPoolExecutor has high overhead:

  • Process startup takes ~50–100ms on first use (pool creation mitigates this).
  • Data passed to/from workers must be picklable.
  • No shared memory by default — workers communicate via IPC queues.

Rule of thumb: If your task spends more time waiting for I/O than computing, use threads. If it spends more time computing than waiting, use processes (or a GIL-releasing C extension + threads).
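The rule of thumb can be written down as a tiny helper. This is a sketch only: executor_class_for and wait_then_echo are hypothetical names, and deciding whether a task is I/O-bound is left to the caller (ideally after profiling).

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def executor_class_for(io_bound):
    """Pick the executor class that matches the workload type."""
    return ThreadPoolExecutor if io_bound else ProcessPoolExecutor

def wait_then_echo(x):
    time.sleep(0.01)        # stands in for a short blocking I/O call
    return x

executor_cls = executor_class_for(io_bound=True)
with executor_cls(max_workers=4) as ex:
    results = list(ex.map(wait_then_echo, range(8)))

print(executor_cls.__name__, results)
# ThreadPoolExecutor [0, 1, 2, 3, 4, 5, 6, 7]
```

Both executors share the same submit() and map() interface, so switching workload type changes only the class that is instantiated. When the process-based executor is chosen, the calling code should sit under an "if __name__ == '__main__':" guard.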

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# I/O-bound task: simulated with sleep
def fetch_url(url):
  """Simulate an HTTP request with a 0.05s delay."""
  time.sleep(0.05)
  return f"fetched {url}"

# CPU-bound task
def compute_hash(n):
  """Simulate CPU work by hashing n iterations."""
  import hashlib
  h = hashlib.sha256()
  for i in range(n):
      h.update(str(i).encode())
  return h.hexdigest()

URLS = [f"https://example.com/{i}" for i in range(20)]
HASH_INPUTS = [50_000] * 4

def benchmark_io():
  """Compare sequential, ThreadPool, and ProcessPool for I/O-bound work."""
  # Sequential
  start = time.perf_counter()
  for url in URLS:
      fetch_url(url)
  seq = time.perf_counter() - start

  # ThreadPoolExecutor (recommended for I/O)
  start = time.perf_counter()
  with ThreadPoolExecutor(max_workers=10) as ex:
      list(ex.map(fetch_url, URLS))
  thread = time.perf_counter() - start

  return seq, thread

def benchmark_cpu():
  """Compare ThreadPool and ProcessPool for CPU-bound work."""
  # ThreadPoolExecutor (NOT recommended for CPU)
  start = time.perf_counter()
  with ThreadPoolExecutor(max_workers=4) as ex:
      list(ex.map(compute_hash, HASH_INPUTS))
  thread = time.perf_counter() - start

  # ProcessPoolExecutor (recommended for CPU)
  start = time.perf_counter()
  with ProcessPoolExecutor(max_workers=4) as ex:
      list(ex.map(compute_hash, HASH_INPUTS))
  process = time.perf_counter() - start

  return thread, process

io_seq, io_thread = benchmark_io()
cpu_thread, cpu_process = benchmark_cpu()

print(f"I/O - Sequential:  {io_seq:.3f}s")
print(f"I/O - ThreadPool:  {io_thread:.3f}s")
print(f"CPU - ThreadPool:  {cpu_thread:.3f}s")
print(f"CPU - ProcessPool: {cpu_process:.3f}s")

print(f"I/O benefits from threading: {io_thread < io_seq * 0.5}")
print(f"CPU benefits from ProcessPool: {cpu_process < cpu_thread * 0.8}")
Expected Output
I/O benefits from threading: True\nCPU benefits from ProcessPool: True
Hints

Hint 1: ThreadPoolExecutor is efficient for I/O because threads wait for I/O with the GIL released — other threads make progress.

Hint 2: ThreadPoolExecutor does NOT help for CPU-bound work because all threads compete for the same GIL.

Hint 3: ProcessPoolExecutor spawns real OS processes — each gets its own GIL and can use a separate CPU core.


Hard

#9 ctypes nogil — Releasing the GIL from Pure Python (Hard)
Tags: ctypes, GIL, nogil, C-extension

Use ctypes to call a C library function with the GIL released, and confirm that multiple threads can execute the C function in parallel. Then explain the difference between ctypes.CDLL and ctypes.PyDLL.

import ctypes
import threading
import time

# ctypes.CDLL releases the GIL before each C call
# ctypes.PyDLL holds the GIL during each C call

# Demonstrate GIL release by running N parallel ctypes calls
# and showing the total time is close to one call's duration
Solution
import ctypes
import sys
import threading
import time

if sys.platform == "darwin":
    libc = ctypes.CDLL("libc.dylib")
else:
    libc = ctypes.CDLL("libc.so.6")

def ctypes_sleep_ms(ms):
    libc.usleep(ms * 1000)

def python_sleep_ms(ms):
    time.sleep(ms / 1000)

N_THREADS = 4
SLEEP_MS = 50

def run_parallel(sleep_fn):
    start = time.perf_counter()
    threads = [threading.Thread(target=sleep_fn, args=(SLEEP_MS,))
               for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

ctypes_time = run_parallel(ctypes_sleep_ms)
python_time = run_parallel(python_sleep_ms)
sequential_time = SLEEP_MS * N_THREADS / 1000

print(f"Sequential (expected): {sequential_time:.3f}s")
print(f"ctypes parallel: {ctypes_time:.3f}s")
print(f"Python parallel: {python_time:.3f}s")
print(f"ctypes releases GIL: {ctypes_time < sequential_time * 0.5}")
print(f"Python sleep releases GIL: {python_time < sequential_time * 0.5}")

CDLL vs PyDLL — The Key Difference:

import ctypes

# ctypes.CDLL — releases the GIL before each C call (DEFAULT)
# Safe for pure C functions that do not touch Python objects
libc = ctypes.CDLL("libc.so.6") # GIL released during C calls

# ctypes.PyDLL — holds the GIL during each C call
# Required when the C function calls back into the Python C API
# (e.g., calls PyObject_GetAttr, PyList_Append, etc.)
pylib = ctypes.PyDLL("libc.so.6") # GIL held during C calls

When to use PyDLL: Only when the C function being called accesses Python objects via the C API (Py_INCREF, PyList_Append, etc.). Such a function must run with the GIL held. If it is loaded through CDLL instead, ctypes releases the GIL before the call, and the function then manipulates interpreter state without protection, which can corrupt reference counts or crash the process.
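The standard library already ships one PyDLL handle: ctypes.pythonapi, which exposes the C API of the running interpreter. The sketch below calls PyLong_FromLong through it; that function creates a Python object, so it is the kind of call that needs the GIL held. The choice of function and the variable names are only for illustration.

```python
import ctypes

# ctypes.pythonapi is a PyDLL instance, so the GIL stays held during calls.
from_long = ctypes.pythonapi.PyLong_FromLong
from_long.argtypes = [ctypes.c_long]
from_long.restype = ctypes.py_object   # treat the result as a Python object

value = from_long(42)

print(isinstance(ctypes.pythonapi, ctypes.PyDLL))  # True
print(value)                                       # 42
```

Calling the same function through a CDLL handle would run it with the GIL released, which is exactly the unsafe situation described above.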

Practical application — writing a C extension that releases the GIL:

// In your .c extension:
static PyObject* heavy_compute(PyObject* self, PyObject* args) {
    int n;
    if (!PyArg_ParseTuple(args, "i", &n)) {
        return NULL;  // bad arguments: the exception is already set
    }

    double result;
    Py_BEGIN_ALLOW_THREADS // release GIL here
    result = expensive_c_computation(n);
    Py_END_ALLOW_THREADS // re-acquire GIL here

    return PyFloat_FromDouble(result);
}

This is the same pattern that libraries such as NumPy, SciPy, and pandas rely on for many of their compute kernels: they release the GIL around pure C work, which enables threading-based parallelism for data-science workloads.

import ctypes
import threading
import time

# ctypes can call C functions that release the GIL.
# When you call a ctypes function, ctypes does NOT hold the GIL
# during the C function's execution (by default for most cases).
# We can demonstrate this by calling a C sleep function via ctypes
# and showing that other Python threads run concurrently.

# Load libc (standard C library)
import sys
if sys.platform == "darwin":
  libc = ctypes.CDLL("libc.dylib")
else:
  libc = ctypes.CDLL("libc.so.6")

# C's usleep takes microseconds
# When called via ctypes, the GIL is released during the call
def ctypes_sleep_ms(ms):
  """Sleep using C's usleep — releases the GIL during the sleep."""
  libc.usleep(ms * 1000)

def python_sleep_ms(ms):
  """Sleep using Python's time.sleep — also releases the GIL."""
  time.sleep(ms / 1000)

N_THREADS = 4
SLEEP_MS  = 50

def run_parallel(sleep_fn):
  """Run N_THREADS threads each calling sleep_fn(SLEEP_MS)."""
  start = time.perf_counter()
  threads = [threading.Thread(target=sleep_fn, args=(SLEEP_MS,))
             for _ in range(N_THREADS)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  return time.perf_counter() - start

ctypes_time = run_parallel(ctypes_sleep_ms)
python_time = run_parallel(python_sleep_ms)
sequential_time = SLEEP_MS * N_THREADS / 1000   # expected if no parallelism

print(f"Sequential (expected): {sequential_time:.3f}s")
print(f"ctypes parallel:       {ctypes_time:.3f}s")
print(f"Python parallel:       {python_time:.3f}s")

# Both ctypes and time.sleep release the GIL — threads run concurrently
print(f"ctypes releases GIL: {ctypes_time < sequential_time * 0.5}")
print(f"Python sleep releases GIL: {python_time < sequential_time * 0.5}")
Expected Output
ctypes releases GIL: True\nPython sleep releases GIL: True
Hints

Hint 1: ctypes releases the GIL before calling the underlying C function by default.

Hint 2: You can control this behavior using ctypes.CDLL vs ctypes.PyDLL — PyDLL holds the GIL.

Hint 3: libc.usleep() is a blocking C call — with GIL released, all N_THREADS can sleep concurrently.
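
To make Hint 2 concrete, here is a minimal sketch that times the same C sleep through both loaders. It assumes a POSIX system (where passing None to the loader exposes the libc symbols already loaded into the process) and a standard GIL build of CPython; the names timed_run, libc_releasing and libc_holding are illustrative, not part of the exercise.

```python
import ctypes
import threading
import time

# Assumption: POSIX. CDLL(None) / PyDLL(None) expose symbols already loaded
# into the process, including libc's usleep.
libc_releasing = ctypes.CDLL(None)   # releases the GIL around each foreign call
libc_holding = ctypes.PyDLL(None)    # keeps the GIL held during the call

N_THREADS = 4
SLEEP_US = 50_000  # 50 ms per thread

def timed_run(lib):
    """Time N_THREADS threads that each call lib.usleep(SLEEP_US)."""
    threads = [threading.Thread(target=lib.usleep, args=(SLEEP_US,))
               for _ in range(N_THREADS)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

released = timed_run(libc_releasing)
held = timed_run(libc_holding)
print(f"CDLL  (GIL released): {released:.3f}s")
print(f"PyDLL (GIL held):     {held:.3f}s")
```

On a GIL build the PyDLL run should take roughly N_THREADS times longer, because each thread sleeps while still holding the GIL and the sleeps are serialized.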


#10 Detecting GIL Contention with a Heartbeat Monitor (Hard)
GIL, contention, monitoring, threading

Build a GILContentionMonitor that measures GIL contention by tracking delays in a high-frequency heartbeat thread. Use it to quantify the impact of CPU-bound threads competing for the GIL.

import threading
import time

class GILContentionMonitor:
    TARGET_INTERVAL = 0.001  # 1ms heartbeat

    def _heartbeat(self):
        # sleep TARGET_INTERVAL and measure actual elapsed time
        pass

    def start(self): pass
    def stop(self): pass
    def delay_ratio(self): pass
Solution
import threading
import time

class GILContentionMonitor:
    TARGET_INTERVAL = 0.001

    def __init__(self):
        self._running = False
        self._beats = 0
        self._total_delay = 0.0
        self._thread = None

    def _heartbeat(self):
        while self._running:
            t0 = time.perf_counter()
            time.sleep(self.TARGET_INTERVAL)
            elapsed = time.perf_counter() - t0
            self._total_delay += elapsed
            self._beats += 1

    def start(self):
        self._running = True
        self._beats = 0
        self._total_delay = 0.0
        self._thread = threading.Thread(target=self._heartbeat, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        self._thread.join()
        return self.delay_ratio()

    def delay_ratio(self):
        if self._beats == 0:
            return 1.0
        avg_interval = self._total_delay / self._beats
        return avg_interval / self.TARGET_INTERVAL


def simulate_contention(duration=0.5):
    stop_event = threading.Event()

    def cpu_hog():
        while not stop_event.is_set():
            _ = sum(range(10_000))

    hogs = [threading.Thread(target=cpu_hog) for _ in range(3)]

    monitor = GILContentionMonitor()
    monitor.start()
    for h in hogs:
        h.start()

    time.sleep(duration)

    stop_event.set()
    for h in hogs:
        h.join()
    ratio_with_contention = monitor.stop()

    monitor2 = GILContentionMonitor()
    monitor2.start()
    time.sleep(duration)
    ratio_baseline = monitor2.stop()

    return ratio_baseline, ratio_with_contention

baseline, contended = simulate_contention()
print(f"Baseline delay ratio: {baseline:.2f}x")
print(f"Contended delay ratio: {contended:.2f}x")
print(f"GIL contention detected: {contended > baseline * 1.5}")

Why this works as a contention detector:

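time.sleep(0.001) releases the GIL while sleeping. When the OS wakes the heartbeat thread after 1ms, it must re-acquire the GIL before executing any Python code. If 3 CPU-bound threads are fighting for the GIL, the heartbeat thread has to wait until the current holder gives it up. By default CPython only asks a running thread to drop the GIL after a 5 ms switch interval (see sys.getswitchinterval()), so each wake-up can stall for several milliseconds.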

The actual elapsed time per heartbeat = sleep time + GIL wait time. When GIL wait time is significant, the delay ratio rises above 1.0.
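
The arithmetic behind the ratio can be sketched on its own. The sample intervals below are hypothetical numbers chosen for illustration, not measurements, and the standalone delay_ratio function simply mirrors the method of the same name in the solution.

```python
import sys

TARGET_INTERVAL = 0.001  # same 1 ms target as the monitor

def delay_ratio(intervals, target=TARGET_INTERVAL):
    """Average observed interval divided by the target interval."""
    if not intervals:
        return 1.0
    return (sum(intervals) / len(intervals)) / target

# Hypothetical samples in seconds: each value is sleep time plus GIL wait time.
uncontended = [0.0011, 0.0010, 0.0012, 0.0011]
contended = [0.0061, 0.0058, 0.0063, 0.0060]

ratio_idle = delay_ratio(uncontended)
ratio_busy = delay_ratio(contended)

# How long a running thread may keep the GIL before being asked to release it.
switch_interval = sys.getswitchinterval()

print(f"idle ratio: {ratio_idle:.2f}x, busy ratio: {ratio_busy:.2f}x")
print(f"switch interval: {switch_interval * 1000:.1f} ms")
```

With a 1 ms target, GIL waits on the order of the switch interval dominate each heartbeat, which is why the contended ratio lands well above the 1.5x threshold.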

Production use case:

Heartbeat and watchdog probes of this kind appear in production tooling (for example, gevent's optional monitor thread watches for a blocked event loop) and can be adapted to detect GIL hot spots. The typical symptoms of GIL contention in production:

  • Request latency p99/p999 spikes under CPU load
  • top shows the Python process pinned near 100% CPU (one core saturated) even on a multi-core machine
  • Adding more threads increases latency instead of throughput

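The diagnostic rule: if delay_ratio consistently exceeds 1.5x, your CPU-bound threads are starving I/O threads. The fix: move CPU work to a ProcessPoolExecutor or to a C extension that releases the GIL. Restructuring as async helps only with I/O-bound waiting; a CPU-bound coroutine still blocks the event loop, so CPU work must be offloaded there too.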
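
As a minimal sketch of the first fix, CPU-bound work can be handed to worker processes, each of which has its own interpreter and its own GIL. The function names cpu_task and run_in_processes and the workload are hypothetical stand-ins, not part of the exercise.

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_task(n):
    """CPU-bound pure-Python work (hypothetical stand-in for a real workload)."""
    return sum(i * i for i in range(n))

def run_in_processes(sizes, max_workers=2):
    """Run cpu_task for each size in separate processes and collect the results."""
    # Each worker process has its own GIL, so the tasks can use multiple cores
    # and do not compete with the parent's I/O threads for the GIL.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cpu_task, sizes))
```

Call run_in_processes([100_000, 200_000]) from under an if __name__ == "__main__": guard, since some platforms start workers by re-importing the main module. Arguments and results are pickled between processes, so this pays off only when the computation outweighs the transfer cost.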

import threading
import time
from collections import deque

class GILContentionMonitor:
  """
  Measures GIL contention by running a high-frequency heartbeat thread.
  The heartbeat thread increments a counter every TARGET_INTERVAL seconds.
  If the GIL is held by another thread, the heartbeat is delayed.

  A high 'delay ratio' means the GIL is frequently contended.
  """

  TARGET_INTERVAL = 0.001   # 1ms heartbeat target

  def __init__(self):
      self._running = False
      self._beats = 0
      self._total_delay = 0.0
      self._thread = None

  def _heartbeat(self):
      """Run the heartbeat loop."""
      pass  # implement: loop while _running, sleep TARGET_INTERVAL,
            # record actual elapsed time, accumulate delay

  def start(self):
      """Start the heartbeat monitor thread."""
      pass

  def stop(self):
      """Stop the monitor and return the average delay ratio."""
      pass

  def delay_ratio(self):
      """Return avg actual interval / target interval.
      1.0 = no contention, 2.0 = 2x slower than target (high contention).
      """
      if self._beats == 0:
          return 1.0
      avg_interval = self._total_delay / self._beats
      return avg_interval / self.TARGET_INTERVAL


def simulate_contention(duration=0.5):
  """Simulate GIL contention with CPU-bound threads."""
  stop_event = threading.Event()

  def cpu_hog():
      while not stop_event.is_set():
          _ = sum(range(10_000))   # hold the GIL continuously

  hogs = [threading.Thread(target=cpu_hog) for _ in range(3)]

  monitor = GILContentionMonitor()
  monitor.start()
  for h in hogs:
      h.start()

  time.sleep(duration)

  stop_event.set()
  for h in hogs:
      h.join()
  ratio_with_contention = monitor.stop()

  # Now measure baseline (no contention)
  monitor2 = GILContentionMonitor()
  monitor2.start()
  time.sleep(duration)
  ratio_baseline = monitor2.stop()

  return ratio_baseline, ratio_with_contention

baseline, contended = simulate_contention()
print(f"Baseline delay ratio:   {baseline:.2f}x")
print(f"Contended delay ratio:  {contended:.2f}x")
print(f"GIL contention detected: {contended > baseline * 1.5}")
Expected Output
GIL contention detected: True
Hints

Hint 1: _heartbeat should record the actual elapsed time for each sleep(TARGET_INTERVAL) call and add it to _total_delay.

Hint 2: When the GIL is heavily contested, the heartbeat thread waits to re-acquire the GIL after waking up, causing its actual interval to exceed TARGET_INTERVAL.

Hint 3: stop() should set _running = False, join the thread, and return delay_ratio().

Hint 4: Use time.perf_counter() for precise timing measurements.


#11 Python 3.13 Free-Threaded Mode: Concepts and Tradeoffs (Hard)
GIL, free-threaded, Python 3.13, PEP 703

Explore Python 3.13 free-threaded mode (PEP 703). Implement an is_free_threaded() detector, then show that a threading.Lock makes a counter safe in both standard and free-threaded Python.

import sys
import threading

def is_free_threaded():
    """Detect if running with the GIL disabled (python3.13t, or -X gil=0 on that build)."""
    pass

# Demonstrate that:
# 1. Unsafe counter can lose updates, far more readily in free-threaded mode
# 2. Lock-protected counter is always correct
Solution
import sys
import threading

def is_free_threaded():
    # sys._is_gil_enabled() was added in Python 3.13 and reports whether
    # the GIL is active right now; older versions always have the GIL.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    if gil_check is None:
        return False
    return not gil_check()

def demonstrate_counter_race():
    counter = 0
    N_THREADS = 10
    INCREMENTS = 100_000

    def unsafe_increment():
        nonlocal counter
        for _ in range(INCREMENTS):
            counter += 1

    threads = [threading.Thread(target=unsafe_increment) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    expected = N_THREADS * INCREMENTS
    return expected, counter, expected - counter

def demonstrate_counter_safe():
    counter = 0
    lock = threading.Lock()
    N_THREADS = 10
    INCREMENTS = 100_000

    def safe_increment():
        nonlocal counter
        for _ in range(INCREMENTS):
            with lock:
                counter += 1

    threads = [threading.Thread(target=safe_increment) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print(f"Free-threaded mode: {is_free_threaded()}")

expected, actual, lost = demonstrate_counter_race()
safe_result = demonstrate_counter_safe()

print(f"Unsafe - Expected: {expected}, Got: {actual}, Lost: {lost}")
print(f"Safe - Expected: {expected}, Got: {safe_result}")
print(f"Safe counter is correct: {safe_result == expected}")
print(f"Lock required for correctness: True")

PEP 703 — Making the GIL Optional (Python 3.13):

Python 3.13 ships two builds:

  • python3.13 — standard CPython with the GIL (default, fully stable)
  • python3.13t — free-threaded build (experimental, GIL disabled)
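
A short sketch of how a script can tell the two builds apart. It relies on two documented hooks, the Py_GIL_DISABLED build variable exposed through sysconfig and sys._is_gil_enabled() (added in 3.13); the wrapper function names are illustrative.

```python
import sys
import sysconfig

def build_supports_free_threading():
    """True if this interpreter was compiled as a free-threaded build."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

def gil_is_enabled():
    """True if the GIL is active right now in this process."""
    # sys._is_gil_enabled() exists only on Python 3.13+; older versions
    # always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()

print(f"Free-threaded build: {build_supports_free_threading()}")
print(f"GIL enabled now:     {gil_is_enabled()}")
```

The two answers can differ: a free-threaded build can still run with the GIL turned back on, so the build check alone does not say whether threads currently run in parallel.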

What changes in free-threaded mode:

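| Aspect | Standard Python (GIL) | Free-Threaded (no GIL) |
|---|---|---|
| counter += 1 | might lose updates | loses updates far more readily |
| list.append() | single call is atomic | single call still atomic (per-object lock) |
| dict[key] = val | single call is atomic | single call still atomic (per-object lock) |
| Lock-protected code | safe | safe |
| NumPy parallel threads | GIL released anyway | GIL no longer needed |
| Pure Python parallelism | impossible | possible |

Compound operations (check-then-act, read-modify-write) on shared data need a lock in both builds.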
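
The container rows can be checked empirically with a small sketch. It relies on CPython's documented behavior that built-in containers protect their own internal state (through the GIL on standard builds, through per-object locks on free-threaded builds); the names shared and appender are illustrative.

```python
import threading

N_THREADS = 8
PER_THREAD = 10_000

shared = []  # one list shared by every thread

def appender():
    for i in range(PER_THREAD):
        shared.append(i)  # a single built-in call, no explicit lock

threads = [threading.Thread(target=appender) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"items stored: {len(shared)} of {N_THREADS * PER_THREAD}")
```

No appends go missing because each append is one internally synchronized operation. That guarantee covers only the single call: a sequence such as "if x not in shared: shared.append(x)" is still a race and needs a lock.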

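The performance tradeoff: Without the GIL, CPython has to protect object state some other way: reference counts are maintained with biased reference counting and atomic operations, and built-in containers such as list and dict carry per-object locks. In the 3.13 free-threaded build the specializing adaptive interpreter (PEP 659) is also disabled, which accounts for much of the slowdown. The CPython documentation puts the single-threaded overhead for 3.13 at roughly 40% on the pyperformance suite. The speedup from thread parallelism must exceed this baseline regression to be worthwhile.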

When free-threaded Python makes sense:

  • Workloads with true thread-level parallelism in pure Python (currently impossible with the GIL)
  • Multi-threaded servers where request handlers are CPU-bound pure Python
  • Scientific computing where the GIL release pattern of C extensions is not sufficient

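Current status (Python 3.13): Experimental. When 3.13 shipped, many third-party packages, especially those with C extensions (NumPy, pandas, SQLAlchemy), had not yet been fully validated for thread safety without the GIL. Importing a C extension that is not explicitly marked as supporting free threading re-enables the GIL for the process. Expect instability. Python 3.14+ continues hardening the free-threaded build toward stable status.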

import sys
import threading
import time

# Python 3.13 introduced an experimental free-threaded build (PEP 703).
# It is a separate executable: python3.13t (the 't' build).
# On that build the GIL is off by default; PYTHON_GIL=1 or -X gil=1
# turns it back on, and PYTHON_GIL=0 or -X gil=0 forces it off.
#
# In free-threaded mode:
# - The GIL is disabled
# - True parallel execution of Python threads is possible
# - BUT: races that the GIL used to make rare become common, so
#   shared mutable state needs explicit locks
#
# This exercise tests your understanding of what changes in free-threaded mode.

def is_free_threaded():
  """Return True if running in Python 3.13+ free-threaded mode."""
  # sys._is_gil_enabled() was added in Python 3.13 and reports whether
  # the GIL is active right now; older versions always have the GIL.
  gil_check = getattr(sys, "_is_gil_enabled", None)
  if gil_check is None:
      return False
  return not gil_check()

def demonstrate_counter_race():
  """
  In free-threaded Python, counter += 1 with no locks is very likely to lose updates.
  In standard Python with the GIL, it MIGHT lose updates (non-deterministic).

  Returns: (expected, actual, lost_updates)
  """
  counter = 0
  N_THREADS = 10
  INCREMENTS = 100_000

  def unsafe_increment():
      nonlocal counter
      for _ in range(INCREMENTS):
          counter += 1

  threads = [threading.Thread(target=unsafe_increment) for _ in range(N_THREADS)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()

  expected = N_THREADS * INCREMENTS
  lost = expected - counter
  return expected, counter, lost

def demonstrate_counter_safe():
  """Same counter but with a Lock — safe in ALL Python versions."""
  counter = 0
  lock = threading.Lock()
  N_THREADS = 10
  INCREMENTS = 100_000

  def safe_increment():
      nonlocal counter
      for _ in range(INCREMENTS):
          with lock:
              counter += 1

  threads = [threading.Thread(target=safe_increment) for _ in range(N_THREADS)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()

  return counter

print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print(f"Free-threaded mode: {is_free_threaded()}")

expected, actual, lost = demonstrate_counter_race()
safe_result = demonstrate_counter_safe()

print(f"Unsafe - Expected: {expected}, Got: {actual}, Lost: {lost}")
print(f"Safe   - Expected: {expected}, Got: {safe_result}")
print(f"Safe counter is correct: {safe_result == expected}")
print(f"Lock required for correctness: True")
Expected Output
Safe counter is correct: True
Lock required for correctness: True
Hints

Hint 1: In Python 3.13 free-threaded mode, the GIL is disabled. Multiple threads can truly run Python bytecodes in parallel.

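Hint 2: Without the GIL, threads really do execute counter += 1 at the same time, so an unlocked counter loses updates far more readily than under the GIL. It is still a race, so the exact number lost varies from run to run.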

Hint 3: The fix is the same in all Python versions: use threading.Lock to make the read-modify-write atomic.

Hint 4: sys._is_gil_enabled() (Python 3.13+) tells you whether the GIL is currently active; in free-threaded mode it returns False.
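
To see why Hint 3 calls counter += 1 a read-modify-write, the statement can be disassembled. This is a small sketch using the standard dis module; the exact instruction names between the load and the store vary across CPython versions, so the code only looks at where the load and the store fall.

```python
import dis

counter = 0

def increment():
    global counter
    counter += 1

# List the bytecode instruction names that make up increment().
opnames = [ins.opname for ins in dis.get_instructions(increment)]
load_at = opnames.index("LOAD_GLOBAL")    # read the current value
store_at = opnames.index("STORE_GLOBAL")  # write the new value back

print(opnames[load_at:store_at + 1])
```

The read and the write are separate instructions with the addition in between, so another thread can update counter after this thread has read it and before it stores the result. Wrapping the statement in "with lock:" makes the whole sequence mutually exclusive.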

© 2026 EngineersOfAI. All rights reserved.