
How Python Works Internally

A Production Mystery

It is 2:47 AM. Your ML training job has been running for six hours on an 8-core machine. You check htop and notice something disturbing: all eight cores are sitting at roughly 12% utilization each. The job is using exactly one core's worth of computation, spread thinly across eight. Your colleague added threading to "parallelize" the data preprocessing pipeline three weeks ago. Nobody questioned it. The threads are running, technically. But your dataset loading is just as slow as before, and your training job will still miss the deadline.

This is not a bug in your code. This is the GIL - the Global Interpreter Lock - doing exactly what it was designed to do. To understand why Python behaves this way, you need to understand what Python actually is: not a language in the abstract sense, but a specific implementation called CPython, written in C, with decades of architectural decisions baked into it. The GIL is one of those decisions. It looks like a mistake from the outside. From the inside, it is an elegant solution to a genuinely hard problem.

Most Python engineers treat their runtime as a black box. Code goes in, results come out. This works fine until it doesn't - until you hit a performance cliff, a mysterious memory leak, a subtle concurrency bug, or a profiler output that makes no sense. At that point, the engineers who understand CPython's internals can diagnose and fix the problem in an hour. Everyone else spends a week guessing.

This lesson removes the black box. We will trace a Python program from the characters you type to the machine instructions that execute. We will look at the CPython virtual machine's main evaluation loop, understand how Python manages memory without you having to think about it, and explain why the GIL exists at a level of detail that lets you predict its behavior. You will come out of this lesson understanding not just what Python does, but why - and that distinction matters enormously when you are debugging production systems.

The production implications are everywhere. Why does multiprocessing outperform threading for CPU-bound work? Why does del x not always free memory? Why do circular references leak? Why does importing a module the first time take 50ms but subsequent imports take microseconds? All of these questions have precise, mechanical answers rooted in CPython's architecture. We will get to all of them.

Why This Exists - The Problem Before CPython

Before CPython, there was no Python at all. Guido van Rossum created Python in the late 1980s as a scripting language for the Amoeba operating system. The implementation choice - a bytecode-compiled, interpreter-executed design - was not arbitrary. It was the right tradeoff for 1990s hardware and the language's intended use cases.

The alternative would have been direct compilation to machine code (like C or Fortran). That approach is fast, but it requires a type system rigid enough for the compiler to make decisions ahead of time. Python's dynamic typing - where any variable can hold any type at any moment - makes ahead-of-time compilation extraordinarily difficult. You simply cannot know at compile time whether x + y will call int.__add__, float.__add__, str.__add__, or some custom __add__ on a user-defined class.

The bytecode interpreter design solves this by deferring all type decisions to runtime. The compiler translates Python source into a simple instruction set (bytecode). The interpreter executes those instructions one at a time, checking types and dispatching to the right implementation at each step. This is slower than compiled code but orders of magnitude more flexible. It is the right design for a language that prioritizes developer productivity over raw throughput.
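
You can see this deferral directly: one compiled function, one bytecode sequence, and the dispatch target is chosen fresh on every call. A minimal illustration:

def combine(x, y):
    return x + y              # one bytecode sequence, many behaviors

print(combine(2, 3))          # 5        -> int.__add__
print(combine(2.5, 0.5))      # 3.0      -> float.__add__
print(combine("ab", "cd"))    # abcd     -> str.__add__
print(combine([1], [2]))      # [1, 2]   -> list.__add__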

CPython is the reference implementation - the one at python.org. There are others: PyPy, Jython, IronPython, MicroPython. But when you run python on your machine, you are almost certainly running CPython, and that is the implementation we will focus on.

Historical Context - Who Built This and When

Guido van Rossum began Python's development in December 1989, releasing version 0.9.0 in February 1991. The bytecode compilation pipeline was present from the earliest versions. The .pyc file format - compiled bytecode cached to disk - was added early to avoid recompiling unchanged source files on every import.

The GIL was introduced as a pragmatic solution to a specific problem: CPython's internal data structures are not thread-safe. Making them thread-safe with fine-grained locks would have been complex, error-prone, and - as Guido later argued - slower for single-threaded programs (which were the dominant use case in 1992). The GIL is a single lock protecting the interpreter state. One thread runs at a time. The simplicity was the point.

The memory allocator pymalloc was introduced in Python 2.1 (2001) and enabled by default in Python 2.3, addressing a performance problem: CPython had been calling the system malloc/free for every single Python object allocation, which is extremely expensive for small, short-lived objects. pymalloc introduced a layered arena/pool/block allocator optimized for Python's allocation patterns.

The import system was substantially overhauled with importlib - the import machinery rewritten in Python itself, making it inspectable and extensible. importlib appeared in Python 3.1, and in Python 3.3 the interpreter's built-in import was re-implemented on top of it. Before importlib, the import system was a tangled mixture of C code and undocumented behavior.

Python 3.11 (2022) introduced significant performance improvements: the adaptive specializing interpreter (PEP 659), which speculatively specializes bytecode instructions based on observed types, giving 10-60% speedups on real workloads without a full JIT. Python 3.12 and 3.13 continued this work.

The CPython Pipeline - Source to Execution

When you run python script.py, a surprisingly complex pipeline executes before your first line of code runs. Understanding this pipeline is the foundation for understanding everything else.

Stage 1: Lexing and Tokenization

The lexer takes raw text and produces a stream of tokens. A token is the smallest meaningful unit of syntax: a keyword (def, if, return), an identifier (x, my_function), a literal (42, "hello"), an operator (+, ==), or punctuation ((, :, newline).

Python's tokenize module exposes the lexer directly:

import tokenize
import io

source = """
def add(x, y):
return x + y
"""

tokens = tokenize.generate_tokens(io.StringIO(source).readline)
for tok in tokens:
print(tok)

This gives you a sequence of (token_type, string, start, end, line) tuples. The lexer handles Python's unusual indentation-based syntax by generating INDENT and DEDENT pseudo-tokens when indentation levels change. This is why you cannot mix tabs and spaces - the lexer uses indentation to determine block structure.
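
You can watch INDENT and DEDENT appear by filtering the token stream - a small variation on the snippet above:

import tokenize
import io
import token

source = "def add(x, y):\n    return x + y\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (token.INDENT, token.DEDENT, token.NAME):
        print(token.tok_name[tok.type], repr(tok.string))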

Stage 2: Parsing and the AST

The parser takes the token stream and builds an Abstract Syntax Tree (AST). In Python 3.9, the parser was rewritten from an LL(1) recursive descent parser to a PEG (Parsing Expression Grammar) parser. PEG parsers are more powerful and the rewrite enabled some syntax that was impossible before (like parenthesized context managers: with (A() as a, B() as b):).

The ast module exposes the AST:

import ast

source = """
def add(x, y):
return x + y
"""

tree = ast.parse(source)
print(ast.dump(tree, indent=2))

Output (abbreviated):

Module(
  body=[
    FunctionDef(
      name='add',
      args=arguments(
        args=[arg(arg='x'), arg(arg='y')]
      ),
      body=[
        Return(
          value=BinOp(
            left=Name(id='x'),
            op=Add(),
            right=Name(id='y')
          )
        )
      ]
    )
  ]
)

The AST is what tools like linters, formatters (black), and type checkers (mypy) operate on. You can manipulate it before compilation - this is how some metaprogramming frameworks work.
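
As a sketch of that kind of manipulation (a toy transform, but the same NodeTransformer mechanism real tools use), here is an AST rewrite that turns every addition into a multiplication before compiling:

import ast

class AddToMul(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)             # transform children first
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

tree = ast.parse("result = 6 + 7")
tree = AddToMul().visit(tree)
ast.fix_missing_locations(tree)              # new nodes need location info

namespace = {}
exec(compile(tree, "<ast>", "exec"), namespace)
print(namespace["result"])                   # 42 - the + became *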

Stage 3: Compilation to Bytecode

The compiler walks the AST and produces a PyCodeObject - a data structure containing the bytecode instructions, constants, variable names, and metadata for a function or module. This is the compile() built-in:

code = compile(source, "<string>", "exec")
print(type(code))       # <class 'code'>
print(code.co_code)     # raw bytecode bytes
print(code.co_consts)   # constants, e.g. (<code object add>, None)
print(code.co_varnames) # local variable names
print(code.co_names)    # global/attribute names

The code object has many interesting attributes:

  • co_code - the raw bytecode as bytes
  • co_consts - tuple of constants used in the code
  • co_varnames - names of local variables
  • co_names - names of globals and attributes
  • co_freevars - names captured from enclosing scope (closures)
  • co_cellvars - local variables captured by inner functions
  • co_filename - source file
  • co_firstlineno - first line number
  • co_lnotab / co_linetable - mapping from bytecode offset to line number
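
A quick way to see co_freevars and co_cellvars in action:

def outer():
    secret = 42
    def inner():
        return secret
    return inner

print(outer.__code__.co_cellvars)    # ('secret',) - captured by inner
print(outer().__code__.co_freevars)  # ('secret',) - received from outer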

Inspecting Bytecode with dis

The dis module disassembles bytecode into human-readable form. This is one of the most useful tools for understanding what Python is actually doing.

import dis

def add(x, y):
    return x + y

dis.dis(add)

Output (abbreviated - real 3.11+ output also shows a RESUME opcode and inline CACHE entries):

  2           0 LOAD_FAST                0 (x)
              2 LOAD_FAST                1 (y)
              4 BINARY_OP                0 (+)
              6 RETURN_VALUE

Each line shows: line number, bytecode offset, opcode name, argument, and (in parentheses) the resolved value of the argument.

Let us look at more interesting cases:

import dis

# A for loop - reveals GET_ITER and FOR_ITER mechanics
def sum_loop(items):
    total = 0
    for item in items:
        total += item
    return total

dis.dis(sum_loop)

Output for sum_loop (abbreviated, Python 3.11-style):

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (total)

  3           4 LOAD_FAST                0 (items)
              6 GET_ITER
        >>    8 FOR_ITER                 7 (to 24)
             10 STORE_FAST               2 (item)

  4          12 LOAD_FAST                1 (total)
             14 LOAD_FAST                2 (item)
             16 BINARY_OP               13 (+=)
             20 STORE_FAST               1 (total)
             22 JUMP_BACKWARD            8 (to 8)

  5     >>   24 LOAD_FAST                1 (total)
             26 RETURN_VALUE

Notice GET_ITER - Python's for loop always works through the iterator protocol. FOR_ITER asks the iterator at the top of the stack for its next value and pushes it, or jumps to the exit offset when the iterator is exhausted. JUMP_BACKWARD brings us back to FOR_ITER.
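
What the loop bytecode does is essentially this desugared version, written out by hand:

def sum_loop_desugared(items):
    total = 0
    it = iter(items)              # GET_ITER
    while True:
        try:
            item = next(it)       # FOR_ITER
        except StopIteration:
            break                 # jump to the loop exit offset
        total += item
    return total

print(sum_loop_desugared([1, 2, 3]))  # 6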

import dis

# List comprehension - compiles to its own nested code object (through 3.11)
def make_squares(n):
    return [x**2 for x in range(n)]

print("=== Outer function ===")
dis.dis(make_squares)

print("\n=== Inner comprehension code object ===")
# The comprehension's position in co_consts varies, so search for it.
# On Python 3.12+, comprehensions are inlined and no nested code object exists.
inner = next(c for c in make_squares.__code__.co_consts if hasattr(c, "co_code"))
dis.dis(inner)

Through Python 3.11, the comprehension is a completely separate code object, created with MAKE_FUNCTION and executed in a nested frame by a regular call. This is why list comprehensions have their own scope in Python 3 - they run in a nested frame. Python 3.12 (PEP 709) inlines comprehensions into the enclosing function's bytecode while keeping the scope isolated.

import dis

# Closures - shows LOAD_DEREF and STORE_DEREF
def make_counter(start):
    count = start
    def increment():
        nonlocal count
        count += 1
        return count
    return increment

print("=== Outer make_counter ===")
dis.dis(make_counter)
print("\n=== Inner increment ===")
counter = make_counter(0)
dis.dis(counter.__code__)

Closures use LOAD_DEREF and STORE_DEREF to access cell variables - variables shared between an outer function and its nested functions. This is different from LOAD_FAST (local variable) and LOAD_GLOBAL (module global).
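
You can inspect those cells directly through the function's __closure__ attribute:

counter = make_counter(10)
cell = counter.__closure__[0]  # the cell shared between make_counter and increment
print(cell.cell_contents)      # 10
counter()
print(cell.cell_contents)      # 11 - STORE_DEREF updated the same cell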

import dis

# Exception handling - PUSH_EXC_INFO, POP_EXCEPT
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None

dis.dis(safe_divide)

Exception handling in Python 3.11+ uses an exception table stored separately from the bytecode (rather than inline SETUP_FINALLY instructions). This made the normal (no-exception) path faster since there are no exception setup instructions in the hot path.
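
On 3.11+ you can see the table yourself - dis prints it after the instructions, and the raw encoded bytes live on the code object:

# Python 3.11+ only: the exception table is its own code attribute
print(safe_divide.__code__.co_exceptiontable)  # compact binary range table
dis.dis(safe_divide)  # output ends with an "ExceptionTable:" section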

The .pyc File Format

When Python compiles a module, it caches the bytecode in __pycache__/module.cpython-311.pyc. The format:

[4 bytes magic number]
[4 bytes bit field (flags)]
[4 bytes source timestamp]  (timestamp + size are replaced by an 8-byte source hash for hash-based .pyc files)
[4 bytes source size]
[marshaled code object]

The magic number changes with each Python version. If the magic number does not match, Python recompiles from source.
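
The current interpreter's magic number is exposed by importlib, which is handy when checking .pyc compatibility by hand:

import importlib.util

print(importlib.util.MAGIC_NUMBER.hex())  # e.g. 'a70d0d0a' on CPython 3.11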

import marshal
import struct
import time

def read_pyc(filename):
    with open(filename, 'rb') as f:
        magic = f.read(4)
        bit_field = struct.unpack('<I', f.read(4))[0]
        mtime = struct.unpack('<I', f.read(4))[0]
        source_size = struct.unpack('<I', f.read(4))[0]
        code = marshal.load(f)

    print(f"Magic: {magic.hex()}")
    print(f"Bit field: {bit_field:#010b}")
    print(f"Timestamp: {time.ctime(mtime)}")
    print(f"Source size: {source_size} bytes")
    print(f"Code type: {type(code)}")
    print(f"First line: {code.co_firstlineno}")
    return code

# Find a pyc file to inspect
import os, sys
pyc_path = os.path.join(
    os.path.dirname(os.__file__),
    "__pycache__",
    f"os.cpython-{sys.version_info.major}{sys.version_info.minor}.pyc"
)
if os.path.exists(pyc_path):
    read_pyc(pyc_path)

The CPython Virtual Machine

The heart of CPython is the evaluation loop in Python/ceval.c. It is a giant switch statement (or computed goto table, depending on the platform) that executes one bytecode instruction at a time.

The eval loop is a stack machine. Each frame has a value stack. Instructions push values onto the stack, pop values to operate on them, and push results back.

A Python frame (PyFrameObject) holds:

  • A pointer to the code object (f_code)
  • A pointer to the global namespace (f_globals)
  • A pointer to the local namespace (f_locals, usually a C array for speed)
  • The instruction pointer (f_lasti)
  • The value stack
  • Exception state
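
You can poke at a live frame from Python itself - sys._getframe is a CPython implementation detail, but implementation details are exactly what we are studying:

import sys

def show_frame():
    frame = sys._getframe()       # the currently executing frame
    print(frame.f_code.co_name)   # 'show_frame'
    print(frame.f_lasti)          # current bytecode offset
    print(frame.f_lineno)         # current source line
    print(type(frame.f_globals))  # <class 'dict'> - the module namespace

show_frame()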

How Opcodes Actually Work - Tracing x + y

Let us trace x + y for two integers at the C level:

  1. LOAD_FAST 0 - pushes fastlocals[0] (a PyObject* pointing to the int object for x) onto the value stack. Cost: one array index, one stack push.
  2. LOAD_FAST 1 - pushes fastlocals[1] (the int object for y).
  3. BINARY_OP 0 (+) - pops both operands, calls PyNumber_Add(left, right).
  4. PyNumber_Add checks left->ob_type->tp_as_number->nb_add. For int objects this is long_add in Objects/longobject.c.
  5. long_add allocates a new PyLongObject, computes the sum, sets its value, sets its refcount to 1.
  6. The new object pointer is pushed onto the value stack.
  7. Reference counts on left and right are decremented. If either hits zero, they are freed immediately.
  8. RETURN_VALUE pops the top of the stack and returns it to the calling frame.

This is why Python is "slow" for numerical work: x + y for two floats involves roughly 30 C function calls, one heap allocation, and reference count bookkeeping. NumPy avoids this entirely by operating on contiguous C arrays with no Python object creation per element.
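
A rough way to feel that per-object overhead (exact numbers vary by machine; NumPy assumed installed, as in the examples later in this lesson):

import time
import numpy as np

n = 10_000_000
py_list = list(range(n))
np_arr = np.arange(n)

t = time.perf_counter()
sum(py_list)          # one PyObject per element, refcounting throughout
py_time = time.perf_counter() - t

t = time.perf_counter()
np_arr.sum()          # a single C loop over a contiguous buffer
np_time = time.perf_counter() - t

print(f"Python sum: {py_time:.3f}s")
print(f"NumPy sum:  {np_time:.3f}s (~{py_time/np_time:.0f}x faster)")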

The Python Object Model

Everything in Python is a PyObject. Literally everything - integers, strings, functions, classes, modules, None. Understanding PyObject is the foundation of understanding Python's memory model.

/* Simplified from Include/object.h */
typedef struct _object {
    Py_ssize_t ob_refcnt;    /* reference count */
    PyTypeObject *ob_type;   /* pointer to type object */
} PyObject;

/* An integer object */
typedef struct {
    PyObject ob_base;        /* must be first - enables casting */
    Py_ssize_t ob_size;      /* number of "digits" in representation */
    digit ob_digit[1];       /* the actual digits (variable length) */
} PyLongObject;

Every object has exactly two mandatory fields: a reference count and a type pointer. The type pointer points to a PyTypeObject (which is itself a PyObject) containing:

  • Method pointers: tp_repr, tp_hash, tp_call, tp_iter, plus sub-tables such as tp_as_number (whose nb_add slot implements +), tp_as_sequence, and tp_as_mapping
  • The type's name and documentation
  • Memory layout information (tp_basicsize, tp_itemsize)
  • Inheritance information (tp_base, tp_bases)
  • MRO (Method Resolution Order) cache

When you call str(42), Python: follows 42->ob_type (which is &PyLong_Type), looks up tp_str (which points to long_to_decimal_string), calls it with the integer object. No dictionary lookup, no name resolution - it is a direct C function pointer call.

import sys

# Every object has a reference count
x = [1, 2, 3]
print(sys.getrefcount(x)) # 2: one from x, one from getrefcount argument

y = x
print(sys.getrefcount(x)) # 3: x, y, and getrefcount argument

del y
print(sys.getrefcount(x)) # 2 again

# Small integer caching - CPython caches integers from -5 to 256
a = 42
b = 42
print(a is b) # True - same PyObject in memory!

a = 1000
b = 1000
print(a is b) # False in the REPL; may be True when run as a script, where
              # equal constants within one code object are shared

# id() comparisons are even less reliable: a freed object's address
# can be reused immediately by the next allocation
print(id(42) == id(42))     # True - the cached object both times
print(id(1000) == id(1000)) # Often True as well, purely from address reuse

String Interning

Identifier-like strings (no spaces, alphanumeric + underscore) are interned automatically. Python keeps exactly one copy in a global dictionary and all variables pointing to "hello" share the same object.

import sys

# Strings that look like identifiers are interned
a = "hello"
b = "hello"
print(a is b) # True - same interned object

# Strings with spaces are NOT automatically interned
a = "hello world"
b = "hello world"
print(a is b) # False in the REPL (may be True in a script, where equal
              # constants in one code object are shared)

# Force interning
a = sys.intern("hello world")
b = sys.intern("hello world")
print(a is b) # True - now interned

# Dictionary key lookups benefit enormously from interning:
# equality checks can short-circuit on pointer identity
d = {}
key = sys.intern("my_key")
d[key] = 42
print(d[key]) # Lookup compares hashes, then pointer identity, then falls back to __eq__

Reference Counting and the Cycle Garbage Collector

CPython uses reference counting as its primary memory management strategy. Each object tracks how many references point to it. When the count reaches zero, the object is immediately deallocated - no GC pause, no latency spike.

import gc
import sys

class Node:
    def __init__(self, name):
        self.name = name
        self.next = None

    def __del__(self):
        print(f"Deleting {self.name}")

# Simple deallocation - reference counting handles this perfectly
n = Node("A")
print(sys.getrefcount(n)) # 2 (n + getrefcount arg)

del n # Prints "Deleting A" immediately - refcount hits 0
print("After del")

# Circular reference - reference counting fails here
a = Node("A")
b = Node("B")
a.next = b # b's refcount = 2 (b variable + a.next)
b.next = a # a's refcount = 2 (a variable + b.next)

del a # a's refcount goes to 1 (b.next still points to it)
del b # b's refcount goes to 1 (a.next still points to it)
# Neither object reaches 0! Memory is retained until GC runs.
print("Objects still alive - GC has not run yet")

gc.collect() # Force generational GC
print("After gc.collect() - now they are deleted")

The cycle garbage collector (Modules/gcmodule.c) handles reference cycles using a generational scheme:

import gc

# Inspect the garbage collector thresholds
print(gc.get_threshold()) # (700, 10, 10) by default
print(gc.get_count()) # current (gen0, gen1, gen2) object counts

# Tune for latency-sensitive applications
# Increase thresholds to reduce collection frequency at the cost of memory
gc.set_threshold(1000, 15, 15)

# Disable automatic GC for batch processing (re-enable after)
gc.disable()
# ... batch code ...
gc.collect() # Manual collection at controlled point
gc.enable()

# Inspect collected garbage (DEBUG_SAVEALL keeps it around instead of freeing)
gc.set_debug(gc.DEBUG_SAVEALL)
# ... create cycles ...
gc.collect()
print(gc.garbage) # With DEBUG_SAVEALL, every unreachable object lands here;
                  # without it, only truly uncollectable objects do

# Use weakref to avoid cycles in parent/child relationships
import weakref

class Parent:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        child.parent = weakref.ref(self) # Weak reference - does not increase refcount

class Child:
    pass

p = Parent()
c = Child()
p.add_child(c)
# Now del p - it will be freed immediately because c.parent is weak

:::warning Performance Impact of Cycles Every circular reference you create adds work for the cycle GC. In performance-critical code - especially objects created in tight loops - use __slots__, be careful with parent/child references, and use weakref.ref for back-references where you do not need the reference to keep the target alive. In web servers, GC pauses during generation 2 collection can cause noticeable latency spikes. :::

The GIL - Global Interpreter Lock

The GIL is CPython's most controversial design decision. To understand it, you need to understand what it is protecting.

CPython's reference counting is not atomic. Incrementing and decrementing ob_refcnt involves a read-modify-write operation. If two threads simultaneously decrement the same object's reference count from 1 to 0, you get a race: both read refcnt = 1, both try to deallocate the object, you get a double-free and memory corruption.

Making every reference count operation atomic (with CPU compare-and-swap instructions) would work, but adds ~10-30% overhead to all reference counting in single-threaded code - the dominant use case. The GIL is simpler: one lock, one thread runs at a time, reference counting is always safe.

GIL Switching Mechanics

The GIL is not held forever by one thread. CPython switches every sys.getswitchinterval() seconds (default: 5ms):

  1. A thread waiting for the GIL times out after switch_interval seconds and sets the eval_breaker flag in the interpreter state.
  2. The running thread checks eval_breaker at regular checkpoints in the eval loop.
  3. If the flag is set, it calls _PyEval_HandlePendingCalls(), which honors pending GIL drop requests.
  4. The current thread drops the GIL and signals waiting threads.
  5. A waiting thread acquires the GIL and continues.

import sys
import threading
import time

print(f"Switch interval: {sys.getswitchinterval() * 1000:.1f}ms") # 5.0ms default

# For I/O-bound servers: shorter interval gives more responsive switching
sys.setswitchinterval(0.001) # 1ms - trades throughput for responsiveness

# Demonstrate: two CPU-bound threads are NOT faster (and may be slower)
def cpu_work(n=10_000_000):
    total = 0
    for i in range(n):
        total += i
    return total

def time_serial():
    t = time.perf_counter()
    cpu_work()
    cpu_work()
    return time.perf_counter() - t

def time_threaded():
    t = time.perf_counter()
    threads = [threading.Thread(target=cpu_work) for _ in range(2)]
    for th in threads: th.start()
    for th in threads: th.join()
    return time.perf_counter() - t

serial = time_serial()
threaded = time_threaded()
print(f"Serial: {serial:.3f}s")
print(f"Threaded: {threaded:.3f}s") # Same or SLOWER due to GIL contention overhead!
print(f"Overhead: {(threaded/serial - 1)*100:.1f}%")

C Extensions and the GIL

C extensions can release the GIL explicitly using Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros. This is how NumPy enables real parallelism:

import numpy as np
import threading
import time

# NumPy releases the GIL for most array operations
def numpy_matmul():
    a = np.random.randn(512, 512)
    return np.dot(a, a.T) # GIL released during this C operation

def time_numpy_serial():
    t = time.perf_counter()
    for _ in range(4): numpy_matmul()
    return time.perf_counter() - t

def time_numpy_threaded():
    t = time.perf_counter()
    threads = [threading.Thread(target=numpy_matmul) for _ in range(4)]
    for th in threads: th.start()
    for th in threads: th.join()
    return time.perf_counter() - t

serial = time_numpy_serial()
threaded = time_numpy_threaded()
print(f"Serial numpy: {serial:.3f}s")
print(f"Threaded numpy: {threaded:.3f}s") # Significantly faster!
print(f"Speedup: {serial/threaded:.1f}x")

The rule: threading works when the hot path is in C code that releases the GIL. Threading does not work when the hot path is in Python bytecode.

Memory Management - pymalloc

Python objects are allocated and freed millions of times per second. Calling system malloc/free for every allocation would be prohibitively expensive. CPython uses pymalloc for objects up to 512 bytes.

The three levels:

  • Arenas - 256KB blocks obtained from the OS. Tracked in a global array. Returned to OS only when completely empty (which rarely happens in practice).
  • Pools - 4KB pages within arenas. Each pool is dedicated to one size class (8 to 512 bytes in 8-byte increments, giving 64 size classes). A request for 17 bytes gets rounded up to 24 bytes and served from a pool for 24-byte blocks.
  • Blocks - Fixed-size chunks within a pool. Freed blocks are kept in a per-pool free list and reused immediately without touching the OS.

import sys
import tracemalloc

# Measure object sizes
print(f"int (42): {sys.getsizeof(42)} bytes")
print(f"int (10**100): {sys.getsizeof(10**100)} bytes") # Big int is bigger!
print(f"float: {sys.getsizeof(3.14)} bytes")
print(f"str (empty): {sys.getsizeof('')} bytes")
print(f"str ('hello'): {sys.getsizeof('hello')} bytes")
print(f"list (empty): {sys.getsizeof([])} bytes")
print(f"list ([1,2,3]): {sys.getsizeof([1,2,3])} bytes")
print(f"dict (empty): {sys.getsizeof({})} bytes")
print(f"tuple (empty): {sys.getsizeof(())} bytes")

# Note: sys.getsizeof does NOT count referenced objects
# This is "shallow" size - only the container itself
big_list = list(range(1000))
print(f"list of 1000 ints (shallow): {sys.getsizeof(big_list)} bytes")
# Each int element also occupies ~28 bytes but getsizeof does not count them

# tracemalloc gives you deep allocation tracking
tracemalloc.start()

data = [{"value": i, "squared": i * i} for i in range(10_000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print("\nTop 3 allocations:")
for stat in top_stats[:3]:
    print(f" {stat}")

tracemalloc.stop()

# __slots__ reduces per-instance memory significantly
class RegularNode:
    def __init__(self, x, y, value):
        self.x = x
        self.y = y
        self.value = value
    # Has __dict__: dynamic attributes, ~200-400 bytes overhead

class SlottedNode:
    __slots__ = ('x', 'y', 'value')
    def __init__(self, x, y, value):
        self.x = x
        self.y = y
        self.value = value
    # No __dict__: fixed layout, much smaller

import sys
r = RegularNode(1, 2, 3)
s = SlottedNode(1, 2, 3)
print(f"Regular: {sys.getsizeof(r) + sys.getsizeof(r.__dict__)} bytes")
print(f"Slotted: {sys.getsizeof(s)} bytes")
# Typical output (varies by Python version): Regular ~232 bytes, Slotted ~64 bytes

# At 1 million nodes, __slots__ saves on the order of 150+ MB
n = 1_000_000
# Rough estimate: an empty dict stands in for each instance __dict__
regular_mb = n * (sys.getsizeof(RegularNode(1,2,3)) + sys.getsizeof({})) / 1e6
slotted_mb = n * sys.getsizeof(SlottedNode(1,2,3)) / 1e6
print(f"1M regular nodes: {regular_mb:.0f} MB")
print(f"1M slotted nodes: {slotted_mb:.0f} MB")

:::danger Memory Not Returned to OS Python's allocator holds arenas even after all objects in them are freed. A process that allocates and frees 10GB of Python objects may still show 10GB RSS in top or htop. The memory is available within pymalloc but not returned to the OS. For long-running services like web servers, this causes apparent memory leaks. Use gc.collect() and process-level monitoring. For extreme cases, consider worker process recycling (gunicorn --max-requests). :::
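
When RSS and your Python-level accounting disagree, CPython can dump pymalloc's internal state - arenas, pools, and size classes. Note the leading underscore: this is an implementation detail, and its output format is not stable:

import sys

# Prints a detailed pymalloc report (arenas, pools, blocks in use) to stderr
sys._debugmallocstats()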

The Import System

Python's import system is itself written in Python (via importlib), making it inspectable and extensible.

import sys
import importlib
import importlib.util
import time

# sys.modules is the module cache
print('json' in sys.modules) # Often False in a fresh interpreter (but another module may already have imported json)

import json
print('json' in sys.modules) # True now

# Second import is instantaneous - just a dict lookup
t = time.perf_counter()
import json # Cache hit
elapsed = (time.perf_counter() - t) * 1e6
print(f"Cached import: {elapsed:.2f} microseconds")

# Inspect where a module came from
print(json.__file__) # /usr/lib/python3.x/json/__init__.py
print(json.__spec__) # ModuleSpec with loader and origin

# Find a module without importing it
spec = importlib.util.find_spec('pathlib')
print(spec.origin) # /usr/lib/python3.x/pathlib.py
print(spec.loader) # SourceFileLoader

# Trace what happens during import
import importlib.machinery

# sys.path controls where Python looks
print("Import search path (first 3):")
for p in sys.path[:3]:
    print(f" {p}")

# Import timing - where is startup time spent?
# Run: python -X importtime your_script.py
# This prints a tree of import times

# Manual timing of a cold import
import sys
if 'pandas' in sys.modules:
    del sys.modules['pandas']
    # Also need to remove sub-modules, but this is the idea

# Lazy imports - defer expensive imports until first use
class LazyLoader:
    """Delays import until attribute access."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        if self._module is None:
            import importlib
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)

# Usage
np = LazyLoader('numpy') # Does NOT import numpy yet
# np.array([1,2,3]) # Now imports numpy on first use

Import Hooks - Extending the Import System

import sys

class DebugFinder:
    """Meta path finder that logs every import attempt."""

    def find_spec(self, fullname, path, target=None):
        print(f"Importing: {fullname} (path={path})")
        return None # None = pass to next finder

# Insert before other finders to intercept all imports
sys.meta_path.insert(0, DebugFinder())

import csv # Prints each module imported along the way
sys.meta_path.pop(0) # Remove our spy

# .pth files - how site-packages advertises itself
import site
print("Site-packages directories:")
for d in site.getsitepackages():
    print(f" {d}")

Why PyPy Is Faster

PyPy is an alternative Python implementation that uses JIT compilation. Understanding why it is faster clarifies what makes CPython slow.

The fundamental issue: CPython cannot know types at compile time. Every BINARY_OP must dynamically look up the operation on the operands' types at runtime. For code that runs in a tight loop with consistent types (which is most numerical code), this repeated type lookup is pure overhead.

PyPy's tracing JIT:

  1. Runs code in interpreter mode initially
  2. Detects "hot loops" that execute frequently (threshold: 1039 iterations)
  3. Records a trace of all operations including the actual types seen
  4. Compiles the trace to native machine code, specialized for those types
  5. Future iterations use the compiled native code - no type checking, no vtable dispatch

The speedup:

CPython: x += 1
    LOAD_FAST   -> 1 array lookup
    LOAD_CONST  -> 1 array lookup
    BINARY_OP   -> type check -> dispatch to int.__iadd__ -> alloc/reuse int object
    STORE_FAST  -> 1 array write
    ~30 C operations total

PyPy (after JIT, knowing x is int):
    add [rbx], 1 -> 1 x86 instruction

For loop-heavy numerical code, PyPy is commonly 3-10x faster than CPython. It is slower at startup (JIT compilation takes time) and uses more memory. It is not always faster - code that creates many different types confounds the JIT's type specialization.

Production Engineering Notes

Profile before optimizing. Most Python performance problems are not where you expect. cProfile gives function-level timing; line_profiler (pip install) gives line-level timing.

import cProfile
import pstats
import io

# Profile a function
profiler = cProfile.Profile()
profiler.enable()

# ... your code here ...
result = sorted(range(1_000_000))

profiler.disable()

# Print top 20 functions by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(20)
print(stream.getvalue())

# Or use context manager style
with cProfile.Profile() as pr:
    result = sorted(range(1_000_000))
pr.print_stats(sort='cumulative')

GIL in practice - the decision tree:

  • CPU-bound work that must parallelize: use multiprocessing or concurrent.futures.ProcessPoolExecutor
  • I/O-bound concurrency: use threading or asyncio
  • Numerical work in NumPy/SciPy: threading works (GIL is released in C)
  • Pure Python computation: threading gives no speedup

Import time at scale. Lambda functions and microservices pay for import time on every cold start. Run python -X importtime script.py to see the full import tree with timing. Consider lazy imports for rarely-used heavy modules.

# Checking for memory leaks in long-running processes
import tracemalloc
import gc

class MemoryTracker:
    def __init__(self):
        self.snapshot = None

    def start(self):
        tracemalloc.start()
        gc.collect()
        self.snapshot = tracemalloc.take_snapshot()

    def report(self, top_n=10):
        gc.collect()
        current = tracemalloc.take_snapshot()
        stats = current.compare_to(self.snapshot, 'lineno')
        print(f"Top {top_n} memory increases:")
        for stat in stats[:top_n]:
            print(f" {stat}")

tracker = MemoryTracker()
tracker.start()
# ... run your code ...
tracker.report()

:::tip dis Is Your Performance Debugger When you cannot figure out why a Python function is slow, dis.dis(func) often reveals the answer. Too many LOAD_GLOBAL opcodes in a hot loop? Cache the global in a local variable. Unexpected CALL instructions? An operator is dispatching through Python instead of a fast C path. The bytecode does not lie. :::

Common Mistakes

:::danger Threading for CPU-Bound Work threading.Thread with CPU-bound Python code does not parallelize. The GIL ensures only one thread executes Python bytecode at any moment. Two CPU-bound threads on a 64-core machine use exactly 1 core. Use multiprocessing.Process or concurrent.futures.ProcessPoolExecutor for real CPU parallelism. Use threading only for I/O-bound work or when calling C extensions that release the GIL. :::

:::danger Assuming del Frees Memory del x decrements the reference count. If other references exist - in a container, a closure, a cycle, a C extension holding a reference - the object is NOT freed. del x followed by gc.collect() handles cycles. But even then, pymalloc may not return the memory to the OS due to arena fragmentation. Monitor with tracemalloc or OS-level RSS, not del alone. :::

:::warning Integer Identity Comparison Never use is to compare integers (except None, True, False). a is b checks object identity (same memory address). For integers, CPython caches -5 through 256, so a = 100; b = 100; a is b is True. But a = 1000; b = 1000; a is b is False. Always use == for value comparison. This bug is invisible in unit tests (small integers) but appears in production (large integers). :::

:::warning Mutable Default Arguments def f(x=[]): creates the list once, when the function is defined, not on each call. Subsequent calls that mutate x see each other's changes, because default values are evaluated at definition time and stored in the function's __defaults__ tuple. The fix: def f(x=None): if x is None: x = []. :::
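
The classic demonstration, with the fix:

def broken(item, bucket=[]):  # one list, created at definition time
    bucket.append(item)
    return bucket

print(broken(1))            # [1]
print(broken(2))            # [1, 2] - the same list again!
print(broken.__defaults__)  # ([1, 2],) - the default lives on the function

def fixed(item, bucket=None):
    if bucket is None:
        bucket = []           # fresh list on every call
    bucket.append(item)
    return bucket

print(fixed(1))  # [1]
print(fixed(2))  # [2]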

Interview Q&A

Q1: Explain the GIL. Why does it exist? When does it matter and when does it not?

The GIL (Global Interpreter Lock) is a mutex in CPython that allows only one thread to execute Python bytecode at a time. It exists because CPython uses reference counting for memory management, and reference count operations are not atomic. Without the GIL, two threads simultaneously modifying an object's reference count could corrupt memory.

It matters for: CPU-bound threading (two Python threads on a 16-core machine use 1 core total).

It does NOT matter for: I/O-bound threading (GIL released during all I/O syscalls), C extension operations that explicitly release the GIL (NumPy array operations, database drivers, most network libraries), and multiprocessing (each process has its own interpreter and GIL).

PEP 703 (accepted in 2023) made a free-threaded CPython an official goal, and Python 3.13 ships an experimental build without the GIL. Removing it required making reference counting atomic and adding thread-safe data structures throughout CPython.


Q2: What is a Python code object and what does it contain?

A PyCodeObject is the compiled representation of a Python function or module. It contains:

  • co_code: raw bytecode as bytes
  • co_consts: tuple of constants (literals) used in the code
  • co_varnames: names of local variables
  • co_names: names of globals and attributes referenced
  • co_freevars: variables captured from enclosing scope (closures)
  • co_cellvars: local variables captured by inner functions
  • co_filename, co_firstlineno: source location for tracebacks
  • co_linetable: maps bytecode offsets to source line numbers

Code objects are distinct from function objects. A function object wraps a code object and adds: a closure (tuple of cell objects), a global namespace pointer, default argument values, and keyword-only defaults. You can share one code object across multiple function objects with different closures.
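
A quick demonstration that one code object can back many function objects:

def make_adder(n):
    def add(x):
        return x + n
    return add

add1 = make_adder(1)
add2 = make_adder(2)
print(add1.__code__ is add2.__code__)   # True - one shared code object
print(add1.__closure__[0].cell_contents,
      add2.__closure__[0].cell_contents) # 1 2 - different closures
print(add1(10), add2(10))                # 11 12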


Q3: How does Python's memory allocator work? Why does Python not return memory to the OS?

CPython uses pymalloc, a three-level allocator for objects up to 512 bytes.

Arenas (256KB each) are obtained from the OS via mmap. Each arena is divided into pools (4KB each), where each pool serves exactly one size class. Size classes cover 8 to 512 bytes in 8-byte increments. A request for 17 bytes gets a 24-byte block from the pool for the 24-byte size class.

Python only returns an arena to the OS when every pool in it is completely empty. In practice, long-running processes have persistent objects (module-level variables, interned strings, class objects, cached functions) distributed across every arena. No arena ever becomes completely empty, so arenas are never returned. The memory is free within pymalloc (available for new allocations) but invisible to the OS. This looks like a memory leak.

The practical fix: use gc.collect() + process recycling for long-running services.


Q4: Trace what happens when Python imports a module for the first time vs. a subsequent import.

First import of json:

  1. IMPORT_NAME opcode calls __import__('json')
  2. importlib checks sys.modules['json'] - not found
  3. Finders in sys.meta_path are consulted: BuiltinImporter, FrozenImporter, PathFinder
  4. PathFinder finds json/__init__.py on sys.path
  5. A ModuleSpec is created with path and SourceFileLoader
  6. A new module object is created and registered in sys.modules['json'] BEFORE execution (to handle circular imports correctly)
  7. The loader reads the source, checks for a valid .pyc in __pycache__/, compiles if needed
  8. The compiled bytecode executes in the module's namespace
  9. The populated module is returned

Second import: sys.modules['json'] lookup succeeds, returns the cached module. Cost: one dictionary lookup, typically under 1 microsecond.


Q5: What is the difference between LOAD_FAST, LOAD_GLOBAL, and LOAD_DEREF? Why does this matter for performance?

  • LOAD_FAST: loads from fastlocals[], a C array in the frame. One array index lookup. Fastest possible.
  • LOAD_GLOBAL: looks up a name in f_globals (a dict) and then f_builtins (another dict) if not found. Two dictionary lookups. Significantly slower.
  • LOAD_DEREF: loads from a PyCellObject (for closures). One pointer dereference to the cell, then one pointer dereference to the value. Slightly slower than LOAD_FAST.

In tight loops, the difference is measurable. The standard optimization:

import math

# Each iteration: LOAD_GLOBAL (math) + LOAD_ATTR (sqrt)
def slow_sqrt_sum(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total

# Each iteration: LOAD_FAST (sqrt)
def fast_sqrt_sum(n):
    sqrt = math.sqrt # One-time lookup, cached in a local
    total = 0.0
    for i in range(n):
        total += sqrt(i)
    return total

For a loop running 10 million iterations, this optimization typically gives 15-25% speedup for the affected lookups.


Q6: How does CPython handle reference cycles? Give a real-world example.

Reference cycles occur when object A holds a reference to B and B holds a reference back to A. Reference counting alone cannot free them: neither object's count reaches zero when external references are removed.

CPython's cycle GC maintains a linked list of all "container" objects (those capable of holding references to other objects). During collection, it simulates what the reference counts would be if all external references were removed. Objects with simulated count of zero are unreachable and form a cycle - they are freed.

Real-world example: a web framework's request context holding a reference to a logger, which holds a reference back to the request context for request-scoped log formatting. In a high-traffic server processing 10,000 requests/second, thousands of these cycles accumulate per second. Generation 2 GC collections (triggered infrequently but collecting all three generations) can pause all Python threads for tens of milliseconds - causing timeout spikes in production.

Fix: use weakref.ref for the back-reference. weakref does not increment the reference count, so the cycle is broken and reference counting can free the objects immediately.


Q7: What really happens when you write x = 1000 twice in two separate lines? Are they the same object?

It depends on context. In the interactive REPL, where each statement is compiled separately:

x = 1000
y = 1000
print(x is y) # False - two separate PyLongObject allocations

But inside a single code object - a function body, or a whole script compiled at once - the compiler deduplicates equal constants:

def f():
    x = 1000
    y = 1000
    return x is y

print(f()) # True! Both 1000 literals appear in co_consts as the same object

This is why you should never rely on is for integer comparison. The behavior is an implementation detail that can change between Python versions, between code contexts (function vs. module), and with future optimizations. Always use == for value comparison.

© 2026 EngineersOfAI. All rights reserved.