Heap and Stack Memory
The Production Scenario
It is 2 AM and a model serving container running on a 16 GB machine has just crashed. The error is not an out-of-memory exception from PyTorch - it is a segmentation fault deep inside CPython's internals. The stack trace is useless, five frames deep in interpreter code before touching anything you wrote. You restart the service, it runs fine for six hours, then crashes again. Intermittent. Reproducible only under load.
The engineer who eventually debugged this spent four days on it. The root cause was a single Python function that allocated a 400 KB NumPy array on every request and stored a reference to it in a module-level list for "debug logging." Over six hours the list grew to consume 8 GB. When the next large batch arrived, the combined allocation crossed the OS limit and the kernel OOM killer terminated the process. The segfault was a red herring - the real crash happened 30 milliseconds earlier, and the segfault was just the CPython interpreter discovering a deallocated object during cleanup.
At first, the engineer could not explain why the module-level list held 8 GB. They knew Python had garbage collection and assumed the GC would clean it up. What they did not understand was that a live reference in a module-level variable is a root - the GC will never collect it, because by definition it is reachable. Understanding this required understanding the difference between the stack (where references live while a function is running) and the heap (where objects actually live), and what "reachable" means in each context.
This lesson starts at the hardware level and works up. We will look at how the stack and heap are physically laid out in virtual memory, how Python's allocator sits on top of the OS, and what every Python object actually costs in bytes. By the end you will be able to look at any Python data structure and reason about exactly where each byte lives, why it is there, and when it will be freed.
The material here is not academic. Every AI engineer who has debugged a GPU OOM error, optimized a DataLoader memory footprint, or profiled a serving endpoint has needed this mental model. The engineers who debug these problems quickly are the ones who have it. The engineers who spend four days on a module-level list are the ones who do not.
Why This Exists
Early programs had no separation between stack and heap. Programmers allocated memory manually, and the entire memory space was a flat array they managed by hand. This worked for small programs but collapsed at scale: there was no way to have functions that called other functions without carefully tracking which bytes belonged to which call, and there was no way to allocate data whose size was unknown at compile time.
The stack/heap split solves two distinct problems. The stack solves the problem of function call bookkeeping - when function A calls function B calls function C, you need to track three sets of local variables and three return addresses, and you need to clean them up in exactly the right order. The heap solves the problem of dynamic allocation - data whose size is only known at runtime, data that needs to outlive the function that created it, and data that is too large to fit on the stack.
Python adds a third layer on top of this. Because Python objects are reference-counted and garbage-collected, and because the Python runtime itself is a C program, Python has to manage its own internal heap on top of the OS heap. This is pymalloc - a custom allocator that sits between CPython and the OS's malloc, optimizing for Python's specific allocation patterns. Understanding all three layers - hardware stack, OS heap, Python heap - is what lets you reason precisely about where memory goes in a running ML system.
Historical Context
The stack as a data structure was formalized by Friedrich Bauer and Klaus Samelson in 1957, who patented the "cellar principle" for storing return addresses during subroutine calls. The idea of a call stack separate from the data heap appears in ALGOL 60 (1960), the first language to have block structure and recursive functions. ALGOL needed a mechanism to allocate a new set of local variables on every function entry and reclaim them on every function exit - the LIFO stack was the obvious solution.
The heap allocator as we know it (dynamic allocation with malloc/free semantics) was standardized in C in the 1970s. The Unix sbrk() system call, which extends the data segment, was the primitive on which early allocators were built. Doug Lea's dlmalloc (1987) became the basis for glibc malloc and remains influential in allocator design today.
Python's pymalloc was introduced by Vladimir Marangozov in Python 2.1 (2001). The observation that motivated it was that CPython's internal allocation pattern is almost entirely small objects - integers, strings, tuples, dicts - with sizes under 512 bytes. Calling malloc() for every small object has two problems: glibc malloc has per-call overhead (~100 ns), and it has poor cache locality because objects are scattered across the heap. pymalloc solves both by pre-allocating large chunks (arenas of 256 KB) and managing small allocations internally with minimal overhead. The design has remained essentially unchanged from Python 2.1 to Python 3.12.
Core Concepts
The Stack: A LIFO Region of Memory
The stack is a contiguous region of virtual memory that grows and shrinks as functions are called and return. Every process has one main stack (plus one per thread). On x86-64 Linux, the default stack size is 8 MB per thread. On macOS, it is 8 MB for the main thread and 512 KB for secondary threads.
When a function is called, the CPU pushes a stack frame (also called an activation record) onto the stack. A stack frame contains:
- Return address - where to jump when the function returns
- Saved frame pointer - the previous frame's base address, so we can restore it
- Local variables - all variables declared inside the function
- Function arguments - passed values (spilled from registers if there are more than six on x86-64)
- Saved registers - any caller-saved registers the function will clobber
The stack pointer (RSP on x86-64) always points to the top of the stack. Pushing a frame means subtracting from RSP (the stack grows downward toward lower addresses). Popping a frame means adding back to RSP. This is a single CPU instruction - stack allocation and deallocation are essentially free at the hardware level.
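You can query the stack limit the OS grants the process from Python. A minimal sketch using the standard resource module (Unix-only; values vary by platform):

import resource

# RLIMIT_STACK is the OS-enforced stack size limit for the main thread
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
if soft == resource.RLIM_INFINITY:
    print("Stack soft limit: unlimited")
else:
    print(f"Stack soft limit: {soft // (1024 * 1024)} MB")  # typically 8 MB on Linux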
Stack frames are fixed size at compile time in compiled languages like C/C++ (C99 variable-length arrays and alloca are the exceptions). The compiler knows all local variable sizes and can allocate the entire frame in one instruction. In CPython, each Python frame object is heap-allocated, but the C-level stack frames for CPython's own execution are stack-allocated in the normal way.
Stack Overflow: When the Stack Runs Out
Stack overflow occurs when the stack grows past its limit. The OS marks a "guard page" just below the stack's end as unreadable. When the stack pointer crosses into this page, the CPU triggers a page fault, which the OS converts to a SIGSEGV (segmentation fault).
In Python, deep recursion causes this. Python's default recursion limit is 1000 - this is a Python-level limit set below the hardware limit to give a clean Python exception rather than a hard crash:
import sys
import time
import inspect

# Default recursion limit
print(sys.getrecursionlimit())  # 1000

# Demonstrate stack overflow behavior
def deep_recurse(n):
    return deep_recurse(n + 1)

try:
    deep_recurse(0)
except RecursionError:
    print(f"RecursionError caught at limit {sys.getrecursionlimit()}")

# Inspect the call stack with the inspect module
def show_frame_info():
    frame = inspect.currentframe()
    depth = 0
    while frame is not None:
        info = inspect.getframeinfo(frame)
        print(f"  Depth {depth}: {info.filename}:{info.lineno} in {info.function}")
        frame = frame.f_back
        depth += 1
        if depth > 5:
            print("  ...")
            break

def caller_b():
    show_frame_info()

def caller_a():
    caller_b()

caller_a()

# Recursion depth benchmark: recursion vs iteration
def recursive_sum(n):
    if n <= 0:
        return 0
    return n + recursive_sum(n - 1)

def iterative_sum(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total

# Recursive: limited by recursion depth, slower per call due to frame allocation
sys.setrecursionlimit(2000)
start = time.perf_counter()
result = recursive_sum(900)
elapsed = time.perf_counter() - start
print(f"Recursive sum(900): {elapsed*1e6:.1f} us, result={result}")

# Iterative: no limit, faster
start = time.perf_counter()
result = iterative_sum(900)
elapsed = time.perf_counter() - start
print(f"Iterative sum(900): {elapsed*1e6:.1f} us, result={result}")

sys.setrecursionlimit(1000)  # restore default
Setting sys.setrecursionlimit to a very high value (e.g., 100000) does not increase the OS stack size. It just delays Python's Python-level check. If the stack actually grows that deep, you will get a hard segmentation fault instead of a clean RecursionError. The correct fix for deep recursion is to rewrite as iteration or use an explicit stack data structure.
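As a sketch of that rewrite, here is a recursive tree walk converted to a loop over an explicit, heap-allocated stack. The tree encoding - (value, [children]) tuples - is an assumption for illustration, not anything from a real library:

# Converting deep recursion to an explicit stack
def tree_sum_iterative(root):
    total = 0
    stack = [root]  # a heap-allocated list replaces the call stack
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)  # no recursion, so no RecursionError at any depth
    return total

tree = (1, [(2, []), (3, [(4, [])])])
print(tree_sum_iterative(tree))  # 10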
The Heap: Dynamic Allocation
The heap is the region of virtual memory where dynamically allocated objects live. Unlike the stack, the heap has no inherent ordering - objects are allocated and freed in arbitrary order. The allocator's job is to find a free region of the right size (fast), hand it out, and reclaim it when freed.
The fundamental operations are:
- malloc(n) - find a contiguous free region of at least n bytes, mark it allocated, return a pointer
- free(ptr) - mark the region starting at ptr as available for reuse
The challenge is fragmentation. After many allocate/free cycles, the heap can have large amounts of free memory scattered in small non-contiguous chunks, making it impossible to satisfy a large allocation even though total free memory is sufficient.
Suppose the heap contains three 64-byte free chunks scattered between allocated regions - 192 bytes free in total. A request for 150 bytes cannot be satisfied even though total free memory exceeds the request, because no single chunk is large enough. This is external fragmentation - the free memory exists but is not contiguous.
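A toy model makes this concrete. This is not a real allocator - just a first-fit search over an assumed free list:

# Toy model of external fragmentation (illustrative only)
free_chunks = [(0, 64), (128, 64), (256, 64)]  # (offset, size): 192 bytes free total

def first_fit(request):
    for offset, size in free_chunks:
        if size >= request:
            return offset
    return None  # no single contiguous chunk is big enough

print(first_fit(64))   # 0    - fits in the first chunk
print(first_fit(150))  # None - 192 bytes free, but not contiguous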
C Memory Layout: Struct Padding and Alignment
Understanding struct layout in C is essential when writing Python C extensions, working with ctypes, or analyzing NumPy's buffer protocol. The CPU requires that data be aligned - a 4-byte integer must be at an address divisible by 4, an 8-byte double at an address divisible by 8. The compiler inserts padding bytes to satisfy these constraints.
// This C struct is NOT 14 bytes - it is 24 bytes due to padding
struct Example {
    char a;    // 1 byte at offset 0
               // 3 bytes padding (to align b to a 4-byte boundary)
    int b;     // 4 bytes at offset 4
    char c;    // 1 byte at offset 8
               // 7 bytes padding (to align d to an 8-byte boundary)
    double d;  // 8 bytes at offset 16
};             // Total: 24 bytes (not 14!)

// Reordering fields reduces padding
struct Efficient {
    double d;  // 8 bytes at offset 0
    int b;     // 4 bytes at offset 8
    char a;    // 1 byte at offset 12
    char c;    // 1 byte at offset 13
               // 2 bytes padding (align struct size to 8-byte boundary)
};             // Total: 16 bytes (vs 24 above)
You can inspect this from Python using ctypes:
import ctypes

class ExampleStruct(ctypes.Structure):
    _fields_ = [
        ("a", ctypes.c_char),
        ("b", ctypes.c_int),
        ("c", ctypes.c_char),
        ("d", ctypes.c_double),
    ]

class EfficientStruct(ctypes.Structure):
    _fields_ = [
        ("d", ctypes.c_double),
        ("b", ctypes.c_int),
        ("a", ctypes.c_char),
        ("c", ctypes.c_char),
    ]

print(f"ExampleStruct size: {ctypes.sizeof(ExampleStruct)} bytes")
print(f"EfficientStruct size: {ctypes.sizeof(EfficientStruct)} bytes")

# Inspect field offsets
print("\nExampleStruct field offsets:")
for name, _ in ExampleStruct._fields_:
    field = getattr(ExampleStruct, name)
    print(f"  {name}: offset={field.offset}, size={field.size}")

print("\nEfficientStruct field offsets:")
for name, _ in EfficientStruct._fields_:
    field = getattr(EfficientStruct, name)
    print(f"  {name}: offset={field.offset}, size={field.size}")

# Packed struct - no padding, potentially slower due to unaligned access
class PackedStruct(ctypes.Structure):
    _pack_ = 1  # force 1-byte alignment
    _fields_ = [
        ("a", ctypes.c_char),
        ("b", ctypes.c_int),
        ("c", ctypes.c_char),
        ("d", ctypes.c_double),
    ]

print(f"\nPackedStruct size: {ctypes.sizeof(PackedStruct)} bytes")  # 14 bytes

# Alignment demonstration: why alignment matters
print("\nAlignment of standard C types:")
for name, ctype in [
    ("c_char", ctypes.c_char),
    ("c_short", ctypes.c_short),
    ("c_int", ctypes.c_int),
    ("c_long", ctypes.c_long),
    ("c_float", ctypes.c_float),
    ("c_double", ctypes.c_double),
    ("c_void_p", ctypes.c_void_p),
]:
    print(f"  {name:12s}: size={ctypes.sizeof(ctype)}, alignment={ctypes.alignment(ctype)}")
Packed structs (_pack_ = 1 in ctypes, __attribute__((packed)) in C) eliminate padding but cause unaligned memory accesses. On x86-64 this is legal but slower - roughly 2x on accesses that straddle a cache line. On some ARM platforms it can raise a bus error (SIGBUS). Never use packed structs in performance-critical code unless you are serializing to disk or network.
Python's Memory Model: Everything is Heap-Allocated
In Python, every object lives on the heap. There is no such thing as a stack-allocated Python object. When you write x = 42 in a Python function, the integer object 42 is a heap-allocated C struct. The stack frame holds a pointer to that struct, not the struct itself.
This is the fundamental difference from C/C++: in C, int x = 42; puts the integer directly in the stack frame. In Python, the integer is a heap object and the frame just holds a reference (a pointer) to it.
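You can observe this directly - assignment copies the reference, never the object:

x = 42
y = x  # copies the pointer held in the frame, not the heap object
print(x is y, id(x) == id(y))  # True True - one heap object, two references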
The CPython representation of every Python object starts with a common header:
// Every Python object starts with this 16-byte header
typedef struct _object {
    Py_ssize_t ob_refcnt;    // 8 bytes: reference count
    PyTypeObject *ob_type;   // 8 bytes: pointer to type object
} PyObject;

// PyLongObject (Python int) - simplified
typedef struct {
    PyObject ob_base;        // 16 bytes: the header above
    Py_ssize_t ob_size;      // 8 bytes: number of digits (0 for zero)
    digit ob_digit[1];       // variable: actual digit array (4 bytes each)
} PyLongObject;

// Minimum size for a small int: 16 + 8 + 4 = 28 bytes
We can verify this from Python:
import sys

print("Size of common Python objects:")
print(f"  int(0):          {sys.getsizeof(0)} bytes")
print(f"  int(1):          {sys.getsizeof(1)} bytes")
print(f"  int(2**30):      {sys.getsizeof(2**30)} bytes")
print(f"  int(2**60):      {sys.getsizeof(2**60)} bytes")
print(f"  int(2**90):      {sys.getsizeof(2**90)} bytes")
print(f"  float:           {sys.getsizeof(1.0)} bytes")
print(f"  bool:            {sys.getsizeof(True)} bytes")
print(f"  empty str:       {sys.getsizeof('')} bytes")
print(f"  str 'a':         {sys.getsizeof('a')} bytes")
print(f"  str 'hello':     {sys.getsizeof('hello')} bytes")
print(f"  empty list:      {sys.getsizeof([])} bytes")
print(f"  list [1]:        {sys.getsizeof([1])} bytes")
print(f"  list [1,2,3]:    {sys.getsizeof([1,2,3])} bytes")
print(f"  list(range(10)): {sys.getsizeof(list(range(10)))} bytes")
print(f"  empty dict:      {sys.getsizeof({})} bytes")
print(f"  dict one entry:  {sys.getsizeof({'a': 1})} bytes")
print(f"  empty tuple:     {sys.getsizeof(())} bytes")
print(f"  tuple (1,2,3):   {sys.getsizeof((1,2,3))} bytes")
print(f"  None:            {sys.getsizeof(None)} bytes")

# Observe int growth: Python ints are arbitrary precision.
# Each 30-bit "digit" adds 4 bytes.
for bits in [0, 30, 60, 90, 120]:
    n = 2**bits if bits > 0 else 0
    print(f"  int(2**{bits:3d}): {sys.getsizeof(n)} bytes")

# sys.getsizeof is SHALLOW: it does not include referenced objects.
# A list's size is the list header + pointers to elements;
# each pointed-to object costs additional bytes.
def deep_sizeof(obj, seen=None):
    """Recursively compute total memory including all referenced objects."""
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    elif hasattr(obj, '__dict__'):
        size += deep_sizeof(obj.__dict__, seen)
    return size

# Python list vs NumPy array comparison - critical for ML
test_list = list(range(1000))
print(f"\nList of 1000 ints:")
print(f"  Shallow (list container only): {sys.getsizeof(test_list):>8,} bytes")
print(f"  Deep (all int objects):        {deep_sizeof(test_list):>8,} bytes")
# Ints in -5..256 are cached (no extra cost), but 257..999 are distinct objects

import numpy as np
arr = np.arange(1000, dtype=np.int64)
print(f"\nNumPy array of 1000 int64:")
print(f"  sys.getsizeof(arr):     {sys.getsizeof(arr):>8,} bytes")
print(f"  arr.nbytes (data only): {arr.nbytes:>8,} bytes")
# arr.nbytes is exact: 1000 * 8 = 8000 bytes, no per-element overhead.
# sys.getsizeof counts the buffer only when the array owns its data;
# a view would report just the small ndarray header.
The Small Integer Cache and String Interning
CPython pre-allocates integer objects for every value in the range [-5, 256]. When you write x = 42, you get a pointer to a pre-existing object, not a newly allocated one. This eliminates allocation overhead for common small integers:
import sys
import timeit

# Small integer cache: -5 to 256 are singletons
a = 42
b = 42
print(f"a is b (42): {a is b}")  # True - same object in memory
print(f"id(a) == id(b): {id(a) == id(b)}")

c = 257
d = 257
# In interactive mode these are usually different objects;
# in compiled .py files they may be the same due to constant folding
print(f"c is d (257): {c is d}")  # usually False interactively

# String interning: CPython automatically interns "identifier-like" strings
s1 = "hello"
s2 = "hello"
print(f"\n'hello' is 'hello': {s1 is s2}")  # True

s3 = "hello world"
s4 = "hello world"
print(f"'hello world' is same: {s3 is s4}")  # version-dependent (contains a space)

# Force interning for performance-critical string comparisons
s5 = sys.intern("hello world")
s6 = sys.intern("hello world")
print(f"interned 'hello world': {s5 is s6}")  # True

# Why interning matters for ML: feature names in a feature store.
# Without interning: string equality may compare characters, O(len).
# With interning: equality short-circuits on pointer identity.
feature_names = [sys.intern(f"feature_{i}") for i in range(1000)]

# Lookup speed comparison
regular_key = "feature_500"
interned_key = sys.intern("feature_500")
feature_dict = {sys.intern(f"feature_{i}"): i for i in range(1000)}

t_regular = timeit.timeit(lambda: feature_dict.get(regular_key), number=100000)
t_interned = timeit.timeit(lambda: feature_dict.get(interned_key), number=100000)
print(f"\nDict lookup (100K times):")
print(f"  Regular string key:  {t_regular*1000:.2f} ms")
print(f"  Interned string key: {t_interned*1000:.2f} ms")
The small integer cache and string interning are CPython implementation details, not language guarantees. Never use is to compare values - always use ==. The is operator tests object identity (same memory address), not value equality.
pymalloc: Python's Custom Allocator
pymalloc is the allocator that CPython uses for all Python objects of 512 bytes or smaller (in Python 3.x). It sits between CPython and the OS's malloc.
The pymalloc hierarchy has three levels:
Arenas - 256 KB chunks requested from the OS via mmap(). Arenas are the coarsest unit. When all pools in an arena are free, the arena is returned to the OS. New arenas are requested on demand as existing ones fill up.
Pools - 4 KB pages within an arena. Each pool serves exactly one size class. A pool serving 32-byte requests holds roughly 128 slots (4 KB divided by 32 bytes, less a small pool header). A pool is "full" when all slots are allocated, "used" when some are free, and "empty" when all slots are free.
Blocks - individual allocation units within a pool. Each block is exactly one size class. Size classes are multiples of 8 bytes: 8, 16, 24, 32, ..., 512 bytes.
The size class formula: a request of n bytes is rounded up to the next multiple of 8, i.e. size class = ceil(n / 8) * 8. So:
- 1-8 bytes - size class 8
- 20 bytes - size class 24 (4 bytes wasted per allocation: internal fragmentation)
- 505-512 bytes - size class 512 (the maximum pymalloc handles)
- 513 bytes or more - falls through to OS malloc
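The rounding is easy to reproduce. A small helper mirroring the formula (my own sketch, not CPython's actual code):

def pymalloc_size_class(n):
    """Round a request of n bytes up to the next multiple of 8."""
    return ((n + 7) // 8) * 8

for n in [1, 8, 20, 500, 512, 513]:
    cls = pymalloc_size_class(n)
    print(f"{n:4d} bytes -> {'OS malloc' if cls > 512 else f'size class {cls}'}")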
Arena (256 KB = 64 pools of 4 KB each)
+--Pool 0 (4 KB)--+--Pool 1 (4 KB)--+--...--+--Pool 63 (4 KB)--+
| size class: 32B | size class: 64B | | size class: 512B |
| 128 blocks | 64 blocks | | 8 blocks |
| [ ][ ][x][ ]... | [ ][x][x][ ].. | | [x][x][ ]... |
+------------------+-----------------+ +-------------------+
x = allocated, [ ] = free
The benefit is speed: allocating from a pool is a pointer dereference and linked-list update - essentially free compared to a general-purpose malloc. Cache locality is also excellent because objects of the same size live in the same pool.
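CPython can dump the allocator's internal state directly. sys._debugmallocstats() is an implementation-specific debugging hook that prints per-size-class statistics to stderr:

import sys

# CPython-specific: prints arena, pool, and block statistics to stderr.
# The output lists each size class, how many pools serve it, and how many
# blocks are in use - the arena/pool/block hierarchy described above, live.
sys._debugmallocstats()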
# Observing pymalloc behavior via tracemalloc
import tracemalloc
import gc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# Allocate many small dicts - all go through pymalloc
small_objects = [{'x': i, 'y': i * 2} for i in range(10000)]

snapshot2 = tracemalloc.take_snapshot()
stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Top memory increases after allocating 10K dicts:")
for stat in stats[:5]:
    print(f"  {stat}")
total_increase = sum(s.size_diff for s in stats)
print(f"Total increase: {total_increase:,} bytes")

del small_objects
gc.collect()

snapshot3 = tracemalloc.take_snapshot()
stats_after = snapshot3.compare_to(snapshot1, 'lineno')
print("\nAfter cleanup (should be near zero):")
total_remaining = sum(s.size_diff for s in stats_after if s.size_diff > 0)
print(f"Remaining increase: {total_remaining:,} bytes")

tracemalloc.stop()
Stack vs Heap in C: A Direct Comparison
In C, you have explicit control over where data lives:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Stack allocation: fast, automatic cleanup, size must be known at compile time
void stack_example() {
    int arr[1000];  // 4000 bytes on the stack
    arr[0] = 42;
    // arr is automatically freed when this function returns
    printf("Stack array[0] = %d\n", arr[0]);
}  // RSP += sizeof(arr): a single instruction reclaims the frame

// Heap allocation: flexible, manual cleanup required, can outlive the function
int* heap_example(int n) {
    int* arr = malloc(n * sizeof(int));
    if (arr == NULL) {
        return NULL;  // malloc failed (OOM)
    }
    for (int i = 0; i < n; i++) {
        arr[i] = i * i;
    }
    return arr;  // caller MUST call free(arr)
}

// Variable-length array (C99): stack allocated, runtime size - dangerous
void vla_example(int n) {
    int arr[n];  // stack allocated but size known only at runtime
    // If n is a few million ints (8+ MB), this WILL overflow the stack
    memset(arr, 0, n * sizeof(int));
    printf("VLA arr[0] = %d\n", arr[0]);
}

int main() {
    stack_example();

    // 4 MB on the heap: fine
    int* heap_arr = heap_example(1000000);
    if (heap_arr) {
        printf("heap_arr[999999] = %d\n", heap_arr[999999]);
        free(heap_arr);  // required: without this, memory leak
    }

    // Safe for small n
    vla_example(100);

    // UNSAFE: a 12 MB VLA cannot fit on an 8 MB stack
    // vla_example(3000000);  // stack overflow
    return 0;
}
In Python, you never write malloc or free - but equivalent operations happen every time you create or delete an object. The key difference is that Python's GC handles the free side automatically, which eliminates use-after-free and double-free bugs but adds overhead and makes memory behavior less predictable.
Process Virtual Memory Layout
For a Python process, the "heap region" is where pymalloc arenas live, along with NumPy arrays (which use the OS allocator directly for large buffers) and any C extension data. The text segment holds the machine code of the CPython interpreter itself (the python3 binary).
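On Linux, you can watch these regions directly through procfs. A minimal sketch (Linux-only) that filters this process's memory map for the interesting segments:

# Each line of /proc/self/maps is one mapping: address range, permissions,
# and what backs it - the python3 binary (text segment), [heap], [stack],
# and anonymous mmap regions where pymalloc arenas and NumPy buffers live.
with open("/proc/self/maps") as f:
    for line in f:
        if any(tag in line for tag in ("python", "[heap]", "[stack]")):
            print(line.rstrip())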
Production Engineering Notes
Measuring Actual Object Memory in Python
sys.getsizeof is shallow - it does not include referenced objects. For a list of dictionaries (a common batch representation in ML), you need deep measurement:
import sys

def total_size(obj, seen=None):
    """
    Total memory footprint including all transitively referenced objects.
    Handles circular references via the seen set.
    """
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(v, seen) for v in obj.values())
        size += sum(total_size(k, seen) for k in obj.keys())
    elif hasattr(obj, '__dict__'):
        size += total_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum(total_size(i, seen) for i in obj)
    return size

# Simulate a training batch: 32 examples, each with 512 token ids and a label
batch = [
    {'input_ids': list(range(512)), 'label': i % 10}
    for i in range(32)
]

shallow = sys.getsizeof(batch)
deep = total_size(batch)
print(f"Batch shallow size: {shallow:>10,} bytes ({shallow/1024:.1f} KB)")
print(f"Batch deep size:    {deep:>10,} bytes ({deep/1024/1024:.2f} MB)")

# The same batch as NumPy, for comparison
import numpy as np
batch_ids = np.array([[i % 1000 for i in range(512)] for _ in range(32)], dtype=np.int32)
batch_labels = np.arange(32, dtype=np.int32)
numpy_bytes = batch_ids.nbytes + batch_labels.nbytes
print(f"\nSame batch as NumPy: {numpy_bytes:>10,} bytes ({numpy_bytes/1024:.1f} KB)")
print(f"Memory ratio (Python list vs NumPy): {deep/numpy_bytes:.1f}x")
Stack Size for Python Threads
When creating threads for data loading, be aware of per-thread stack costs:
import threading
# Each Python thread gets its own stack
# Default: OS stack size (usually 8 MB on Linux/macOS)
print(f"Default thread stack size: {threading.stack_size()} bytes (0 = OS default)")
# For DataLoader workers doing simple preprocessing, 2 MB is usually enough
# This reduces virtual address space consumption by 75% per thread
threading.stack_size(2 * 1024 * 1024) # 2 MB per thread
print(f"Set stack size to: {threading.stack_size():,} bytes")
# Example: 32 workers
# Default: 32 * 8 MB = 256 MB virtual reserved for stacks
# With 2MB: 32 * 2 MB = 64 MB virtual reserved for stacks
# (Virtual memory, not physical - but still relevant for address space limits)
threading.stack_size(0) # restore OS default before creating threads
NumPy Memory Layout and Cache Performance
NumPy arrays bypass pymalloc for large allocations and call the OS allocator directly. Be careful when measuring them: sys.getsizeof counts the data buffer only when the array owns its memory; for views it returns just the ndarray header (~112 bytes). arr.nbytes always reports the size of the underlying data:
import numpy as np
import sys
import time

arr = np.zeros((1000, 1000), dtype=np.float32)
view = arr[:, :]  # a view: does not own the data
print(f"sys.getsizeof(view): {sys.getsizeof(view)} bytes")  # header only (~112)
print(f"arr.nbytes: {arr.nbytes:,} bytes")  # 4,000,000 (actual data)
print(f"arr.itemsize: {arr.itemsize} bytes")
print(f"arr.dtype: {arr.dtype}")
print(f"C-contiguous: {arr.flags['C_CONTIGUOUS']}")  # row-major
print(f"F-contiguous: {arr.flags['F_CONTIGUOUS']}")  # column-major

# Memory layout affects cache performance dramatically.
# C-order: arr[i, j] is adjacent to arr[i, j+1] in memory (row-major)
# F-order: arr[i, j] is adjacent to arr[i+1, j] in memory (column-major)
c_arr = np.zeros((2000, 2000), dtype=np.float64, order='C')
f_arr = np.zeros((2000, 2000), dtype=np.float64, order='F')

start = time.perf_counter()
for i in range(2000):
    _ = c_arr[i, :].sum()  # row access on C-order: cache-friendly
c_time = time.perf_counter() - start

start = time.perf_counter()
for i in range(2000):
    _ = f_arr[i, :].sum()  # row access on F-order: each element is 2000*8 bytes apart
f_time = time.perf_counter() - start

print(f"\nC-order row sum (2000x2000): {c_time*1000:.2f} ms")
print(f"F-order row sum (2000x2000): {f_time*1000:.2f} ms")
print(f"Cache impact: {f_time/c_time:.1f}x slower for column-major")
Inspecting Python Frame Objects
Python frames are C structs (PyFrameObject) that live on the heap. You can inspect them at runtime:
import sys
import inspect
def analyze_frame():
frame = sys._getframe()
print(f"Frame object size: {sys.getsizeof(frame)} bytes")
print(f"f_code.co_name: {frame.f_code.co_name}")
print(f"f_locals: {list(frame.f_locals.keys())}")
print(f"f_lineno: {frame.f_lineno}")
print("\nCall stack:")
for i, (f, lineno) in enumerate(inspect.stack()):
print(f" [{i}] {f.filename}:{lineno} in {f.function}")
if i >= 3:
break
analyze_frame()
Storing traceback objects (e.g., sys.exc_info()[2]) keeps all local variables from that frame alive. This is a common memory leak pattern: a function with a 500 MB tensor raises an exception, you store sys.exc_info() for logging, and the 500 MB tensor stays allocated until the traceback object is garbage collected. Always clear tracebacks with tb = None or use traceback.format_exc() to extract the string and drop the object.
# Demonstration of traceback-induced memory retention
import sys
import gc

def leaky_function():
    huge_data = list(range(1_000_000))  # ~8 MB for the list alone, plus int objects
    try:
        raise ValueError("something went wrong")
    except ValueError:
        # Storing sys.exc_info()[2] keeps the frame alive;
        # frame.f_locals includes huge_data
        tb = sys.exc_info()[2]
        return tb  # returning tb keeps huge_data alive

print("Before leaky_function:")
gc.collect()
tb = leaky_function()

# huge_data is still in memory because:
# tb -> PyTracebackObject -> PyFrameObject -> f_locals -> huge_data
print("Traceback alive, huge_data is retained in memory")

# Fix: clear the traceback reference
tb = None
gc.collect()
print("After tb = None: huge_data can now be collected")
Interview Q&A
Q1: What is the difference between stack and heap memory?
The stack is a LIFO region of virtual memory used for function call bookkeeping: return addresses, local variables, and saved registers. It is managed automatically by the CPU (RSP register), and allocation/deallocation is a single instruction. Stack size is fixed (typically 8 MB per thread on Linux). The heap is dynamic memory managed by the allocator (malloc/free). Objects can be any size and live for arbitrary durations. Heap allocation requires finding a free block (slower than stack), but there is no inherent size limit beyond physical memory. In Python, all objects live on the heap; the stack holds only C-level execution frames for the CPython interpreter itself.
Q2: Why does Python have a recursion limit, and what happens if you remove it?
Python's default recursion limit of 1000 is a soft limit that causes a clean RecursionError before the C-level stack overflows. Without this limit, deep recursion would overflow the OS stack (typically 8 MB) and cause a segmentation fault - a hard crash with no Python traceback, making debugging nearly impossible. You can raise the limit with sys.setrecursionlimit(), but this does not increase the OS stack size. Setting it too high means you get a segfault instead of a RecursionError. The correct approach for deeply recursive algorithms is to convert to iteration or use an explicit stack data structure.
Q3: Why is sys.getsizeof([1, 2, 3]) only 120 bytes when each integer is 28 bytes?
sys.getsizeof returns the shallow size of the object - the size of the list container itself (the PyListObject header plus 8-byte pointers to elements). A list of 3 elements holds 3 pointers (8 bytes each) plus a header of roughly 56 bytes, plus any over-allocation reserved for append performance - on the order of 80-120 bytes depending on the CPython version and how the list was built. Each integer 1, 2, 3 is a separate heap-allocated PyLongObject of 28 bytes. The list stores pointers to these objects, not the objects themselves, so the total (deep) memory is approximately 120 + 3 × 28 = 204 bytes. This distinction is critical for ML: a Python list of 1 million floats uses about 8 MB just for the list's pointer array, plus roughly 24 MB for the float objects themselves (24 bytes each). A NumPy array of 1 million float64 values uses exactly 8 MB with no per-element overhead.
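A quick check of that arithmetic (exact numbers vary across CPython versions and how the list was constructed):

import sys

lst = [1, 2, 3]
shallow = sys.getsizeof(lst)                         # container + 3 pointers only
deep = shallow + sum(sys.getsizeof(x) for x in lst)  # plus 28 bytes per int
print(f"shallow={shallow}, deep={deep}")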
Q4: What is pymalloc and why does Python use it instead of calling malloc directly?
pymalloc is CPython's custom memory allocator for objects smaller than 512 bytes. It pre-allocates 256 KB arenas from the OS, divides them into 4 KB pools, and divides pools into fixed-size blocks (8 to 512 bytes in multiples of 8). This design has two advantages over calling malloc for every Python object. First, per-allocation overhead is near zero - finding a free block is a pointer dereference and a free-list update, versus glibc's general-purpose block search which may take ~100 ns. Second, cache locality is excellent - objects of the same size live in the same pool, so iterating over a list of same-type Python objects has few L1/L2 cache misses. The downside is internal fragmentation: a 20-byte object uses a 24-byte block (4 bytes wasted).
Q5: What is struct padding, and when does it matter for ML workloads?
Struct padding is extra bytes the compiler inserts between struct fields to ensure each field is at a memory address divisible by its natural alignment (4 bytes for int, 8 bytes for double). Without alignment, the CPU may require two memory fetches to read a single value, or raise a bus error on ARM. Padding matters for ML when designing ctypes structures for C extension interfaces, when using NumPy structured arrays with heterogeneous fields, and when analyzing the memory layout of model state dictionaries. For example, a struct with the field order char, int, char, double is 24 bytes due to padding; reordering to double, int, char, char reduces it to 16 bytes. For one million records, this is 24 MB versus 16 MB - a 33% reduction with a trivial code change.
Q6: Why are Python objects always heap-allocated rather than stack-allocated?
Three reasons. First, Python's reference-counting garbage collector needs to track every object's lifetime. Stack-allocated objects would disappear when the function returns, making it impossible to track references to them from other objects (e.g., closures, global variables). Second, Python closures can capture local variables that need to outlive the function call - impossible if those variables are stack-allocated and reclaimed on function return. Third, Python's type system is fully dynamic - the size of a list or dict may grow after creation, making fixed-size stack allocation impossible. The trade-off is that every Python object has a minimum 16-byte header overhead and requires a heap allocation, which is why NumPy, PyTorch, and other performance libraries store data in C arrays that bypass Python's object system.
Q7: How does string interning work in CPython, and when should you use sys.intern()?
CPython automatically interns strings that look like Python identifiers (alphanumeric and underscores, no spaces) and string literals appearing in compiled bytecode. Interning means there is at most one copy of the string in memory - all variables referencing it point to the same object. The benefit is O(1) equality comparison (a pointer comparison, as with is, instead of character-by-character ==). sys.intern() forces interning of any string. Use it when: you have a large dict where the same string keys appear millions of times (e.g., feature names in a feature store), you are building a string-based cache and want O(1) lookups, or you have a vocabulary table with repeated token strings. The cost is that interned strings are effectively never freed - they live in a global intern table for the process lifetime. Never intern arbitrary user input or content from external sources.
