
Garbage Collection Algorithms

The Production Scenario

A model serving system was handling 500 requests per second with a median latency of 12 ms. P99 latency was 85 ms - acceptable. Then someone added a feature: the server now returns detailed per-token attribution scores alongside each prediction. The attribution computation creates a lot of intermediate Python objects - lists of floats, small dicts, nested structures. The feature shipped. P99 latency jumped to 340 ms. Median latency stayed at 12 ms. Nothing else changed.

The on-call engineer spent two days profiling. CPU usage was unchanged. The response computation itself was not slower. The 340 ms latency spikes happened unpredictably, roughly every few hundred milliseconds, and they hit all concurrent requests at once - not just the one triggering the attribution computation. That "all at once" detail was the clue.

The cause was CPython's generational garbage collector. The new feature was creating enough short-lived Python objects that the generation 0 collection threshold was being crossed every few hundred milliseconds. When the GC runs, it briefly pauses the Python thread to collect cycles. Against a 12 ms median response time, a 2-4 ms GC pause adds 17-33% to every request being processed in that instant. The P99 was catching the moments when a GC run overlapped with several concurrent requests.

The fix was three lines: tune the GC thresholds to collect less frequently, and disable the GC entirely during the hot path. P99 dropped to 95 ms. The lesson is not "disable GC everywhere." The lesson is that GC is a background service with real latency costs, and understanding its algorithm tells you exactly when and why it fires.
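
What that fix might look like - a sketch rather than the incident's exact code, with an illustrative threshold value and a made-up handler name:

import gc

# Raise the gen0 threshold so automatic collections fire far less often.
# The default is (700, 10, 10); 50_000 is an illustrative value, not the
# number used in the incident.
gc.set_threshold(50_000, 10, 10)

def run_inference(request):
    # Stand-in for the real model call in this sketch.
    return {"prediction": 0.5, "attributions": [0.1] * len(request["tokens"])}

def handle_request(request):
    # Keep the cyclic collector out of the hot path. Reference counting still
    # frees acyclic temporaries immediately, so memory does not grow without bound.
    gc.disable()
    try:
        return run_inference(request)
    finally:
        gc.enable()

print(handle_request({"tokens": [1, 2, 3]}))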

This lesson covers how CPython's reference counting and cyclic GC work, when they run, what they cost, and how to control them in production. We will also look at Java's GC algorithms for comparison - the lessons from JVM GC tuning apply directly to ML serving.


Why This Exists

The alternative to garbage collection is manual memory management: the programmer explicitly allocates and frees every object. C and C++ work this way. Manual management is fast (you control exactly when frees happen) but error-prone. The two most common bugs in C programs are use-after-free (accessing memory after calling free) and memory leaks (forgetting to call free). In a production ML system with millions of allocations per second, manual management is not feasible.

Garbage collection solves this by automating the determination of when objects are no longer needed. The key insight is that an object is "dead" when no live part of the program can ever reach it again - there is no pointer path from any live variable to that object. GC algorithms differ in how they detect this condition and when they run the detection.

Python uses two complementary approaches: reference counting for immediate collection of objects with no cycles, and a cyclic garbage collector for objects that form reference cycles that reference counting cannot break. Understanding when each fires, and what it costs, is the foundation for writing memory-efficient Python for ML.


Historical Context

Reference counting was invented by George Collins at IBM in 1960. The idea is simple: each object maintains a count of how many references point to it. When the count reaches zero, the object is immediately freed. This was used in IBM's list processing languages and later adopted by many Lisp dialects.

The fatal flaw of pure reference counting - its inability to collect cycles - was recognized almost immediately. If object A holds a reference to object B and object B holds a reference to object A, their counts will never reach zero even if nothing else references either of them.

Mark-and-sweep GC was invented by John McCarthy in 1960 as well, in the same paper that introduced Lisp. The idea: start from all "roots" (global variables, stack variables), mark every reachable object, then sweep through all known objects and collect the unmarked ones. This finds cycles. Mark-and-sweep requires pausing the program while marking (hence "stop-the-world") which causes latency spikes.

CPython adopted reference counting from the beginning (Guido van Rossum, 1991). The cyclic collector was added in Python 2.0 (2000) by Neil Schemenauer, primarily because real-world Python programs accumulated cyclic garbage (especially objects with __del__ methods in cycles). The generational design (three generations) was added at the same time, inspired by the Ungar generational hypothesis from Smalltalk research (David Ungar, 1984): most objects die young, so collecting young objects frequently and old objects rarely is more efficient than treating all objects equally.


Core Concepts

Reference Counting: The Primary Mechanism

Every CPython object contains a reference count field (ob_refcnt). Every time a new reference to an object is created (assignment, function argument, container append), the count increments. Every time a reference is destroyed (variable goes out of scope, container element deleted, reassignment), the count decrements. When the count hits zero, the object is immediately freed.

import sys

# Observe reference counts directly
x = [1, 2, 3]
print(f"After x = [...]: refcount = {sys.getrefcount(x)}")
# sys.getrefcount itself adds a temporary reference, so result is 2, not 1

y = x # second reference
print(f"After y = x: refcount = {sys.getrefcount(x)}") # 3

z = [x, x] # two more references from the list
print(f"After z = [x, x]: refcount = {sys.getrefcount(x)}") # 5

del z # removes 2 references
print(f"After del z: refcount = {sys.getrefcount(x)}") # 3

y = None # removes 1 reference
print(f"After y = None: refcount = {sys.getrefcount(x)}") # 2
# (still 2 because sys.getrefcount adds a temporary ref)

del x
# refcount drops to 0, object is immediately freed
# No waiting for GC - this happens right here, right now

# Measuring when objects are freed
class TrackedObject:
    def __init__(self, name):
        self.name = name
        print(f" Allocated: {self.name}")

    def __del__(self):
        print(f" Freed: {self.name}")

print("\nReference counting in action:")
print("Creating a:")
a = TrackedObject("A")
print("Creating b = a:")
b = a
print("Deleting a:")
del a # refcount: 2 -> 1, NOT freed yet
print("Deleting b:")
del b # refcount: 1 -> 0, FREED immediately
print("Both deleted")

The beauty of reference counting is its immediacy: objects are freed the instant they become unreachable. There is no GC pause, no accumulation of dead objects. For simple acyclic object graphs (no circular references), this is essentially perfect garbage collection.

The cost is per-operation overhead: every assignment, every function call, every return statement must atomically increment or decrement a reference count. In a multithreaded context, these need to be atomic operations (CPython uses a GIL to avoid this, but the count manipulation still has non-zero cost).
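
Even a plain function call adds a reference, which you can observe directly (the helper below is just for illustration):

import sys

def observe(obj):
    # The parameter `obj` is itself a new reference to the list, so the count
    # seen inside the call is higher than the count at the call site.
    return sys.getrefcount(obj)

data = [1, 2, 3]
print(sys.getrefcount(data))  # count at the call site (includes getrefcount's temporary)
print(observe(data))          # higher: the argument binding added a reference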

The Cycle Problem

Reference counting fails for cycles: if object A holds a reference to object B and B holds a reference to A, both counts are at least 1 even if nothing else references them. They will never be freed by reference counting alone.

import gc

# Create a reference cycle
class Node:
    def __init__(self, name):
        self.name = name
        self.next = None

    def __del__(self):
        print(f" Freed: {self.name}")

# Disable automatic GC to observe the cycle problem clearly
gc.disable()

print("Creating cycle: A -> B -> A")
a = Node("A")
b = Node("B")
a.next = b # A holds reference to B
b.next = a # B holds reference to A

# Now delete the external references
del a
del b
# Despite del, neither __del__ is called yet
# Because: A.refcount = 1 (from b.next=a), B.refcount = 1 (from a.next=b)
# They are unreachable but not freed

print("After del a, del b - still not freed (cycle!)")

# Manual collection finds and breaks the cycle
print("Calling gc.collect()...")
collected = gc.collect()
print(f" gc.collect() freed {collected} objects")
# Now __del__ fires for both A and B

gc.enable()

Cycles arise more often than you might expect in Python:

# Common cycle patterns in Python ML code

# Pattern 1: Object with reference to its container
class Dataset:
    def __init__(self):
        self.items = []
        for i in range(1000):
            item = {'data': list(range(100)), 'dataset': self}  # cycle!
            self.items.append(item)

# Pattern 2: Parent-child with back-reference
class TreeNode:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent  # back-reference creates cycle if parent has child list
        self.children = []

    def add_child(self, value):
        child = TreeNode(value, parent=self)
        self.children.append(child)
        return child

# Pattern 3: Closure capturing a reference to an object that captures the closure
class Model:
    def __init__(self):
        self.callback = lambda: self  # closure references self, self holds closure

# Pattern 4: Exception with traceback (very common in ML error handling)
import sys

def function_with_big_tensors():
    data = list(range(1_000_000))  # large local variable
    raise RuntimeError("something failed")

try:
    function_with_big_tensors()
except RuntimeError:
    exc_type, exc_value, exc_tb = sys.exc_info()
    # exc_tb -> frame -> f_locals -> data: the list is still alive!
    # Fix: exc_tb = None when done with it

print("Cycle patterns demonstrated")

Mark-and-Sweep and Generational Collection

CPython's cyclic collector uses a simplified mark-and-sweep algorithm applied to a "generation" system. A collection works in roughly three phases. First, for every tracked container object in the generation being collected, the collector copies its reference count and subtracts the references that come from other objects inside that same set. Second, any object whose adjusted count is still above zero must be referenced from outside the set, so it and everything reachable from it is marked as alive. Third, whatever remains unmarked is unreachable cyclic garbage and is freed.

The three generations implement the generational hypothesis: most objects die young, so it is efficient to collect young objects frequently and old objects rarely.

  • Generation 0 (gen0): newly allocated objects. Collected when the number of allocations minus deallocations exceeds the threshold (default: 700).
  • Generation 1 (gen1): objects that survived one gen0 collection. Collected when gen0 has been collected more than 10 times.
  • Generation 2 (gen2): objects that survived a gen1 collection. Collected when gen1 has been collected more than 10 times.

The default thresholds are (700, 10, 10):

import gc

# Inspect GC configuration
print(f"GC enabled: {gc.isenabled()}")
print(f"Thresholds: {gc.get_threshold()}") # (700, 10, 10)
print(f"Counts: {gc.get_count()}") # current allocation counts per generation

# Manually trigger collection
gc.collect(0) # collect gen0 only
gc.collect(1) # collect gen0 and gen1
gc.collect(2) # full collection: all generations
gc.collect() # same as gc.collect(2)

# Count of objects tracked by GC
tracked = gc.get_objects()
print(f"\nObjects tracked by GC: {len(tracked)}")
# Note: this includes all container objects (list, dict, set, instances)
# but NOT non-container types (int, float, str, bytes, None)

# Observe collection statistics
gc.set_debug(gc.DEBUG_STATS)
# Now any gc.collect() will print statistics
gc.collect()
gc.set_debug(0) # turn off debug output

# Check if there are any uncollectable objects (rare, usually finalizer issues)
uncollectable = gc.garbage
print(f"\nUncollectable objects: {len(uncollectable)}")

GC Pauses and Latency

The cyclic GC is stop-the-world: while it runs, the Python thread cannot execute any user code. This is the source of the latency spikes described in the opening scenario.

How long does a GC pause take? It depends on the number of tracked objects. For a typical ML serving process with millions of Python objects, a full gen2 collection can take 5-50 ms. Gen0 collection (the most frequent) usually takes 0.1-2 ms.

import gc
import time
import statistics

def measure_gc_pause():
    """Measure GC pause time by forcing collection and timing it."""
    gc.disable()

    # Create a lot of short-lived objects to accumulate garbage
    garbage_cycles = []
    for i in range(10000):
        a = {'ref': None}
        b = {'ref': a}
        a['ref'] = b  # create a cycle
        garbage_cycles.append((a, b))

    # Drop references but don't trigger GC yet
    del garbage_cycles

    # Now measure how long collection takes
    pauses = []
    for _ in range(20):
        # Create more garbage
        for i in range(500):
            a = {'ref': None}
            b = {'ref': a}
            a['ref'] = b

        start = time.perf_counter()
        collected = gc.collect()
        elapsed = time.perf_counter() - start
        pauses.append(elapsed * 1000)

    gc.enable()

    print("GC pause statistics (ms):")
    print(f" Min: {min(pauses):.3f} ms")
    print(f" Median: {statistics.median(pauses):.3f} ms")
    print(f" Max: {max(pauses):.3f} ms")
    print(f" Mean: {statistics.mean(pauses):.3f} ms")

measure_gc_pause()

Disabling GC for Performance-Critical Sections

The most impactful GC optimization for ML serving is to disable automatic collection during hot paths and collect during idle periods:

import gc
import time

# Benchmark: with vs without GC during allocation-heavy code
def create_objects(n):
    """Simulate a workload that creates many intermediate Python objects."""
    result = []
    for i in range(n):
        item = {
            'id': i,
            'features': [float(j) for j in range(20)],
            'meta': {'timestamp': time.time(), 'version': 1}
        }
        result.append(item)
    return len(result)

# With GC enabled (default)
gc.enable()
gc.collect()
start = time.perf_counter()
for _ in range(10):
    count = create_objects(5000)
with_gc = time.perf_counter() - start

# With GC disabled
gc.disable()
gc.collect() # manual pre-clean
start = time.perf_counter()
for _ in range(10):
    count = create_objects(5000)
without_gc = time.perf_counter() - start
gc.enable()
gc.collect() # clean up after

print(f"With GC: {with_gc*1000:.1f} ms")
print(f"Without GC: {without_gc*1000:.1f} ms")
print(f"Speedup: {with_gc/without_gc:.2f}x")

# Production pattern: disable GC during request handling,
# collect between requests
class MLServer:
    def __init__(self, model):
        self.model = model
        self._request_count = 0

    def handle_request(self, request):
        gc.disable()
        try:
            result = self._process(request)
            self._request_count += 1
            return result
        finally:
            gc.enable()
            # Collect only every N requests to amortize pause cost
            if self._request_count % 100 == 0:
                gc.collect()

    def _process(self, request):
        # hot path: no GC interruptions
        return {"prediction": 0.5}

Tuning GC Thresholds

Rather than disabling GC entirely, you can tune thresholds to reduce collection frequency:

import gc
import time

# Default thresholds
default_thresholds = gc.get_threshold()
print(f"Default thresholds: {default_thresholds}") # (700, 10, 10)

# For ML serving with large models and few cycles:
# Increase gen0 threshold to reduce collection frequency
# Trade-off: more memory consumption between collections, fewer pauses
gc.set_threshold(10000, 10, 10) # collect gen0 only after 10K allocations
print(f"New thresholds: {gc.get_threshold()}")

# For training loops that create many temporary tensors:
# Even higher threshold, or just disable and collect manually at epoch boundaries
gc.set_threshold(50000, 10, 10)

# Benchmark: threshold impact on throughput
def run_workload():
    results = []
    for i in range(1000):
        a = {'data': list(range(10)), 'meta': {'i': i}}
        b = {'ref': a, 'extra': [1, 2, 3]}
        a['parent'] = b  # cycle
        results.append(id(a))
    return results

gc.set_threshold(700, 10, 10) # default
gc.collect()
start = time.perf_counter()
for _ in range(50):
    run_workload()
t_default = time.perf_counter() - start

gc.set_threshold(50000, 10, 10) # high threshold
gc.collect()
start = time.perf_counter()
for _ in range(50):
    run_workload()
t_high = time.perf_counter() - start

# Restore default
gc.set_threshold(*default_thresholds)
gc.collect()

print(f"Default threshold: {t_default*1000:.1f} ms")
print(f"High threshold: {t_high*1000:.1f} ms")
print(f"Speedup: {t_default/t_high:.2f}x")

Weak References

A weak reference is a reference that does not increment the reference count. It "weakly" holds an object - the object can still be garbage collected, and once it is, calling the weak reference returns None.

Weak references solve the problem of caches that should not prevent objects from being collected:

import weakref
import gc

class ExpensiveModel:
    def __init__(self, name):
        self.name = name
        self.weights = list(range(100000))  # simulate large model weights
        print(f" Loaded: {self.name}")

    def predict(self, x):
        return sum(self.weights[:10])

    def __del__(self):
        print(f" Unloaded: {self.name}")

# Pattern 1: Weak reference to avoid preventing GC
model = ExpensiveModel("bert-base")

# Strong reference: model stays alive as long as ref exists
strong_ref = model

# Weak reference: does not prevent GC
weak_ref = weakref.ref(model)

print(f"Object alive via weak_ref: {weak_ref() is not None}")
print(f"Can still use it: {weak_ref().predict(1)}")

del model
del strong_ref
# Now no strong references remain - model can be collected

gc.collect()

print(f"Object alive after del: {weak_ref() is not None}")
# weak_ref() returns None once the object is collected

# Pattern 2: WeakValueDictionary for model caching
# Cache that does not prevent GC of unused models
model_cache = weakref.WeakValueDictionary()

def load_model(name):
    if name in model_cache:
        print(f" Cache hit: {name}")
        return model_cache[name]
    print(f" Cache miss, loading: {name}")
    model = ExpensiveModel(name)
    model_cache[name] = model
    return model

print("\nModel cache demo:")
m1 = load_model("gpt2")
m2 = load_model("gpt2") # cache hit
print(f"m1 is m2: {m1 is m2}") # True: same object

del m1
del m2
gc.collect() # model is now garbage collected

m3 = load_model("gpt2") # cache miss: model was collected
del m3

# Pattern 3: weakref.proxy for transparent weak reference
model = ExpensiveModel("resnet50")
proxy = weakref.proxy(model)
print(f"\nProxy prediction: {proxy.predict(1)}") # works like normal reference
del model
gc.collect()
# proxy.predict(1) would now raise ReferenceError

The __del__ Finalizer Problem

Objects with __del__ methods interact badly with the cyclic GC. Before Python 3.4, if two objects in a cycle both had __del__, the GC could not determine a safe order to run the finalizers, so it moved the whole cycle to gc.garbage (a global list) instead of freeing it - a memory leak. Since Python 3.4 (PEP 442), such cycles are collected and the finalizers do run, but the order is arbitrary and a finalizer that resurrects an object keeps the whole group alive:

import gc

gc.set_debug(gc.DEBUG_UNCOLLECTABLE)

# Objects with __del__ in cycles are uncollectable in Python 3.3 and earlier
# Python 3.4+ improved this, but finalizers still interact with GC in subtle ways

class A:
    def __init__(self):
        self.b = None

    def __del__(self):
        pass  # finalizer - even a no-op causes issues in cycles

class B:
    def __init__(self):
        self.a = None

    def __del__(self):
        pass

a = A()
b = B()
a.b = b # cycle
b.a = a

del a
del b
gc.collect()

print(f"gc.garbage contents: {len(gc.garbage)} objects")
# On Python 3.4+ this is typically 0: the cycle is collected and both finalizers run

# Clean up
gc.garbage.clear()
gc.set_debug(0)

# The fix: use __del__ only when absolutely necessary
# Better pattern: context manager or explicit cleanup method
class ResourceManager:
    def __init__(self, resource_name):
        self.resource_name = resource_name
        self._closed = False

    def close(self):
        if not self._closed:
            print(f"Releasing: {self.resource_name}")
            self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False

# Correct usage: explicit close via context manager
with ResourceManager("gpu_memory_pool") as rm:
    pass  # rm.close() is called automatically
danger

Never rely on __del__ for critical resource cleanup (file handles, GPU memory, database connections) in Python. The timing of __del__ calls is implementation-defined and reference-dependent. Objects in reference cycles with __del__ methods may never be collected at all. Always use context managers (with statements) for deterministic resource cleanup.

tracemalloc: Tracking Allocations

tracemalloc is Python's built-in allocation tracker. It records the call site of every allocation, enabling before/after comparisons:

import tracemalloc
import linecache
import os

def display_top(snapshot, key_type='lineno', limit=10):
    """Display the top memory consumers from a tracemalloc snapshot."""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    print(f"\nTop {limit} memory consumers:")
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print(f" #{index}: {filename}:{frame.lineno}: "
              f"{stat.size/1024:.1f} KB")
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print(f" {line}")

# Start tracking with 10-frame tracebacks
tracemalloc.start(10)
snapshot1 = tracemalloc.take_snapshot()

# Simulate a data loading operation with a memory leak
leaky_cache = [] # module-level list that grows

def load_batch(batch_id):
    batch_data = {
        'ids': list(range(1024)),
        'embeddings': [[float(j) for j in range(768)] for _ in range(16)],
        'batch_id': batch_id,
    }
    leaky_cache.append(batch_data)  # leak: never cleared
    return batch_data

for i in range(20):
    load_batch(i)

snapshot2 = tracemalloc.take_snapshot()

# Compare snapshots
stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Memory increase by location:")
for stat in stats[:5]:
    print(f" {stat}")

total_increase = sum(s.size_diff for s in stats if s.size_diff > 0)
print(f"\nTotal increase: {total_increase/1024:.1f} KB")

# Per-line report of the largest allocations in the second snapshot
display_top(snapshot2)

# Clean up
leaky_cache.clear()
tracemalloc.stop()

objgraph: Visualizing Reference Cycles

objgraph is a third-party library that makes reference cycles visible:

# Install: pip install objgraph
import objgraph
import gc

# Show the most common object types in memory
print("Most common object types:")
objgraph.show_most_common_types(limit=15)

# Objects the GC can find no referrers for - usually a sign of reference-counting
# bugs in C extensions rather than ordinary Python-level leaks
leaks = objgraph.get_leaking_objects()
print(f"Objects with no known referrers: {len(leaks)}")

# Create some cycles for demonstration
class Container:
    pass

cycles = []
for i in range(100):
    a = Container()
    b = Container()
    a.ref = b
    b.ref = a  # cycle
    cycles.append(a)

# Show growth in Container objects
print("\nGrowth in Container type:")
objgraph.show_growth(shortnames=True)

# Show what holds references to a specific object
obj = cycles[0]
print(f"\nBackrefs to cycles[0]:")
objgraph.show_backrefs(
    obj,
    max_depth=3,
    too_many=10,
    # filename="backrefs.png",  # uncomment to generate a graph image
)

# Find all reference cycles
print("\nAll reference cycles in gc.garbage:")
gc.set_debug(gc.DEBUG_SAVEALL)
del cycles
gc.collect()
print(f"Collected cyclic garbage: {len(gc.garbage)} objects")

# Find which types are most common in garbage
from collections import Counter
garbage_types = Counter(type(obj).__name__ for obj in gc.garbage)
print("Types in garbage:")
for type_name, count in garbage_types.most_common(5):
    print(f" {type_name}: {count}")

gc.garbage.clear()
gc.set_debug(0)

Java GC Algorithms for Comparison

Understanding Java's GC landscape is valuable because ML serving infrastructure often includes JVM components, and the design trade-offs are instructive. The JVM offers a menu of collectors: the throughput-oriented Parallel collector, G1 (the default since Java 9), which divides the heap into regions and collects them incrementally against a configurable pause-time goal, and the mostly-concurrent low-pause collectors ZGC and Shenandoah, which keep pauses in the low-millisecond range even on large heaps.

The lessons from JVM GC tuning that apply directly to Python ML serving:

  1. Pause budget over throughput - for serving, a 50 ms pause hitting all concurrent requests is worse than 5% throughput reduction. Configure for predictable latency, not maximum throughput.

  2. Young generation tuning - allocating more memory to the young generation (equivalent to raising Python's gen0 threshold) means fewer collections of long-lived objects, reducing major pause frequency.

  3. Explicit collection at idle - GC in idle periods (between batches, during model reload) means pauses do not hit user requests. The equivalent in Python is calling gc.collect() in a background thread during low-traffic windows.

  4. Reduce allocation rate - the best GC optimization is to allocate less. In Python, this means using NumPy arrays instead of Python lists for numerical data, reusing buffers instead of allocating per-request, and pre-allocating output buffers. The sketch after this list illustrates the first of these swaps.
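
A sketch of point 4, assuming NumPy is available; the exact counts vary by Python and NumPy version, but the gap - thousands of GC-tracked allocations versus a handful - is the point:

import gc
import numpy as np  # assumption: NumPy is installed in the serving environment

def feature_lists(n_rows):
    # One Python list per row plus one float object per value; every list
    # allocation bumps the gen0 counter and is later scanned by the cyclic GC.
    return [[float(j) for j in range(20)] for _ in range(n_rows)]

def feature_array(n_rows):
    # A single Python object wrapping one contiguous C buffer.
    return np.zeros((n_rows, 20), dtype=np.float64)

gc.disable()  # keep automatic collection from resetting the counter mid-measurement
try:
    gc.collect()
    before = gc.get_count()[0]
    rows = feature_lists(5_000)
    print(f"list-of-lists: ~{gc.get_count()[0] - before} tracked allocations")

    gc.collect()
    before = gc.get_count()[0]
    arr = feature_array(5_000)
    print(f"numpy array:   ~{gc.get_count()[0] - before} tracked allocations")
finally:
    gc.enable()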


Production Engineering Notes

GC Strategy for ML Serving

The strategy depends on the serving pattern:

import gc
import time
import threading
from contextlib import contextmanager

@contextmanager
def gc_paused():
    """Context manager that disables GC for the duration of a block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

class OptimizedMLServer:
    """
    ML server with GC tuning for low-latency serving.

    Strategy:
    1. Disable GC during request processing
    2. Collect in background thread during idle periods
    3. Force full collection after model reloads
    4. Use weakrefs for model cache to allow GC when needed
    """
    def __init__(self):
        # Tune thresholds: higher gen0 means less frequent auto-collection
        gc.set_threshold(10000, 10, 10)
        self._request_count = 0
        self._last_gc_time = time.time()
        self._gc_interval = 30.0  # seconds between background collections

        # Start background GC thread
        self._gc_thread = threading.Thread(
            target=self._background_gc,
            daemon=True
        )
        self._gc_thread.start()

    def handle_request(self, request):
        """Process request with GC disabled to avoid pauses."""
        with gc_paused():
            return self._process(request)

    def _process(self, request):
        # Actual model inference happens here
        time.sleep(0.001)  # simulate 1ms inference
        return {"result": "ok"}

    def _background_gc(self):
        """Run GC during idle periods in a background thread."""
        while True:
            time.sleep(5)  # check every 5 seconds
            now = time.time()
            if now - self._last_gc_time > self._gc_interval:
                collected = gc.collect()
                self._last_gc_time = now
                if collected > 0:
                    print(f"Background GC: collected {collected} objects")

# Usage
server = OptimizedMLServer()

# Simulate request handling
latencies = []
for i in range(1000):
    start = time.perf_counter()
    server.handle_request({"input": [float(j) for j in range(512)]})
    latency = (time.perf_counter() - start) * 1000
    latencies.append(latency)

import statistics
print(f"\nLatency statistics (ms):")
print(f" Median: {statistics.median(latencies):.2f}")
print(f" P95: {sorted(latencies)[950]:.2f}")
print(f" P99: {sorted(latencies)[990]:.2f}")
print(f" Max: {max(latencies):.2f}")

GC for Training Loops

Training loops have different requirements: throughput matters more than latency, but GC pauses during a training step add up across millions of steps:

import gc
import time

def train_epoch_with_gc(model_fn, data, epochs=5):
    """Training with explicit GC management."""
    for epoch in range(epochs):
        # Disable GC during the epoch for consistent step times
        gc.disable()

        step_times = []
        for step, batch in enumerate(data):
            step_start = time.perf_counter()
            loss = model_fn(batch)
            step_times.append(time.perf_counter() - step_start)

        # Collect at epoch boundary: no user is waiting
        gc.enable()
        collected = gc.collect()

        avg_step = sum(step_times) / len(step_times) * 1000
        print(f"Epoch {epoch+1}: avg step={avg_step:.2f}ms, "
              f"gc collected={collected}")

# Simulated training
def fake_model(batch):
    # Creates intermediate Python objects
    activations = [[float(j) for j in range(100)] for _ in range(32)]
    loss = sum(sum(row) for row in activations)
    return loss

fake_data = [{"x": list(range(100))} for _ in range(50)]
train_epoch_with_gc(fake_model, fake_data)
warning

Do not leave GC disabled indefinitely in a long-running training loop. Cyclic garbage will accumulate without bound until gc.collect() is called. If your training loop creates reference cycles (common with model callbacks, custom schedulers, and logging hooks), you need to call gc.collect() at regular intervals - at epoch boundaries or every N steps.

Monitoring Memory Growth with tracemalloc

For diagnosing long-running memory growth, periodic snapshot comparison is the most effective tool:

import tracemalloc
import gc
import time

class MemoryGrowthMonitor:
    """Monitors memory growth by comparing tracemalloc snapshots."""

    def __init__(self, top_n=10):
        self.top_n = top_n
        self._baseline = None
        tracemalloc.start(25)  # 25-frame tracebacks

    def set_baseline(self):
        """Call after system warm-up, before the workload you want to monitor."""
        gc.collect()
        self._baseline = tracemalloc.take_snapshot()
        print("Baseline snapshot taken")

    def report_growth(self, label="current"):
        """Compare current state against baseline."""
        gc.collect()
        current = tracemalloc.take_snapshot()

        if self._baseline is None:
            print("No baseline set. Call set_baseline() first.")
            return

        stats = current.compare_to(self._baseline, 'traceback')
        print(f"\nMemory growth since baseline ({label}):")
        for stat in stats[:self.top_n]:
            if stat.size_diff > 1024:  # only show >1 KB growth
                print(f" +{stat.size_diff/1024:.1f} KB: {stat.traceback.format()[0]}")

        total = sum(s.size_diff for s in stats)
        print(f" Total: {total/1024/1024:.2f} MB")

    def stop(self):
        tracemalloc.stop()

# Usage in a serving context
monitor = MemoryGrowthMonitor()

# Warm up
for i in range(10):
    _ = [{"x": list(range(100))} for _ in range(1000)]
gc.collect()

monitor.set_baseline()

# Simulate 1 hour of request handling (compressed)
for minute in range(60):
    for _ in range(100):
        _ = [{"x": list(range(100))} for _ in range(100)]

    if minute % 10 == 0:
        monitor.report_growth(f"minute {minute}")

monitor.stop()

Interview Q&A

Q1: How does CPython's reference counting work, and what is its main limitation?

Every CPython object has an ob_refcnt field (a C Py_ssize_t). Every time a new reference to the object is created (assignment, function argument, list append, etc.), the count is incremented. Every time a reference is destroyed (variable goes out of scope, del, reassignment, list element deleted), the count is decremented. When the count reaches zero, the object's memory is freed immediately - no waiting for a GC cycle.

The main limitation is that reference counting cannot collect reference cycles. If object A holds a reference to B and B holds a reference to A, both counts are at least 1 even when no other part of the program can reach either object. CPython has a supplementary cyclic garbage collector (the gc module) to handle this case.

Q2: What is the generational hypothesis, and why does it improve GC performance?

The generational hypothesis observes that in real programs, most objects die very young - they are created during a function call and freed when the function returns. Only a small fraction of objects live long enough to persist across many GC cycles. This hypothesis, confirmed by empirical measurement of many Smalltalk and Lisp programs in the 1980s, motivates dividing objects into generations. New objects go into generation 0. Objects that survive a gen0 collection move to generation 1. Generation 0 is collected frequently (every 700 allocations in CPython's default config), generation 1 less frequently, and generation 2 (the oldest) rarely. This reduces GC cost because: (1) most garbage is found in gen0 collections which are cheap (small working set), and (2) long-lived objects that are unlikely to be garbage are examined less often.
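
You can see this reflected in CPython's own counters. gc.get_stats() reports, per generation, how many collections have run and how many objects they freed; in a typical long-running process gen0 shows far more collections than gen2:

import gc

# Per-generation statistics accumulated since interpreter start.
for gen, stats in enumerate(gc.get_stats()):
    print(f"gen{gen}: collections={stats['collections']}, "
          f"collected={stats['collected']}, uncollectable={stats['uncollectable']}")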

Q3: Why do GC pauses affect all concurrent requests, not just the one that triggered GC?

CPython uses a Global Interpreter Lock (GIL) - only one thread executes Python bytecode at a time. The GC runs with the GIL held (it must, to safely traverse object graphs). When the GC runs, all Python threads are effectively paused because they are waiting for the GIL. In an async serving framework (like uvicorn/asyncio), the event loop runs in a single thread, so a GC pause stops all in-flight request handlers simultaneously. A 5 ms GC pause adds 5 ms to the latency of every request that was being processed in that moment - which in a high-concurrency system could be hundreds of requests.
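
A self-contained sketch of that effect using asyncio (the coroutine names are illustrative and the numbers vary by machine): one task builds cyclic garbage and then runs a blocking gc.collect(), and an unrelated task's 10 ms timer observes the stall:

import asyncio
import gc
import time

async def ticker(lateness):
    # Measures event-loop responsiveness: how late each 10 ms tick fires.
    for _ in range(30):
        start = time.perf_counter()
        await asyncio.sleep(0.01)
        lateness.append((time.perf_counter() - start - 0.01) * 1000)

async def allocator():
    # Build cyclic garbage, yielding often so the ticker keeps running,
    # then run one blocking collection on the single event-loop thread.
    junk = []
    for i in range(200_000):
        a = {'ref': None}
        b = {'ref': a}
        a['ref'] = b
        junk.append(a)
        if i % 2_000 == 0:
            await asyncio.sleep(0)
    del junk
    gc.collect()  # every in-flight coroutine waits while this runs

async def main():
    lateness = []
    await asyncio.gather(ticker(lateness), allocator())
    print(f"Worst tick lateness: {max(lateness):.2f} ms")

asyncio.run(main())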

Q4: When should you use weakref.ref vs a strong reference, and what happens when the referent is collected?

Use a weak reference when you want to hold a pointer to an object without preventing that object from being garbage collected. The canonical use case is a cache: you want to reuse an expensive object if it is still alive, but you do not want the cache itself to be the reason the object stays alive. When the last strong reference to an object is dropped and the reference count hits zero, the object is freed and all weakref.ref objects pointing to it are set to return None on the next call. Code using a weak reference must always check if weak_ref() returns None before using it. weakref.WeakValueDictionary automates this - entries are automatically removed when their values are collected, preventing the dict from growing indefinitely as a memory leak.

Q5: What makes del dangerous in Python, and what is the correct alternative?

__del__ is dangerous for three reasons. First, the timing of __del__ calls is not guaranteed - in CPython it fires immediately when the reference count hits zero, but in the presence of cycles it may fire during GC, after an arbitrary delay, or (in Python 3.3 and earlier) never if the object is in a cycle with other __del__ objects. Second, the state of the interpreter during __del__ is undefined - by the time module-level __del__ finalizers run during interpreter shutdown, global variables may already be None. Third, exceptions raised in __del__ are silently ignored (they print to stderr but do not propagate). The correct alternative is to use context managers (with statements and __enter__/__exit__) for deterministic resource cleanup. For objects that genuinely need cleanup but may be used without a with statement, add an explicit close() method and document that it must be called.

Q6: How would you diagnose a gradual memory leak in a Python ML serving process?

Four-step procedure. Step 1: establish a baseline. Use tracemalloc.start() and take a snapshot after the service has warmed up (handled a few hundred requests) so that initial imports and lazily-initialized caches have stabilized. Step 2: run the service under realistic load for 10-30 minutes. Step 3: take a second snapshot and compare with snapshot2.compare_to(snapshot1, 'lineno'). Look for allocations that grew monotonically, especially those pointing to module-level data structures, cache objects, or callback registrations. Step 4: use objgraph.show_most_common_types() and objgraph.show_growth() to identify which types are accumulating, then use objgraph.show_backrefs() on sample objects to trace what is holding references to them. Common culprits: module-level lists used as logs, callback registrations that accumulate without cleanup, unclosed file handles in DataLoader workers, and exception tracebacks stored in error reporting structures.

Q7: How does gc.disable() interact with reference counting?

gc.disable() only disables the cyclic garbage collector - the supplementary collector that handles reference cycles. It does not affect reference counting at all. Reference counting continues to operate normally: objects with zero reference counts are still freed immediately. What accumulates when GC is disabled is cyclic garbage - objects in reference cycles that reference counting cannot collect. If your code does not create reference cycles (no circular object references), disabling GC has no effect on memory - reference counting handles everything. In practice, Python programs do create some cycles (especially with closures, class instances with callbacks, and exception tracebacks), so disabling GC should be paired with periodic manual gc.collect() calls to prevent cyclic garbage from accumulating indefinitely.
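
A small demonstration of the distinction - with the cyclic collector disabled, an acyclic object is still freed the instant its last reference disappears, while a cycle lingers until a manual collect:

import gc

class Probe:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        print(f"freed: {self.name}")

gc.disable()
try:
    p = Probe("acyclic")
    del p                       # prints "freed: acyclic" immediately - pure refcounting

    a, b = Probe("cycle-a"), Probe("cycle-b")
    a.other, b.other = b, a     # reference cycle
    del a, b                    # nothing printed: refcounts never reach zero
    print("cycle still uncollected")
    gc.collect()                # manual collection frees both and runs their __del__
finally:
    gc.enable()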
