Garbage Collection - Generational GC, Cycle Detection, and Memory Leak Diagnosis

Reading time: ~30 minutes | Level: Intermediate → Engineering

Before reading further, predict the output:

import gc

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

a = Node(1)
b = Node(2)
a.next = b
b.next = a   # cycle

del a, b
print(gc.collect())  # ?

Show Answer

Output: 4 (on CPython 3.10 and earlier; CPython 3.11+ typically prints 2)

Most engineers expect 2 - there are two Node objects in the cycle. But gc.collect() returns the count of all unreachable objects that were collected, which on CPython 3.10 and earlier includes:

The Node(1) instance
The Node(2) instance
The dict of Node(1).__dict__ (instance attribute dictionary)
The dict of Node(2).__dict__ (instance attribute dictionary)

Each Python class instance stores its attributes in a __dict__. Those dicts are themselves container objects tracked by the GC. Both dicts are part of the cycle:

Node(1) → __dict__ → {"val": 1, "next": Node(2)}
Node(2) → __dict__ → {"val": 2, "next": Node(1)}

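On CPython 3.10 and earlier the cycle involves 4 container objects, so gc.collect() returns 4. CPython 3.11 and later create the instance __dict__ lazily, so the same script typically reports only the 2 Node objects. Understanding this requires knowing which objects the GC tracks and how cycle detection counts its results.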

Reference counting is CPython's primary memory management mechanism (Lesson 05), but it has a fundamental blind spot: reference cycles. Two objects that reference each other will never have their ob_refcnt reach zero, even when the rest of the program has forgotten about them entirely. Python's cyclic garbage collector exists to handle exactly this case. Understanding it - its generations, its thresholds, its collection algorithm, and its production knobs - is essential for writing leak-free long-running Python services.

What You Will Learn

Why reference counting is insufficient: the reference cycle problem
CPython's cyclic GC: generational, three generations (gen 0, 1, 2)
How cycle detection works: the mark-and-sweep algorithm for container objects
The gc module: collect(), get_count(), get_threshold(), set_threshold()
gc.disable() and gc.enable(): when and how to disable the GC safely
__del__ and the finalizer resurrection problem - and how PEP 442 solved it
gc.get_referrers() and gc.get_referents() for debugging memory leaks
gc.freeze() (Python 3.7+): why Instagram added it and why pre-fork servers such as uWSGI and Gunicorn benefit from calling it before forking
Memory leak patterns: event listeners, class attributes, QuerySet caching
tracemalloc: preview of the full memory profiling lesson

Prerequisites

Lesson 05: Reference Counting - you need to understand ob_refcnt and why cycles defeat it
Lesson 01: CPython Architecture - Python objects as C structs
Familiarity with gc module basics

Part 1 - Why Reference Counting Is Not Enough

The Cycle Problem Revisited

As established in Lesson 05: when two objects reference each other, their ob_refcnt fields never reach zero, even after all external references are deleted:

import sys, gc

gc.disable()   # prevent GC from interfering with this demonstration

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

a = Node(1)
b = Node(2)
a.next = b
b.next = a

# External references: 'a' and 'b' names
print(sys.getrefcount(a) - 1)   # 2: name 'a' + b.next
print(sys.getrefcount(b) - 1)   # 2: name 'b' + a.next

del a, b
# External references removed - but cycles remain:
# Node(1).__dict__["next"] → Node(2) - refcount still 1
# Node(2).__dict__["next"] → Node(1) - refcount still 1
# Both nodes are UNREACHABLE but UNFREE-able by reference counting

# Only the GC can detect and collect these
print(gc.collect())   # 4 on CPython <= 3.10 (both nodes + their __dict__s); typically 2 on 3.11+

gc.enable()

What Objects the GC Tracks

The cyclic GC only tracks container objects - objects that can hold references to other objects:

list, tuple, dict, set, frozenset
Class instances (anything with __dict__)
Custom classes that implement tp_traverse in C
function, code, frame objects

The GC does not track:

int, float, bool, complex - scalar values cannot form cycles
str, bytes, bytearray - hold raw character or byte data, not references to other objects
None, True, False - singletons

This distinction is critical: these atomic objects are always freed immediately by reference counting; the GC only needs to handle containers.
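
The split between tracked and untracked objects can be checked directly with gc.is_tracked(). The snippet below is a minimal sketch; the Node class and the sample values are illustrative only. Dicts and tuples are deliberately left out because CPython may untrack ones that hold only atomic values, and that behaviour has varied between versions.

```python
import gc


class Node:
    """Illustrative empty class; any user-defined class behaves the same way."""


samples = {
    "int": 42,
    "float": 3.14,
    "str": "hello",
    "list": [1, 2, 3],
    "set": {1, 2, 3},
    "instance": Node(),
}

# gc.is_tracked() reports whether the cyclic collector currently monitors an object.
tracked = {label: gc.is_tracked(value) for label, value in samples.items()}
for label, flag in tracked.items():
    print(f"{label:10} tracked={flag}")
```

Atomic values report False, while lists, sets, and class instances report True.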

Part 2 - Generational Collection

The Generational Hypothesis

The cyclic GC uses a generational strategy based on the empirical observation that:

Most objects die young. Objects that survive multiple GC cycles are likely to survive many more.

This is the generational hypothesis, first described for Lisp and Smalltalk in the 1980s and now used by virtually every production garbage collector.

CPython implements three generations:

Generation	Label	Objects	Collected when
0	Young	Newly allocated containers	Most frequently
1	Middle	Survived one gen 0 collection	Less frequently
2	Old	Long-lived objects	Least frequently

Generation Thresholds

import gc

# Default thresholds: (700, 10, 10)
print(gc.get_threshold())   # (700, 10, 10)

# Interpretation:
# - Collect gen 0 when (allocations - deallocations) > 700
# - Collect gen 1 after gen 0 has been collected 10 times
# - Collect gen 2 after gen 1 has been collected 10 times
# Default: gen 2 becomes eligible roughly every 700 * 10 * 10 = 70,000 net allocations

The threshold (700, 10, 10) means:

Gen 0: collected when the number of tracked allocations minus deallocations since the last collection exceeds 700
Gen 1: collected every 10th time gen 0 is collected
Gen 2: collected every 10th time gen 1 is collected
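
When experimenting with thresholds, it helps to restore the previous values afterwards. The helper below is a sketch, not part of the gc module: gc_thresholds is a hypothetical name, and the values 5000, 20, 20 are arbitrary.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_thresholds(threshold0, threshold1, threshold2):
    """Temporarily replace the collection thresholds, restoring them on exit."""
    previous = gc.get_threshold()
    gc.set_threshold(threshold0, threshold1, threshold2)
    try:
        yield previous
    finally:
        gc.set_threshold(*previous)


before = gc.get_threshold()
with gc_thresholds(5000, 20, 20):
    inside = gc.get_threshold()   # gen 0 now triggers far less often
after = gc.get_threshold()
print(before, inside, after)
```

Because the old thresholds are restored in a finally block, an exception inside the with body does not leave the process with modified settings.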

Observing Generation Counts

import gc

# gc.get_count() returns (count0, count1, count2):
#   count0 - tracked allocations minus deallocations since the last gen 0 collection
#   count1 - gen 0 collections since the last gen 1 collection
#   count2 - gen 1 collections since the last gen 2 collection
print(gc.get_count())   # e.g., (42, 3, 1)

# Allocate containers and keep them alive so count0 grows
keep = [[i for i in range(10)] for _ in range(100)]
print(gc.get_count())   # count0 is higher (unless a collection just reset it)

# Force a collection
collected = gc.collect(0)   # collect only gen 0
print(f"Gen 0 collected: {collected} objects")
print(gc.get_count())

collected = gc.collect(1)   # collect gen 0 + gen 1
print(f"Gen 1 collected: {collected} objects")

collected = gc.collect(2)   # full collection (all generations)
print(f"Full collection: {collected} objects")

Part 3 - How Cycle Detection Works

The Mark-and-Sweep Algorithm

CPython's cycle detector uses a modified mark-and-sweep algorithm. Here is how it works conceptually:

Step 1: Copy refcounts. For every tracked object in the generation being collected, copy ob_refcnt into a working field (gc_refs). This copy is modified during detection, so the real reference counts are untouched.

Step 2: Subtract internal references. For every object, traverse the references it holds (via tp_traverse) and decrement gc_refs on each referenced object that is also in the set being collected. After this step, gc_refs counts only external references: references from outside that set, such as local variables, module globals, or objects in older generations.

Step 3: Mark candidates. Objects with gc_refs > 0 are definitely reachable. Objects with gc_refs == 0 are only tentatively unreachable, because something reachable may still point to them.

Step 4: Rescue reachable objects. Starting from every object with gc_refs > 0, follow references (again via tp_traverse) and move everything reached back into the reachable set. Whatever remains is referenced only from inside unreachable cycles.

Step 5: Finalize and break the cycles. For the remaining objects, the collector clears weak references, runs finalizers, and then calls tp_clear to drop the references that form the cycles. Reference counts then fall to zero and ordinary deallocation (tp_dealloc) frees the objects.
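
The refcount-subtraction idea can be simulated in a few lines of pure Python. This is a toy model under stated assumptions, not CPython's implementation: objects are just names, find_unreachable is a hypothetical helper, and the refcounts are supplied by hand.

```python
def find_unreachable(refcounts, edges):
    """Simulate the refcount-subtraction pass on a toy object graph.

    refcounts: {name: total reference count, including external references}
    edges:     {name: [names this object references]}
    Both arguments describe only the objects in the set being collected.
    """
    # Work on a copy so the "real" refcounts are untouched.
    gc_refs = dict(refcounts)

    # Subtract every reference that originates inside the set.
    for source, targets in edges.items():
        for target in targets:
            if target in gc_refs:
                gc_refs[target] -= 1

    # Anything with references left over is held from outside: reachable.
    reachable = {name for name, refs in gc_refs.items() if refs > 0}

    # Rescue everything that can be reached from those objects.
    stack = list(reachable)
    while stack:
        for target in edges.get(stack.pop(), []):
            if target in gc_refs and target not in reachable:
                reachable.add(target)
                stack.append(target)

    # The remainder is garbage (finalizing and clearing is not simulated).
    return set(gc_refs) - reachable


# a <-> b is an orphaned cycle; c is held by a variable and points to d.
refcounts = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
print(sorted(find_unreachable(refcounts, edges)))
```

Note that d ends the subtraction pass with a working count of zero, yet it survives because it is reachable from c. This is why a zero count alone does not prove an object is garbage.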

Why the Count Is 4, Not 2

This explains the opening puzzle: on CPython 3.10 and earlier, gc.collect() returned 4 because the cycle involves four container objects - the two Node instances and their two __dict__ instances. Each instance's attributes are stored in a dict, which is itself a tracked container participating in the cycle. On CPython 3.11+ the instance __dict__ is created lazily, so the same cycle typically counts as 2.

Part 4 - The gc Module API

Core Functions

import gc

# --- Collection ---

# Collect all generations (equivalent to gc.collect(2))
n = gc.collect()
print(f"Collected {n} objects")

# Collect specific generation (0, 1, or 2)
gc.collect(0)   # gen 0 only (fastest)
gc.collect(1)   # gen 0 + gen 1
gc.collect(2)   # all generations (slowest, most thorough)

# --- Inspection ---

# Collection counters: net allocations since the last gen 0 collection,
# then gen 0 collections since gen 1 ran, then gen 1 collections since gen 2 ran
print(gc.get_count())           # (count0, count1, count2)

# Current collection thresholds
print(gc.get_threshold())       # (700, 10, 10) by default

# Set thresholds
gc.set_threshold(1000, 15, 15)  # less frequent collection

# --- Control ---

gc.disable()   # stop automatic collection
gc.enable()    # resume automatic collection
print(gc.isenabled())   # True/False

# --- Statistics (Python 3.4+) ---
stats = gc.get_stats()
# Returns list of 3 dicts, one per generation:
# [{'collections': N, 'collected': N, 'uncollectable': N}, ...]
for gen, stat in enumerate(stats):
    print(f"Gen {gen}: {stat}")

gc.get_referrers and gc.get_referents

These are the most powerful tools for debugging memory leaks:

import gc

class LeakyService:
    instances = []   # class-level list - dangerous!

    def __init__(self, name):
        self.name = name
        LeakyService.instances.append(self)   # keeps all instances alive forever

svc1 = LeakyService("auth")
svc2 = LeakyService("db")

# Who is referencing svc1?
referrers = gc.get_referrers(svc1)
for r in referrers:
    print(f"Referrer type: {type(r).__name__}")
    if isinstance(r, list):
        print(f"  List id={id(r)}, len={len(r)}")
    elif isinstance(r, dict):
        keys = [k for k, v in r.items() if v is svc1]
        print(f"  Dict keys pointing to svc1: {keys}")
# Will show: LeakyService.instances list is holding svc1

# What does svc1 refer to?
referents = gc.get_referents(svc1)
for ref in referents:
    print(f"Referent: {type(ref).__name__}: {ref!r}")
# Will show: svc1's __dict__, svc1's __class__, etc.
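
In practice the raw referrer list is noisy, so a small summarizing helper is handy. The function below is a sketch; summarize_referrers, Session, registry, and cache are hypothetical names used only for illustration.

```python
import gc


def summarize_referrers(obj):
    """Return short descriptions of the containers that reference obj."""
    summary = []
    for referrer in gc.get_referrers(obj):
        if isinstance(referrer, dict):
            keys = [k for k, v in referrer.items() if v is obj]
            summary.append(f"dict keys={keys}")
        elif isinstance(referrer, (list, tuple, set, frozenset)):
            summary.append(f"{type(referrer).__name__} len={len(referrer)}")
        else:
            summary.append(type(referrer).__name__)
    return summary


class Session:
    """Stand-in for an object suspected of leaking."""


session = Session()
registry = [session]             # one holder: a list
cache = {"current": session}     # another holder: a dict
report = summarize_referrers(session)
print(report)
```

The report also includes the namespace dict that binds the name session, and possibly interpreter-internal objects such as frames or cells, which is expected.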

:::tip Use gc.get_threshold() to Understand When Collections Trigger

In batch workloads (data pipelines, one-shot scripts), you may want to tune GC thresholds. Increase thresholds to collect less frequently (lower CPU overhead, higher peak memory). Decrease to collect more aggressively (lower peak memory, higher CPU overhead). Measure both before changing defaults.

import gc

# For a batch job that allocates many short-lived objects:
# Raise threshold to reduce GC overhead during the hot path
gc.set_threshold(10000, 20, 20)
run_batch_job()
gc.collect()   # force full collection at end

:::

Part 5 - When to Disable the GC

gc.disable() in Production

Disabling the GC is legitimate in specific scenarios:

import gc

# Pattern 1: Disable during performance-critical initialization
gc.disable()
try:
    # Build large data structures - no GC interruptions
    data = {i: [j for j in range(100)] for i in range(10000)}
    index = build_search_index(data)
finally:
    gc.enable()
    gc.collect()   # clean up any cycles created during initialization

# Pattern 2: Long-running batch job with no cycles (all objects are short-lived)
# Reference counting handles everything - GC overhead adds up over millions of allocations
gc.disable()
try:
    for record in massive_dataset:
        result = process(record)   # record and result have no cycles
        write(result)
finally:
    gc.enable()      # re-enable even if processing raises
    gc.collect()
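
Both patterns can be wrapped in a reusable context manager that restores whatever state the GC was in before. This is a sketch; gc_paused is a hypothetical helper, not a gc module function.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused(collect_on_exit=True):
    """Disable automatic collection for a block, then restore the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        if collect_on_exit:
            gc.collect()   # reclaim any cycles created while paused


state_before = gc.isenabled()
with gc_paused():
    state_inside = gc.isenabled()
    data = {i: list(range(10)) for i in range(1000)}   # allocation-heavy work
state_after = gc.isenabled()
print(state_before, state_inside, state_after)
```

Recording the prior state matters when the helper is used inside code that has already disabled the GC: the block must not silently re-enable it.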

Instagram's Approach: gc.disable() at Startup

Instagram's engineering team described disabling the GC in their Django/uWSGI stack (in the engineering blog post "Dismissing Python Garbage Collection at Instagram"):

Before forking workers, they call gc.disable() and gc.collect() to ensure a clean heap
Worker processes never run GC automatically - they rely on reference counting only
This reduced CPU overhead significantly for their workload (mostly short-lived request objects with no cycles)
Periodically they restart workers, which naturally clears any leaked memory

This is only safe if your workload genuinely produces no reference cycles (or you accept that cyclic garbage accumulates until process restart).

:::danger gc.disable() in Production Without Understanding Your Memory Patterns Causes Unbounded Growth

Disabling the GC means cyclic garbage accumulates for the life of the process. In a web server handling requests that create Django ORM objects, each request can create reference cycles (for example, objects that hold back-references to each other, or an exception stored where its own traceback can reach it). With the GC disabled, these are never collected, and memory grows until the process is restarted or killed.

Before disabling the GC, verify with tracemalloc (Lesson 07) that your workload does not create significant cyclic garbage. :::

:::note Immutable Objects Cannot Form Cycles - The GC Ignores Them

int, float, str, bytes, bool, and None cannot hold references to other objects, so they cannot form reference cycles and CPython does not register them with the cyclic GC. Tuples that contain only such objects are untracked once the collector has examined them. All of these are freed immediately when their ob_refcnt reaches zero, with no GC overhead. If your workload is dominated by these types, the cyclic GC's overhead is minimal regardless of threshold settings. :::

Part 6 - __del__ and the Finalizer Problem

The Resurrection Problem (Python < 3.4)

Before Python 3.4, objects with __del__ methods that were part of reference cycles were problematic:

import gc

# The problem in Python < 3.4 (historical, for understanding)
class Problematic:
    def __del__(self):
        # What if __del__ stores 'self' somewhere?
        # The GC would have to collect it, but what if __del__ revives it?
        global resurrected
        resurrected = self   # "resurrection" - object escapes deletion!

a = Problematic()
b = Problematic()
a.other = b
b.other = a   # cycle

del a, b
# Pre-3.4: CPython adds these to gc.garbage - uncollectable!
# gc.collect() returns 0 - the objects are NOT freed
# gc.garbage is a list of uncollectable objects
print(gc.garbage)   # [<Problematic object>, <Problematic object>]

PEP 442: Safe Object Finalization (Python 3.4+)

PEP 442 (implemented in Python 3.4) resolved the finalizer problem:

The GC identifies the unreachable set (objects only reachable from each other)
For each unreachable object with __del__, the GC calls __del__ before freeing memory
After all finalizers run, the GC checks if any objects were "resurrected" (new strong references created)
Objects that were not resurrected are freed
Objects that were resurrected survive (their refcount is now > 0 again)

import gc

class SafeFinalizer:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        print(f"__del__ called on {self.name}")

# Modern Python (3.4+): __del__ is called safely even in cycles
a = SafeFinalizer("A")
b = SafeFinalizer("B")
a.other = b
b.other = a   # cycle

del a, b
gc.collect()
# Output (the order of the two lines is not guaranteed):
# __del__ called on B
# __del__ called on A
# Both freed correctly - no gc.garbage accumulation
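
The resurrection rule can be observed directly. The sketch below uses a hypothetical Phoenix class whose __del__ stores self in a module-level list. The behaviour shown is CPython's; the language reference leaves it implementation-dependent whether __del__ runs a second time.

```python
import gc

survivors = []


class Phoenix:
    finalizer_calls = 0

    def __del__(self):
        Phoenix.finalizer_calls += 1
        survivors.append(self)   # resurrect: create a new strong reference


a, b = Phoenix(), Phoenix()
a.other, b.other = b, a   # cycle
del a, b

gc.collect()                              # finalizers run, both objects resurrect
calls_after_first = Phoenix.finalizer_calls
alive_after_first = len(survivors)

survivors.clear()                         # drop the resurrecting references
gc.collect()                              # cycle is garbage again and is freed
calls_after_second = Phoenix.finalizer_calls
print(calls_after_first, alive_after_first, calls_after_second)
```

On CPython the finalizer count stays at 2 after the second collection: each object's __del__ runs once, and the second pass frees the objects without calling it again.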

Checking gc.garbage

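gc.garbage is a list where CPython puts objects the collector found unreachable but could not free. Since Python 3.4 it is normally empty: the remaining sources are C extension types that still use the legacy tp_del finalizer slot inside a cycle, and the gc.DEBUG_SAVEALL debug flag, which makes the collector store every unreachable object there instead of freeing it: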

import gc

print(gc.garbage)   # should be [] in well-written code

# If gc.garbage is non-empty, you have uncollectable objects
if gc.garbage:
    print(f"WARNING: {len(gc.garbage)} uncollectable objects")
    for obj in gc.garbage:
        print(f"  {type(obj).__name__}: {repr(obj)[:100]}")
    # Decide: clear it (losing the objects) or investigate
    gc.garbage.clear()

:::warning Objects with __del__ in Python < 3.4 Are Uncollectable If in a Cycle

If you work with code written for Python < 3.4, or with C extensions using the old finalizer protocol, avoid __del__ on objects that participate in cycles. On Python 3.4+ you can use weakref.finalize instead - it does not block cycle collection.

import weakref

class LegacySafeResource:
    def __init__(self, resource):
        self._resource = resource
        # weakref.finalize does NOT block GC of LegacySafeResource
        self._finalizer = weakref.finalize(self, cleanup, resource)

def cleanup(resource):
    resource.close()

:::
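
To see weakref.finalize working on objects that sit in a cycle, here is a small runnable sketch. Resource, release, and closed are hypothetical names; release stands in for real cleanup such as closing a file or socket.

```python
import gc
import weakref

closed = []


def release(name):
    # Stand-in for real cleanup work.
    closed.append(name)


class Resource:
    def __init__(self, name):
        self.name = name
        # Pass only what cleanup needs; never pass self or a bound method,
        # because that would keep the object alive forever.
        self._finalizer = weakref.finalize(self, release, name)


a, b = Resource("A"), Resource("B")
a.peer, b.peer = b, a   # cycle
del a, b

gc.collect()
print(sorted(closed))
```

The callbacks fire during gc.collect() even though neither object's reference count could reach zero on its own.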

Part 7 - gc.freeze() for Fork-Based Servers

The Copy-on-Write Problem

Many Python web servers (uWSGI, Gunicorn) use os.fork() to create worker processes. Fork copies the parent's memory pages to the child, but modern OS kernels use copy-on-write (COW) - pages are only physically copied when written.

If the GC runs in a worker process and touches many objects (incrementing/decrementing shadow refcount fields during traversal), it triggers writes to pages that were previously shared with the parent. This causes COW pages to be copied, increasing memory usage per worker process.

# In a Gunicorn/uWSGI server:
# 1. Parent process loads Django: 100MB of module-level objects in gen 2
# 2. Fork 8 workers
# 3. If GC runs in workers (scanning gen 2), it writes to gen 2's gc_refs
# 4. COW: those pages are now private per-worker - 100MB × 8 workers = 800MB
# 5. Without GC touching gen 2: pages stay shared - 100MB + small per-worker delta

gc.freeze(): Protect Gen 2 from GC Traversal

gc.freeze() (Python 3.7+) moves all objects currently in all three generations into a permanent generation that is never traversed by the cyclic GC:

import gc

# Before forking: move all current objects to "frozen" generation
# These are module-level objects: Django models, URL patterns, middleware, etc.
gc.freeze()

# Now fork workers
# In each worker: the GC only tracks NEW objects (those created after fork)
# Frozen gen 2 objects are never traversed - COW pages stay shared

# After workers start:
# gc.get_freeze_count() shows how many objects are frozen
print(gc.get_freeze_count())  # e.g., 50000

# To unfreeze (rarely needed):
# gc.unfreeze()

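This technique comes from Instagram, whose engineers contributed gc.freeze() to CPython, and it is used by other high-traffic Python services that run pre-fork servers. The savings depend on how much memory the parent loads before forking and on the worker count; with tens of workers sharing a large preloaded application, they can reach hundreds of megabytes.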

# Typical Gunicorn pre-fork hook pattern:
# gunicorn_config.py
import gc

preload_app = True   # load the application in the master so its objects exist before fork

def when_ready(server):
    """Called in the master once the server has started, before workers are forked."""
    gc.freeze()
    server.log.info(f"GC frozen: {gc.get_freeze_count()} objects protected from COW")
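
The freeze and unfreeze calls can be tried without a server. The sketch below follows the sequence suggested in the gc module documentation (disable in the parent, freeze right before fork, enable in the child) but stops short of forking; app_state is a made-up stand-in for objects created while importing an application.

```python
import gc

was_enabled = gc.isenabled()
gc.disable()            # parent: no automatic collection while the app loads

# Stand-in for application objects; lists are always tracked by the GC.
app_state = [[i] for i in range(1000)]

gc.freeze()             # right before fork: park every tracked object
frozen = gc.get_freeze_count()
# os.fork() would go here; each child would call gc.enable() early on.

gc.unfreeze()           # demo only: undo the freeze so the script leaves no trace
remaining = gc.get_freeze_count()
if was_enabled:
    gc.enable()
print(frozen >= 1000, remaining)
```

In a real pre-fork server the unfreeze step is normally omitted, since the whole point is that workers never traverse the parent's objects.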

Part 8 - Memory Leak Patterns

Pattern 1: Global Containers That Accumulate

# Classic leak: class-level container that grows unboundedly
class RequestHandler:
    _all_handlers = []   # class-level - holds ALL instances ever created

    def __init__(self, request_id):
        self.request_id = request_id
        RequestHandler._all_handlers.append(self)   # strong reference forever!

    def handle(self):
        return f"Handling {self.request_id}"

# In a web server: each request creates a RequestHandler
# They accumulate in _all_handlers and are NEVER freed
# Memory grows linearly with request count - server eventually OOMs

# Fix: use WeakValueDictionary or don't store in class-level container
import weakref

class RequestHandler:
    _all_handlers = weakref.WeakValueDictionary()

    def __init__(self, request_id):
        self.request_id = request_id
        RequestHandler._all_handlers[request_id] = self
        # freed when no other references hold the handler
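
A quick way to confirm that a WeakValueDictionary does not keep its values alive is to watch its length. This is a minimal sketch with a hypothetical Handler class and registry name; the immediate cleanup relies on CPython's reference counting.

```python
import weakref


class Handler:
    def __init__(self, request_id):
        self.request_id = request_id


registry = weakref.WeakValueDictionary()
handler = Handler("req-1")
registry[handler.request_id] = handler
count_while_alive = len(registry)

del handler   # last strong reference gone; CPython frees the object immediately
count_after_del = len(registry)
print(count_while_alive, count_after_del)
```

The entry disappears as soon as the handler is freed, with no explicit cleanup call.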

Pattern 2: Event Listeners Not Removed

class EventBus:
    def __init__(self):
        self._listeners = {}   # event_type → list of listeners

    def on(self, event_type, listener):
        self._listeners.setdefault(event_type, []).append(listener)

    def emit(self, event_type, data):
        for listener in self._listeners.get(event_type, []):
            listener(data)


bus = EventBus()

class Widget:
    def __init__(self, name):
        self.name = name
        bus.on("click", self.handle_click)   # bus holds strong ref to self.handle_click

    def handle_click(self, data):
        print(f"{self.name} clicked: {data}")

# Widget instances can never be GC'd as long as bus exists
# Each Widget's bound method holds a reference to the Widget instance
# bus._listeners["click"] → [Widget.handle_click] → Widget

w1 = Widget("button1")
del w1   # does NOT free w1 - bus still holds a reference via bound method

# Fix: use weakref for listener storage
import inspect
import weakref

class EventBus:
    def __init__(self):
        self._listeners: dict[str, list[weakref.ref]] = {}

    def on(self, event_type, listener):
        # A bound method object is created on each attribute access, so a plain
        # weakref.ref to it would die immediately; WeakMethod tracks the instance.
        make_ref = weakref.WeakMethod if inspect.ismethod(listener) else weakref.ref
        ref = make_ref(listener, lambda r: self._remove_dead(event_type, r))
        self._listeners.setdefault(event_type, []).append(ref)

    def _remove_dead(self, event_type, ref):
        if event_type in self._listeners:
            self._listeners[event_type] = [
                r for r in self._listeners[event_type] if r is not ref
            ]

    def emit(self, event_type, data):
        for ref in list(self._listeners.get(event_type, [])):
            listener = ref()
            if listener is not None:
                listener(data)
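
One subtlety when storing listeners weakly: a bound method is a temporary object, so how you take the weak reference matters. The sketch below contrasts weakref.ref with weakref.WeakMethod on a hypothetical Widget class; the immediate behaviour relies on CPython's reference counting.

```python
import weakref


class Widget:
    def handle_click(self, data):
        return f"clicked: {data}"


w = Widget()
plain = weakref.ref(w.handle_click)          # refers to a temporary bound method
method = weakref.WeakMethod(w.handle_click)  # follows the Widget instance instead

plain_alive = plain() is not None
method_alive = method() is not None
result = method()("ok")

del w
method_alive_after_del = method() is not None
print(plain_alive, method_alive, result, method_alive_after_del)
```

The plain weak reference is dead immediately, while the WeakMethod stays usable until the Widget itself is freed.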

Pattern 3: Django QuerySet Caching

# Dangerous in class scope or module scope
class ReportGenerator:
    # The QuerySet is created at class definition time but not evaluated yet.
    # On first iteration Django runs the query and caches every row on the
    # QuerySet object, which lives as long as the class does.
    all_products = Product.objects.filter(active=True)

    def generate(self):
        # First call fills the result cache; later calls reuse the same (stale) rows
        return [format_row(p) for p in self.all_products]

# In practice: Django QuerySets evaluate lazily, but once evaluated,
# the result cache stays on the QuerySet object.
# If the QuerySet is on a class attribute, the result cache is never freed.

# Fix: never store evaluated QuerySets in class attributes
class ReportGenerator:
    def generate(self):
        # Fresh QuerySet each time - evaluated, used, discarded
        products = Product.objects.filter(active=True)
        return [format_row(p) for p in products]   # products collected after function returns

Pattern 4: Large Objects in Closures

def make_processor(large_df):
    """large_df: a pandas DataFrame with millions of rows."""

    def process(key):
        return large_df.loc[key]   # closure captures the entire DataFrame

    return process   # large_df is held in process.__closure__

processor = make_processor(load_huge_dataframe())
# large_df cannot be GC'd - processor.__closure__[0].cell_contents IS the DataFrame

# Fix: extract only what you need before closing over it
def make_processor(large_df):
    lookup = large_df.set_index("key")["value"].to_dict()   # small dict
    del large_df   # optional: process() never references large_df, so it is not captured either way

    def process(key):
        return lookup.get(key)   # captures small dict, not the DataFrame

    return process
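
You can check what a closure actually captures by inspecting the function's code object. The sketch below uses a plain list as a stand-in for the DataFrame, and both factory names are hypothetical.

```python
def make_processor_capturing(rows):
    def process(key):
        return rows[key]          # references rows, so rows is captured
    return process


def make_processor_extracting(rows):
    lookup = {key: value for key, value in enumerate(rows)}

    def process(key):
        return lookup.get(key)    # references only lookup
    return process


rows = ["a", "b", "c"]   # stand-in for a large DataFrame
capturing = make_processor_capturing(rows)
extracting = make_processor_extracting(rows)

print(capturing.__code__.co_freevars)    # names captured from the enclosing scope
print(extracting.__code__.co_freevars)
print(capturing.__closure__[0].cell_contents is rows)
```

Only names that the inner function references appear in co_freevars, so a large argument that the inner function never mentions is not kept alive by the closure.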

Part 9 - tracemalloc Preview

The tracemalloc module (covered fully in Lesson 07) tracks memory allocations at the Python level, enabling precise leak diagnosis:

import tracemalloc
import gc

# Start tracing memory allocations
tracemalloc.start()

# Take a snapshot before the suspect operation
snapshot1 = tracemalloc.take_snapshot()

# Run the suspect code
for _ in range(1000):
    create_some_objects()   # might leak

gc.collect()   # force GC so only true leaks remain

# Take a snapshot after
snapshot2 = tracemalloc.take_snapshot()

# Compare - what grew?
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Top memory growth:")
for stat in top_stats[:10]:
    print(f"  {stat}")

tracemalloc.stop()

The output shows which line of code allocated the memory that grew between the two snapshots, which makes leak diagnosis tractable in production code.
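
For a version that runs on its own, the sketch below plants a deliberate leak and lets tracemalloc find it. handle_request and leaked are hypothetical names, and the buffer size is arbitrary.

```python
import gc
import tracemalloc

leaked = []


def handle_request(i):
    payload = bytearray(10_000)    # per-request buffer
    leaked.append(payload)         # bug: buffer is retained after the request
    return len(payload)


tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(200):
    handle_request(i)
gc.collect()   # rule out cyclic garbage so only retained memory remains

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Entries are sorted by the size of the change, largest first.
top = after.compare_to(before, "lineno")[0]
print(top.traceback[0].lineno, top.size_diff)
```

The top entry points at the line that allocates the bytearray, with a size difference of roughly 200 times the buffer size.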

Common Mistakes

Mistake 1 - Not Collecting Before Checking gc.garbage

# Wrong: gc.garbage may not be populated until you collect
import gc
print(gc.garbage)   # [] - but there may be cyclic garbage waiting

# Right: force collection first
gc.collect()
if gc.garbage:
    print(f"Uncollectable objects: {len(gc.garbage)}")

Mistake 2 - Disabling GC Without Understanding Object Graphs

# Wrong: disable GC without verifying no cycles exist
gc.disable()
run_web_server()   # any cycles created per request accumulate - memory grows unboundedly

# Right: profile first, then decide
import gc
import tracemalloc
tracemalloc.start()
gc.disable()

run_for_100_requests()

before, _ = tracemalloc.get_traced_memory()
collected = gc.collect()
after, _ = tracemalloc.get_traced_memory()
print(f"gc.collect() freed {collected} objects, {before - after} bytes")
# If significant memory is collected by gc.collect(), GC is needed
tracemalloc.stop()
gc.enable()

Mistake 3 - Using gc.collect() as a Performance Fix

# Wrong: calling gc.collect() in a hot loop "just in case"
for record in dataset:
    process(record)
    gc.collect()   # a full collection per record traverses every tracked object - very expensive

# Right: let the GC work automatically; call gc.collect() only when necessary
for record in dataset:
    process(record)

gc.collect()   # once, at the end, if needed

Mistake 4 - Forgetting to Register tp_traverse in C Extensions

When writing a C extension type that holds PyObject* pointers, failing to implement tp_traverse (and to set Py_TPFLAGS_HAVE_GC) means the GC cannot see the references the object holds. Any reference cycle that passes through such an object can never be detected, so everything in the cycle leaks:

// Wrong: no tp_traverse - GC can't see the contained objects
static PyTypeObject MyType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "MyContainer",
    // tp_traverse = NULL  ← GC cannot traverse this type!
};

// Right: implement tp_traverse
static int my_traverse(MyObject *self, visitproc visit, void *arg) {
    Py_VISIT(self->contained_obj);   // tell GC about every held reference
    return 0;
}
static PyTypeObject MyType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "MyContainer",
    .tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC,   // must also set this flag
    .tp_traverse = (traverseproc)my_traverse,
    // A complete type also needs tp_clear and GC-aware allocation/deallocation
};

Graded Practice Challenges

Level 1 - Predict the Output

Question 1: What does this print?

import gc

print(gc.get_threshold())

Show Answer

Output: (700, 10, 10)

These are CPython's default collection thresholds. Gen 0 is collected when the count of tracked objects (allocations minus deallocations since last collection) exceeds 700. Gen 1 is collected every 10th gen 0 collection. Gen 2 every 10th gen 1 collection.

Question 2: What does this print?

import gc

class WithDel:
    def __del__(self):
        print("deleted")

a = WithDel()
b = WithDel()
a.ref = b
b.ref = a

del a, b

print("before collect")
gc.collect()
print("after collect")

Show Answer

Output (Python 3.4+):

before collect
deleted
deleted
after collect

In Python 3.4+ (PEP 442), __del__ is called safely even for objects in cycles. The GC identifies the unreachable cycle, calls __del__ on each object, then frees them. "deleted" appears twice, both between "before collect" and "after collect" - the deletions happen during gc.collect(), not before it.

Note: the order of the two "deleted" lines is not guaranteed.

Question 3: Which of these objects does the cyclic GC track?

import gc

a = 42
b = "hello"
c = [1, 2, 3]
d = (1, 2, 3)
e = {"x": 1}
f = {1, 2, 3}

# Which of a, b, c, d, e, f are tracked by the GC?

Show Answer

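Always tracked: c (list), f (set)

Never tracked: a (int), b (str)

Depends on contents and CPython version: d (tuple), e (dict)

Integers and strings hold no references to other objects, so they cannot form cycles. Lists and sets are tracked for their whole lifetime. Tuples and dicts can be part of cycles through the objects they contain, but CPython applies an optimization: a tuple that contains only untracked objects is untracked once the collector has examined it, and on many CPython versions a dict that holds only atomic keys and values, such as {"x": 1}, is not tracked until a container is stored in it.

You can verify on your interpreter: gc.is_tracked(c) returns True, gc.is_tracked(a) returns False, and gc.is_tracked(d) and gc.is_tracked(e) show what your version does.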

Question 4: What does this print?

import gc

gc.disable()

class Node:
    def __init__(self): self.next = None

a = Node()
b = Node()
a.next = b
b.next = a
del a, b

print(gc.get_count())   # ?
count = gc.collect()
print(count)
print(gc.get_count())   # ?

Show Answer

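Output (approximately; CPython 3.10 and earlier):

(N, 0, 0)   # N varies; the other two values may also be non-zero
4
(0, 0, 0)

gc.get_count() reports collection counters, not object counts. The first value N is the number of tracked allocations minus deallocations since the last gen 0 collection, so it includes the two nodes (and, on CPython 3.10 and earlier, their __dict__s) plus whatever else the interpreter allocated. With GC disabled, nothing resets it. gc.collect() manually triggers a full collection, finding and freeing the 4 objects in the cycle (typically 2 on CPython 3.11+, where instance __dict__s are created lazily). After the collection the counters are reset, so the last line is (0, 0, 0) or very close to it.

The exact first count varies with how many tracked objects were allocated before the test.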

Question 5: Why does this code have a memory leak, and how do you fix it?

class EventEmitter:
    def __init__(self):
        self._listeners = []

    def add_listener(self, fn):
        self._listeners.append(fn)

    def emit(self, data):
        for fn in self._listeners:
            fn(data)

emitter = EventEmitter()

for _ in range(10000):
    class Handler:
        def on_event(self, data): pass
    h = Handler()
    emitter.add_listener(h.on_event)   # bound method holds strong ref to h

Show Answer

The leak: h.on_event is a bound method. Bound methods hold a strong reference to self (the Handler instance). emitter._listeners holds strong references to all 10,000 bound methods. Each bound method keeps its Handler instance alive. Even when the loop variables go out of scope, the 10,000 Handler instances are retained via the listener list.

Fix using weakref:

import weakref

class EventEmitter:
    def __init__(self):
        self._listeners = []

    def add_listener(self, fn):
        # For bound methods: use WeakMethod
        self._listeners.append(weakref.WeakMethod(fn))

    def emit(self, data):
        live = []
        for ref in self._listeners:
            fn = ref()
            if fn is not None:
                fn(data)
                live.append(ref)
        self._listeners = live   # prune dead refs

emitter = EventEmitter()
for _ in range(10000):
    class Handler:
        def on_event(self, data): pass
    h = Handler()
    emitter.add_listener(h.on_event)
# Each Handler is freed as soon as 'h' is rebound on the next iteration

weakref.WeakMethod creates a weak reference to a bound method - the method's self object can be GC'd when no other strong references exist.

Level 2 - Debug Challenge

Find and explain all memory issues:

import gc

# Issue 1: GC disabled globally
gc.disable()

class RequestContext:
    _active = {}

    def __init__(self, request_id):
        self.request_id = request_id
        RequestContext._active[request_id] = self   # strong reference in class dict

    def cleanup(self):
        del RequestContext._active[self.request_id]

# Issue 2: Cycle in callback
class DataPipeline:
    def __init__(self):
        self.transformers = []
        self.on_complete = None

    def add_transformer(self, fn):
        self.transformers.append(fn)

pipeline = DataPipeline()

def completion_handler(result):
    pipeline.on_complete = None   # references pipeline - creates cycle

pipeline.on_complete = completion_handler   # pipeline → on_complete → handler → pipeline

# Issue 3: Leaking __del__ objects in Python < 3.4 context
class LegacyResource:
    def __del__(self):
        pass   # has __del__

a = LegacyResource()
b = LegacyResource()
a.ref = b
b.ref = a   # cycle + __del__ - uncollectable in Python < 3.4
del a, b

Show Solution

Issue 1 - GC disabled with accumulating cyclic garbage:

RequestContext._active is a class-level strong-reference dict. If cleanup() is not called (e.g., on exceptions), instances accumulate forever. With GC disabled, any cycles they form are never collected.

Fix:

import weakref

gc.enable()  # re-enable GC

class RequestContext:
    _active = weakref.WeakValueDictionary()  # instances freed when no other refs exist

    def __init__(self, request_id):
        self.request_id = request_id
        RequestContext._active[request_id] = self

Issue 2 - Cycle via callback:

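pipeline → on_complete → completion_handler → pipeline. As written at module level, completion_handler reaches pipeline through the module's globals rather than a closure cell, so del pipeline still frees the object. Put the same code inside a function and pipeline is captured in a closure cell: the handler then holds a strong reference to the pipeline that holds the handler. That is a true reference cycle, which only the cyclic collector can reclaim, and which is never reclaimed while the gc.disable() from Issue 1 is in effect.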

Fix:

import weakref

pipeline_ref = weakref.ref(pipeline)

def completion_handler(result):
    p = pipeline_ref()
    if p is not None:
        p.on_complete = None  # no cycle - weakref doesn't count

pipeline.on_complete = completion_handler
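To see what the weak reference changes, the sketch below builds both variants inside functions, so that pipeline is a genuine closure variable. The helper names build_with_cycle and build_without_cycle are illustrative only, and the automatic collector is switched off temporarily so that only reference counting is at work.

```python
import gc
import weakref


class DataPipeline:
    def __init__(self):
        self.on_complete = None


def build_with_cycle():
    pipeline = DataPipeline()

    def completion_handler(result):
        pipeline.on_complete = None      # closes over 'pipeline' strongly

    pipeline.on_complete = completion_handler
    return weakref.ref(pipeline)         # probe only; does not keep it alive


def build_without_cycle():
    pipeline = DataPipeline()
    pipeline_ref = weakref.ref(pipeline)

    def completion_handler(result):
        p = pipeline_ref()               # closes over the weak reference only
        if p is not None:
            p.on_complete = None

    pipeline.on_complete = completion_handler
    return weakref.ref(pipeline)


was_enabled = gc.isenabled()
gc.disable()                             # leave only reference counting active
try:
    cyclic_probe = build_with_cycle()
    acyclic_probe = build_without_cycle()
    cyclic_alive = cyclic_probe() is not None     # kept alive by the cycle
    acyclic_alive = acyclic_probe() is not None   # already freed by refcounting
    gc.collect()                                  # explicit cycle collection
    cyclic_alive_after = cyclic_probe() is not None
finally:
    if was_enabled:
        gc.enable()

print(cyclic_alive, acyclic_alive, cyclic_alive_after)
```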

Issue 3 - __del__ in cycle (Python < 3.4 concern):

In Python < 3.4, LegacyResource instances in cycles with __del__ are uncollectable (gc.garbage fills up). In Python 3.4+, PEP 442 handles this correctly.

Fix for cross-version compatibility:

import weakref

class LegacyResource:
    def __init__(self):
        # Use weakref.finalize instead of __del__
        self._finalizer = weakref.finalize(self, self.__class__._cleanup)

    @staticmethod
    def _cleanup():
        pass  # cleanup logic here - not a method, so no strong ref to self
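A small demonstration of the same idea with a callback that does something observable. The released list is a hypothetical stand-in for real cleanup work, and the callback receives the data it needs as an argument instead of touching self.

```python
import gc
import weakref

released = []   # hypothetical log standing in for real cleanup work


class LegacyResource:
    def __init__(self, name):
        self.name = name
        # Pass the callback plain data, never self or a bound method of self,
        # otherwise the finalizer itself would keep the object alive.
        self._finalizer = weakref.finalize(self, released.append, name)


a = LegacyResource("a")
b = LegacyResource("b")
a.ref = b
b.ref = a          # reference cycle, as in the exercise
del a, b

gc.collect()       # the cycle is unreachable, so both finalizers run
print(sorted(released))
```

The order in which the two callbacks run is not something to rely on, which is why the output is sorted before printing.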

Level 3 - Design Challenge

Design an ObjectTracker utility that:

Tracks all live instances of a given class using weak references
Provides count() - number of currently alive instances
Provides all() - list of all currently alive instances
Provides diagnose() - for each alive instance, show its gc.get_referrers() summary
Does NOT prevent GC of tracked instances
Works as a class decorator

Show Reference Solution

import gc
import weakref
from typing import Type, TypeVar

T = TypeVar("T")


class ObjectTracker:
    """
    Class decorator that tracks live instances without preventing GC.

    Usage:
        @ObjectTracker
        class MyService:
            def __init__(self, name):
                self.name = name

        s1 = MyService("auth")
        s2 = MyService("db")

        print(MyService.tracker.count())   # 2
        print(MyService.tracker.all())     # [MyService('auth'), MyService('db')]
        del s1
        print(MyService.tracker.count())   # 1
    """

    def __init__(self, cls: Type[T]):
        self._cls = cls
        self._instances: list[weakref.ref] = []
        self._original_init = cls.__init__

        # Patch __init__ to register each new instance
        tracker = self  # capture self for the closure

        def patched_init(instance, *args, **kwargs):
            tracker._original_init(instance, *args, **kwargs)
            ref = weakref.ref(instance, tracker._on_finalize)
            tracker._instances.append(ref)

        cls.__init__ = patched_init
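        cls.tracker = self   # reachable from instances: obj.tracker
        self.tracker = self  # the decorated name is this tracker, so Service.tracker resolves here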

        # Preserve the class identity (name, module, docstring)
        self.__name__ = cls.__name__
        self.__qualname__ = cls.__qualname__
        self.__doc__ = cls.__doc__
        self.__module__ = cls.__module__

    def __call__(self, *args, **kwargs):
        """Allow ObjectTracker to be used as the class itself."""
        return self._cls(*args, **kwargs)

    def _on_finalize(self, ref: weakref.ref) -> None:
        """Called by weakref when an instance is GC'd - prune dead refs."""
        self._instances = [r for r in self._instances if r() is not None]

    def _live_instances(self) -> list:
        """Return all instances that are still alive."""
        live = []
        for ref in self._instances:
            obj = ref()
            if obj is not None:
                live.append(obj)
        return live

    def count(self) -> int:
        """Number of currently live instances."""
        return len(self._live_instances())

    def all(self) -> list:
        """All currently live instances."""
        return self._live_instances()

    def diagnose(self) -> None:
        """Print referrer summary for each live instance."""
        instances = self._live_instances()
        print(f"\nObjectTracker.diagnose() - {self.__name__}: {len(instances)} live instances")
        for i, obj in enumerate(instances):
            referrers = gc.get_referrers(obj)
            # Filter out the diagnostic machinery itself
            referrers = [
                r for r in referrers
                if r is not instances and r is not self._instances
            ]
            print(f"\n  Instance {i}: {obj!r}")
            for r in referrers[:5]:   # show top 5 referrers
                print(f"    Held by: {type(r).__name__}", end="")
                if isinstance(r, dict):
                    keys = [k for k, v in r.items() if v is obj]
                    if keys:
                        print(f" (keys: {keys})", end="")
                elif isinstance(r, list):
                    print(f" (len={len(r)})", end="")
                elif isinstance(r, type):
                    print(f" (class {r})", end="")
                print()


# Usage
@ObjectTracker
class Service:
    def __init__(self, name: str):
        self.name = name

    def __repr__(self):
        return f"Service({self.name!r})"


s1 = Service("auth")
s2 = Service("db")
s3 = Service("cache")

print(Service.tracker.count())   # 3
print(Service.tracker.all())     # [Service('auth'), Service('db'), Service('cache')]

del s2
gc.collect()                     # optional on CPython: refcounting already freed s2
print(Service.tracker.count())   # 2

Service.tracker.diagnose()
# Shows referrers for the 2 live Service instances

Design decisions:

Uses weakref.ref with a finalize callback to prune dead refs automatically - no manual cleanup needed
__call__ is implemented so @ObjectTracker does not break Service(...) instantiation syntax
diagnose() filters its own temporary list of live instances out of the gc.get_referrers() output - otherwise that list would show up as a referrer of every instance. The tracker's weak references never appear, because weak references are not reported as referrers
_on_finalize is a single cleanup callback registered per-instance - O(n) cleanup but pruning keeps the list compact
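One trade-off of the design above is that the decorated name is bound to the tracker object, so isinstance(obj, Service) no longer works. A minimal alternative sketch, assuming instances are hashable and support weak references, returns the class itself and keeps instances in a weakref.WeakSet. The names track_instances and _live are hypothetical, not part of the reference solution.

```python
import gc
import weakref
from functools import wraps


def track_instances(cls):
    """Hypothetical class decorator: record live instances in a WeakSet."""
    live = weakref.WeakSet()
    original_init = cls.__init__

    @wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        live.add(self)          # weak entry: does not keep self alive

    cls.__init__ = patched_init
    cls._live = live            # hypothetical attribute name
    return cls                  # the decorated name is still the class


@track_instances
class Service:
    def __init__(self, name):
        self.name = name


s1 = Service("auth")
s2 = Service("db")
count_before = len(Service._live)
is_instance = isinstance(s1, Service)

del s2
gc.collect()                    # optional on CPython
count_after = len(Service._live)

print(count_before, is_instance, count_after)
```

The WeakSet prunes dead entries itself, so no finalize callback is needed. The cost is that unhashable instances (for example classes that define __eq__ without __hash__) cannot be tracked this way.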

Key Takeaways

CPython's cyclic garbage collector exists because reference counting cannot collect reference cycles - objects that reference each other in a loop will never have ob_refcnt reach zero
The GC is generational: three generations (0, 1, 2) with thresholds (700, 10, 10) by default - most objects die in gen 0, long-lived objects migrate to gen 2
The cycle detection algorithm copies refcounts, subtracts internal references via tp_traverse, and identifies objects with zero remaining external references - only container objects (lists, dicts, sets, instances) are tracked
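gc.collect() returns the number of unreachable objects it found, and that count includes instance __dict__ objects as well as the instances themselves - this is why collecting two Node objects in a cycle returns 4, not 2, in this lesson's example (the exact number can differ between CPython versions)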
gc.disable() is legitimate for performance-critical batch workloads with no cycles, but causes unbounded memory growth if your code creates cyclic garbage
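gc.freeze() (Python 3.7+) moves every object the GC currently tracks into a permanent generation that later collections ignore; called before fork(), it stops the collector from writing to those objects' GC headers in worker processes, which would otherwise force copy-on-write page copies - used by Instagram, Gunicorn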
PEP 442 (Python 3.4+) made __del__ safe for objects in cycles - before 3.4, such objects were uncollectable and accumulated in gc.garbage
gc.get_referrers(obj) is the primary tool for diagnosing unexpected memory retention - it shows every object that holds a reference to obj
Common memory leak patterns: class-level containers, event listeners not removed, evaluated QuerySets cached on long-lived objects, large objects captured in closures
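Atomic immutable objects (int, str, float, bytes) cannot form cycles and are never tracked by the cyclic GC - reference counting alone handles them with zero GC overhead. A tuple can hold references, so it is judged by its contents: CPython untracks a tuple once it sees that it contains only untracked objects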
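Several of these points can be checked directly from the interpreter. The sketch below is only an illustration, using gc.is_tracked() to show which objects the collector watches and a two-node cycle to show that del alone does not free cyclic garbage. The number returned by gc.collect() is printed, not assumed, because it depends on the interpreter version.

```python
import gc

# Atomic values are never tracked; containers that can hold references are.
tracked = {
    "int": gc.is_tracked(42),
    "str": gc.is_tracked("text"),
    "list": gc.is_tracked([]),
    "dict_with_list": gc.is_tracked({"k": []}),
}
print(tracked)


class Node:
    def __init__(self):
        self.next = None


gc.collect()                    # start from a clean slate

a, b = Node(), Node()
a.next, b.next = b, a           # reference cycle
del a, b                        # refcounts stay above zero: nothing is freed yet

found = gc.collect()            # unreachable objects found by the cyclic collector
print(found)

print(gc.get_threshold())       # current generation thresholds for this interpreter
```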

What's Next

Lesson 07 covers tracemalloc and memory profiling - the complete toolkit for finding memory leaks in production Python. You will learn to take allocation snapshots, diff them to identify growing allocations, trace leaks to their source line, and interpret tracemalloc output to fix real memory issues in Django, FastAPI, and data pipeline applications.
