Garbage Collection - Generational GC, Cycle Detection, and Memory Leak Diagnosis
Reading time: ~30 minutes | Level: Intermediate → Engineering
Before reading further, predict the output:
import gc

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

a = Node(1)
b = Node(2)
a.next = b
b.next = a  # cycle
del a, b
print(gc.collect())  # ?
Show Answer
Output: 4 on CPython 3.10 and earlier. CPython 3.11 and later typically print 2, for the reason given at the end of this answer.
Most engineers expect 2 - there are two Node objects in the cycle. But gc.collect() returns the number of unreachable objects it found, and on those older versions that includes:
- The Node(1) instance
- The Node(2) instance
- The __dict__ of Node(1) (its instance attribute dictionary)
- The __dict__ of Node(2) (its instance attribute dictionary)
There, each class instance stores its attributes in a real __dict__. Those dicts are themselves container objects tracked by the GC. Both dicts are part of the cycle:
Node(1) → __dict__ → {"val": 1, "next": Node(2)}
Node(2) → __dict__ → {"val": 2, "next": Node(1)}
The cycle involves 4 container objects, so gc.collect() returns 4. Since CPython 3.11, instance attributes are stored inline and the __dict__ object is created only when something asks for it, so the same cycle contains just the two instances. Either way, understanding the number requires knowing which objects the GC tracks and how cycle detection counts its results.
Reference counting is CPython's primary memory management mechanism (Lesson 05), but it has a fundamental blind spot: reference cycles. Two objects that reference each other will never have their ob_refcnt reach zero, even when the rest of the program has forgotten about them entirely. Python's cyclic garbage collector exists to handle exactly this case. Understanding it - its generations, its thresholds, its collection algorithm, and its production knobs - is essential for writing leak-free long-running Python services.
What You Will Learn
- Why reference counting is insufficient: the reference cycle problem
- CPython's cyclic GC: generational, three generations (gen 0, 1, 2)
- How cycle detection works: the mark-and-sweep algorithm for container objects
- The gc module: collect(), get_count(), get_threshold(), set_threshold()
- gc.disable() and gc.enable(): when and how to disable the GC safely
- __del__ and the finalizer resurrection problem - and how PEP 442 solved it
- gc.get_referrers() and gc.get_referents() for debugging memory leaks
- gc.freeze() (Python 3.7+): why fork-based deployments such as Instagram's call it before forking
- Memory leak patterns: event listeners, class attributes, QuerySet caching
- tracemalloc: preview of the full memory profiling lesson
Prerequisites
- Lesson 05: Reference Counting - you need to understand ob_refcnt and why cycles defeat it
- Lesson 01: CPython Architecture - Python objects as C structs
- Familiarity with gc module basics
Part 1 - Why Reference Counting Is Not Enough
The Cycle Problem Revisited
As established in Lesson 05: when two objects reference each other, their ob_refcnt fields never reach zero, even after all external references are deleted:
import sys, gc

gc.disable()  # prevent GC from interfering with this demonstration

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

a = Node(1)
b = Node(2)
a.next = b
b.next = a
# External references: 'a' and 'b' names
print(sys.getrefcount(a) - 1)  # 2: name 'a' + b.next
print(sys.getrefcount(b) - 1)  # 2: name 'b' + a.next
del a, b
# External references removed - but cycles remain:
# Node(1).next → Node(2) - refcount still 1
# Node(2).next → Node(1) - refcount still 1
# Both nodes are UNREACHABLE but cannot be freed by reference counting
# Only the GC can detect and collect these
print(gc.collect())  # 4 on CPython <= 3.10 (two Nodes + two __dict__s); typically 2 on 3.11+
gc.enable()
What Objects the GC Tracks
The cyclic GC only tracks container objects - objects that can hold references to other objects:
- list, tuple, dict, set, frozenset
- Class instances (anything with __dict__)
- Extension types that implement tp_traverse in C
- function and frame objects
The GC does not track:
- int, float, bool, complex - scalar values cannot form cycles
- str, bytes, bytearray - they store raw characters or bytes, never references to other objects
- None, True, False - singletons
This distinction is critical: immutable scalars are always collected immediately by reference counting; the GC only needs to handle containers.
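The tracked/untracked split can be checked directly with gc.is_tracked(). The sketch below uses a hypothetical Point class and a handful of sample values; it only covers the clear-cut cases, since tuples and dicts holding nothing but atomic values can be untracked lazily by CPython.

```python
import gc


class Point:
    """Hypothetical class used only to produce a plain instance."""

    def __init__(self, x):
        self.x = x


samples = {
    "int": 42,
    "str": "hello",
    "list": [1, 2, 3],
    "set": {1, 2, 3},
    "instance": Point(1),
}

# gc.is_tracked() reports whether the cyclic GC currently monitors the object
tracked = {label: gc.is_tracked(obj) for label, obj in samples.items()}

for label, flag in tracked.items():
    print(f"{label:10} tracked={flag}")
```

Scalars and strings report False; lists, sets, and class instances report True.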
Part 2 - Generational Collection
The Generational Hypothesis
The cyclic GC uses a generational strategy based on the empirical observation that:
Most objects die young. Objects that survive multiple GC cycles are likely to survive many more.
This is the generational hypothesis, first described for Lisp and Smalltalk in the 1980s and now used by virtually every production garbage collector.
CPython implements three generations:
| Generation | Label | Objects | Collected when |
|---|---|---|---|
| 0 | Young | Newly allocated containers | Most frequently |
| 1 | Middle | Survived one gen 0 collection | Less frequently |
| 2 | Old | Long-lived objects | Least frequently |
Generation Thresholds
import gc
# Default thresholds: (700, 10, 10)
print(gc.get_threshold()) # (700, 10, 10)
# Interpretation:
# - Collect gen 0 when (allocations - deallocations) > 700
# - Collect gen 1 after gen 0 has been collected 10 times
# - Collect gen 2 after gen 1 has been collected 10 times
# Default: gen 2 collected every 700 * 10 * 10 = 70,000 net allocations
The threshold (700, 10, 10) means:
- Gen 0: collected when the number of tracked objects allocated since the last collection exceeds 700
- Gen 1: collected every 10th time gen 0 is collected
- Gen 2: collected every 10th time gen 1 is collected
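The arithmetic behind these thresholds can be sketched as follows. The helper name collection_cadence is an assumption for illustration, and the result is only a rough estimate: CPython applies extra conditions before running a full collection, and threshold handling differs in newer releases. The second half shows a set_threshold() call that restores the previous values afterwards.

```python
import gc


def collection_cadence(thresholds):
    """Approximate net allocations between collections of gen 0, 1 and 2."""
    t0, t1, t2 = thresholds
    return (t0, t0 * t1, t0 * t1 * t2)


print(collection_cadence((700, 10, 10)))

# Temporarily raise the gen 0 threshold, then put the old values back
original = gc.get_threshold()
try:
    gc.set_threshold(1000, 15, 15)
    gen0_limit = gc.get_threshold()[0]
finally:
    gc.set_threshold(*original)
restored = gc.get_threshold()
```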
Observing Generation Counts
import gc
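# gc.get_count() returns (count0, count1, count2):
#   count0 - tracked allocations minus deallocations since the last gen 0 collection
#   count1 - gen 0 collections since the last gen 1 collection
#   count2 - gen 1 collections since the last gen 2 collection
print(gc.get_count())  # e.g., (42, 3, 1)
# Allocate a bunch of objects and watch the gen 0 counter
for _ in range(100):
    x = [i for i in range(10)]  # one new list per iteration; the previous one is freed on rebinding
print(gc.get_count())  # count0 barely moves: every freed list decrements it again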
# Force a collection
collected = gc.collect(0) # collect only gen 0
print(f"Gen 0 collected: {collected} objects")
print(gc.get_count())
collected = gc.collect(1) # collect gen 0 + gen 1
print(f"Gen 1 collected: {collected} objects")
collected = gc.collect(2) # full collection (all generations)
print(f"Full collection: {collected} objects")
Part 3 - How Cycle Detection Works
The Mark-and-Sweep Algorithm
CPython's cycle detector uses a modified mark-and-sweep algorithm. Here is how it works conceptually:
Step 1: Copy refcounts
For every tracked object in the generation being collected, copy ob_refcnt into a shadow field (gc_refs). This working copy is modified during detection without affecting the actual object graph.
Step 2: Subtract internal references
For every object, traverse all references it holds (via tp_traverse) and decrement gc_refs of each referenced object. After this step, gc_refs counts only external references - references from objects outside the current generation.
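Step 3: Identify externally reachable objects
Objects with gc_refs > 0 after step 2 are referenced from outside the generation, so they are definitely alive. Objects with gc_refs == 0 are only tentatively unreachable: every reference to them comes from inside the generation.
Step 4: Rescue everything reachable from a live object
Starting from the objects with gc_refs > 0, follow tp_traverse again. Anything reached this way is alive too, even if its own gc_refs dropped to 0, so it moves back to the reachable set. Whatever remains is reachable only from other garbage: these are the unreachable cycles.
Step 5: Finalize and break the cycles
For the remaining objects, the GC handles weak references, runs finalizers (__del__), and then calls tp_clear to drop the references each object holds. Breaking the cycle this way lets ordinary reference counting drive the refcounts to zero and run tp_dealloc.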
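The detection steps can be simulated on a toy object graph. This is a conceptual sketch, not CPython's implementation: objects are plain names, refcounts and edges are hand-written dictionaries, and find_unreachable is a hypothetical helper.

```python
def find_unreachable(refcounts, edges):
    """Return the names that only other garbage can reach.

    refcounts: name -> total reference count of that object
    edges:     name -> names of the objects it references
    """
    # Step 1: work on a copy of the refcounts
    gc_refs = dict(refcounts)

    # Step 2: subtract every reference that originates inside the set
    for targets in edges.values():
        for dst in targets:
            if dst in gc_refs:
                gc_refs[dst] -= 1

    # Step 3: anything still above zero is referenced from outside
    reachable = {name for name, count in gc_refs.items() if count > 0}

    # Step 4: everything reachable from a live object is live as well
    stack = list(reachable)
    while stack:
        node = stack.pop()
        for dst in edges.get(node, ()):
            if dst in gc_refs and dst not in reachable:
                reachable.add(dst)
                stack.append(dst)

    return set(gc_refs) - reachable


# a <-> b is an orphaned cycle; c <-> d is a cycle with one external reference to c
refcounts = {"a": 1, "b": 1, "c": 2, "d": 1}
edges = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
garbage = find_unreachable(refcounts, edges)
print(sorted(garbage))
```

Note that d ends step 2 with a working count of zero, yet it survives because the traversal from c reaches it.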
Why the Count Is 4, Not 2
This explains the opening puzzle: gc.collect() returned 4 because the cycle involves four container objects - the two Node instances and their two __dict__ instances. Each class instance's attributes are stored in a dict, which is itself a tracked container object participating in the cycle.
Part 4 - The gc Module API
Core Functions
import gc
# --- Collection ---
# Collect all generations (equivalent to gc.collect(2))
n = gc.collect()
print(f"Collected {n} objects")
# Collect specific generation (0, 1, or 2)
gc.collect(0) # gen 0 only (fastest)
gc.collect(1) # gen 0 + gen 1
gc.collect(2) # all generations (slowest, most thorough)
# --- Inspection ---
# Collection counters (not the number of objects living in each generation)
print(gc.get_count()) # (count0, count1, count2)
# Current collection thresholds
print(gc.get_threshold()) # (700, 10, 10) by default
# Set thresholds
gc.set_threshold(1000, 15, 15) # less frequent collection
# --- Control ---
gc.disable() # stop automatic collection
gc.enable() # resume automatic collection
print(gc.isenabled()) # True/False
# --- Statistics (Python 3.4+) ---
stats = gc.get_stats()
# Returns list of 3 dicts, one per generation:
# [{'collections': N, 'collected': N, 'uncollectable': N}, ...]
for gen, stat in enumerate(stats):
    print(f"Gen {gen}: {stat}")
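For per-collection visibility rather than cumulative totals, the gc module also exposes gc.callbacks, a list of callables invoked before and after each collection. A minimal sketch follows; the record_collection name and the events list are illustrative.

```python
import gc

events = []


def record_collection(phase, info):
    # phase is "start" or "stop"; info is a dict with the keys
    # "generation", "collected" and "uncollectable"
    events.append((phase, info["generation"], info["collected"]))


gc.callbacks.append(record_collection)
try:
    gc.collect()
finally:
    gc.callbacks.remove(record_collection)  # always unregister the hook

phases = [phase for phase, _, _ in events]
for event in events:
    print(event)
```

Callbacks run inside the collector, so they should stay cheap; appending to a list or incrementing a metric is about the right size.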
gc.get_referrers and gc.get_referents
These are the most powerful tools for debugging memory leaks:
import gc
class LeakyService:
    instances = []  # class-level list - dangerous!

    def __init__(self, name):
        self.name = name
        LeakyService.instances.append(self)  # keeps all instances alive forever

svc1 = LeakyService("auth")
svc2 = LeakyService("db")
# Who is referencing svc1?
referrers = gc.get_referrers(svc1)
for r in referrers:
    print(f"Referrer type: {type(r).__name__}")
    if isinstance(r, list):
        print(f"  List id={id(r)}, len={len(r)}")
    elif isinstance(r, dict):
        keys = [k for k, v in r.items() if v is svc1]
        print(f"  Dict keys pointing to svc1: {keys}")
# Will show: LeakyService.instances list is holding svc1
# What does svc1 refer to?
referents = gc.get_referents(svc1)
for ref in referents:
    print(f"Referent: {type(ref).__name__}: {ref!r}")
# Will show: svc1's __dict__, svc1's __class__, etc.
:::tip Use gc.get_threshold() to Understand When Collections Trigger
In batch workloads (data pipelines, one-shot scripts), you may want to tune GC thresholds. Increase thresholds to collect less frequently (lower CPU overhead, higher peak memory). Decrease to collect more aggressively (lower peak memory, higher CPU overhead). Measure both before changing defaults.
import gc
# For a batch job that allocates many short-lived objects:
# Raise threshold to reduce GC overhead during the hot path
gc.set_threshold(10000, 20, 20)
run_batch_job()
gc.collect() # force full collection at end
:::
Part 5 - When to Disable the GC
gc.disable() in Production
Disabling the GC is legitimate in specific scenarios:
import gc
# Pattern 1: Disable during performance-critical initialization
gc.disable()
try:
    # Build large data structures - no GC interruptions
    data = {i: [j for j in range(100)] for i in range(10000)}
    index = build_search_index(data)
finally:
    gc.enable()
    gc.collect()  # clean up any cycles created during initialization

# Pattern 2: Long-running batch job with no cycles (all objects are short-lived)
# Reference counting handles everything - GC overhead adds up over millions of allocations
gc.disable()
for record in massive_dataset:
    result = process(record)  # record and result have no cycles
    write(result)
gc.enable()
gc.collect()
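Both patterns share the same shape: pause collection, do the work, restore the previous state. One way to package that is a small context manager. The name gc_paused is hypothetical; the sketch restores the GC only if it was enabled on entry, so nested uses do not accidentally switch it back on.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable automatic collection inside the block, then restore the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


gc.enable()
with gc_paused():
    inside = gc.isenabled()
    table = {i: [i] * 10 for i in range(1000)}  # bulk build without GC pauses
after = gc.isenabled()
print(inside, after)
```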
Instagram's Approach: gc.disable() at Startup
Instagram's engineering team documented disabling the GC in their Django/uWSGI stack in the 2017 post "Dismissing Python Garbage Collection at Instagram":
- Before forking workers, they call gc.disable() and gc.collect() to ensure a clean heap
- Worker processes never run GC automatically - they rely on reference counting only
- This reduced CPU overhead significantly for their workload (mostly short-lived request objects with no cycles)
- Periodically they restart workers, which naturally clears any leaked memory
This is only safe if your workload genuinely produces no reference cycles (or you accept that cyclic garbage accumulates until process restart).
:::danger gc.disable() in Production Without Understanding Your Memory Patterns Causes Unbounded Growth
Disabling the GC means cycles accumulate forever. In a web server handling requests that create Django ORM objects, each request may create many cycles (ORM instances that reference their model class, which references the module, creating large retained graphs). With GC disabled, these never get collected. Memory grows without bound until the process is killed.
Before disabling the GC, verify with tracemalloc (Lesson 07) that your workload does not create significant cyclic garbage.
:::
:::note Immutable Objects Cannot Form Cycles - The GC Ignores Them
int, float, str, bytes, bool, None - none of these can form reference cycles, and CPython never registers them with the cyclic GC. A tuple that contains only such objects cannot form a cycle either; CPython stops tracking it once a collection has examined it. All of them are freed immediately when their ob_refcnt reaches zero, with no GC overhead. If your workload is dominated by these types, the cyclic GC's overhead is minimal regardless of threshold settings.
:::
Part 6 - __del__ and the Finalizer Problem
The Resurrection Problem (Python < 3.4)
Before Python 3.4, objects with __del__ methods that were part of reference cycles were problematic. The collector could not know in which order to run the finalizers, or whether one of them would revive an object, so it refused to free the cycle at all:
# The problem in Python < 3.4 (historical, for understanding)
import gc

class Problematic:
    def __del__(self):
        # What if __del__ stores 'self' somewhere?
        # The GC would have to collect it, but what if __del__ revives it?
        global resurrected
        resurrected = self  # "resurrection" - object escapes deletion!

a = Problematic()
b = Problematic()
a.other = b
b.other = a  # cycle
del a, b
gc.collect()
# Pre-3.4: the collector does not free these objects and does not call __del__
# Instead it appends them to gc.garbage, the list of uncollectable objects
print(gc.garbage)  # [<Problematic object>, <Problematic object>]
PEP 442: Safe Object Finalization (Python 3.4+)
PEP 442 (implemented in Python 3.4) resolved the finalizer problem:
- The GC identifies the unreachable set (objects only reachable from each other)
- For each unreachable object with __del__, the GC calls __del__ before freeing memory
- After all finalizers run, the GC checks if any objects were "resurrected" (new strong references created)
- Objects that were not resurrected are freed
- Objects that were resurrected survive (their refcount is now > 0 again)
import gc
class SafeFinalizer:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        print(f"__del__ called on {self.name}")

# Modern Python (3.4+): __del__ is called safely even in cycles
a = SafeFinalizer("A")
b = SafeFinalizer("B")
a.other = b
b.other = a  # cycle
del a, b
gc.collect()
# Output (the order of the two lines is not guaranteed):
# __del__ called on A
# __del__ called on B
# Both freed correctly - no gc.garbage accumulation
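Resurrection itself can be observed on CPython 3.4+. The sketch below uses a hypothetical Phoenix class whose finalizer stores the object in a module-level list. Under PEP 442 the object survives the first collection, and its __del__ is not run a second time when it is finally freed.

```python
import gc

survivors = []


class Phoenix:
    finalize_calls = 0

    def __del__(self):
        Phoenix.finalize_calls += 1
        survivors.append(self)  # new strong reference: the object is resurrected


p = Phoenix()
p.me = p  # self-cycle, so only the cyclic GC can reclaim it
del p

gc.collect()
resurrected = len(survivors)  # the object escaped the first collection
calls_after_first = Phoenix.finalize_calls

survivors.clear()  # drop the rescuing reference; the self-cycle remains
gc.collect()
calls_after_second = Phoenix.finalize_calls  # unchanged: the finalizer already ran

print(resurrected, calls_after_first, calls_after_second)
```

The takeaway for cleanup code: a finalizer that leaks self delays reclamation and cannot count on being called again.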
Checking gc.garbage
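gc.garbage is a list where CPython puts objects it found unreachable but could not free. Since Python 3.4 this is rare: it mostly involves C extension types that still use the legacy tp_del slot, or sessions where gc.set_debug(gc.DEBUG_SAVEALL) is active, in which case every unreachable object is saved there instead of being freed. An object that resurrects itself in __del__ simply stays alive and is not added to gc.garbage:
import gc
print(gc.garbage)  # should be [] in well-written code
# If gc.garbage is non-empty, you have uncollectable objects
if gc.garbage:
    print(f"WARNING: {len(gc.garbage)} uncollectable objects")
    for obj in gc.garbage:
        print(f"  {type(obj).__name__}: {repr(obj)[:100]}")
    # Decide: clear it (losing the objects) or investigate
    gc.garbage.clear()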
:::warning Objects with __del__ in Python < 3.4 Are Uncollectable If in a Cycle
If you must support Python < 3.4, or work with C extensions using the old finalizer protocol, avoid __del__ on objects that participate in cycles. On Python 3.4 and later, prefer weakref.finalize (it was added in 3.4, so it is not available on older interpreters) - it does not block cycle collection.
import weakref

def cleanup(resource):
    resource.close()

class LegacySafeResource:
    def __init__(self, resource):
        self._resource = resource
        # weakref.finalize does NOT block GC of LegacySafeResource.
        # The callback receives the resource, never self, so it cannot keep self alive.
        self._finalizer = weakref.finalize(self, cleanup, resource)
:::
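A runnable version of the same idea, with two objects in a cycle. Connection and the closed list are hypothetical stand-ins for a real resource and its cleanup; the callback is given the name rather than the object, so it holds no reference back to the instance.

```python
import gc
import weakref

closed = []  # records which resources were cleaned up


class Connection:
    def __init__(self, name):
        self.name = name
        # The callback gets the name, not self, so it cannot keep self alive
        self._finalizer = weakref.finalize(self, closed.append, name)


a = Connection("a")
b = Connection("b")
a.peer = b
b.peer = a  # reference cycle
del a, b

gc.collect()  # the cycle is collected and both finalizers fire
print(sorted(closed))
```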
Part 7 - gc.freeze() for Fork-Based Servers
The Copy-on-Write Problem
Many Python web servers (uWSGI, Gunicorn) use os.fork() to create worker processes. Fork copies the parent's memory pages to the child, but modern OS kernels use copy-on-write (COW) - pages are only physically copied when written.
If the GC runs in a worker process and touches many objects (incrementing/decrementing shadow refcount fields during traversal), it triggers writes to pages that were previously shared with the parent. This causes COW pages to be copied, increasing memory usage per worker process.
# In a Gunicorn/uWSGI server:
# 1. Parent process loads Django: 100MB of module-level objects in gen 2
# 2. Fork 8 workers
# 3. If GC runs in workers (scanning gen 2), it writes to gen 2's gc_refs
# 4. COW: those pages are now private per-worker - 100MB × 8 workers = 800MB
# 5. Without GC touching gen 2: pages stay shared - 100MB + small per-worker delta
gc.freeze(): Protect Gen 2 from GC Traversal
gc.freeze() (Python 3.7+) moves all objects currently in all three generations into a permanent generation that is never traversed by the cyclic GC:
import gc
# Before forking: move all current objects to "frozen" generation
# These are module-level objects: Django models, URL patterns, middleware, etc.
gc.freeze()
# Now fork workers
# In each worker: the GC only tracks NEW objects (those created after fork)
# Frozen gen 2 objects are never traversed - COW pages stay shared
# After workers start:
# gc.get_freeze_count() shows how many objects are frozen
print(gc.get_freeze_count()) # e.g., 50000
# To unfreeze (rarely needed):
# gc.unfreeze()
gc.freeze() was contributed to CPython by Instagram engineers for exactly this pre-fork pattern. The savings depend on how much of the preloaded heap the workers would otherwise dirty, so measure per-worker private memory before and after; on hosts running many workers the difference can reach hundreds of megabytes.
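# Typical Gunicorn hook pattern:
# gunicorn_config.py
# Freezing only helps if the application is imported in the master process,
# which requires preload_app = True
preload_app = True

def pre_fork(server, worker):
    pass

def when_ready(server):
    """Called in the master once the server has started, before workers are forked."""
    import gc
    gc.freeze()
    server.log.info(f"GC frozen: {gc.get_freeze_count()} objects protected from COW")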
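The freeze API can be explored without forking. In this sketch the config list is a stand-in for preloaded application state. For real pre-fork servers, the gc module documentation suggests calling gc.disable() early in the parent, gc.freeze() right before fork(), and gc.enable() early in each child.

```python
import gc

# Stand-in for application state loaded before forking
config = [{"route": f"/page/{i}"} for i in range(1000)]

gc.collect()  # clear existing garbage so it is not frozen along with live objects
gc.freeze()  # move everything currently tracked into the permanent generation
frozen = gc.get_freeze_count()

gc.unfreeze()  # put the objects back into the oldest generation
thawed = gc.get_freeze_count()

print(f"frozen={frozen}, after unfreeze={thawed}")
```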
Part 8 - Memory Leak Patterns
Pattern 1: Global Containers That Accumulate
# Classic leak: class-level container that grows unboundedly
class RequestHandler:
    _all_handlers = []  # class-level - holds ALL instances ever created

    def __init__(self, request_id):
        self.request_id = request_id
        RequestHandler._all_handlers.append(self)  # strong reference forever!

    def handle(self):
        return f"Handling {self.request_id}"

# In a web server: each request creates a RequestHandler
# They accumulate in _all_handlers and are NEVER freed
# Memory grows linearly with request count - server eventually OOMs
# Fix: use WeakValueDictionary or don't store in class-level container
import weakref

class RequestHandler:
    _all_handlers = weakref.WeakValueDictionary()

    def __init__(self, request_id):
        self.request_id = request_id
        RequestHandler._all_handlers[request_id] = self
        # freed when no other references hold the handler
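The effect of the weak registry is easy to confirm. In this sketch, Handler is a hypothetical reduced version of the class above. The entry disappears as soon as the last strong reference goes away, which on CPython happens immediately through reference counting.

```python
import weakref


class Handler:
    registry = weakref.WeakValueDictionary()

    def __init__(self, request_id):
        self.request_id = request_id
        Handler.registry[request_id] = self  # weak: does not keep self alive


h = Handler("req-1")
alive = len(Handler.registry)

del h  # last strong reference gone
after_del = len(Handler.registry)

print(alive, after_del)
```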
Pattern 2: Event Listeners Not Removed
class EventBus:
    def __init__(self):
        self._listeners = {}  # event_type → list of listeners

    def on(self, event_type, listener):
        self._listeners.setdefault(event_type, []).append(listener)

    def emit(self, event_type, data):
        for listener in self._listeners.get(event_type, []):
            listener(data)

bus = EventBus()

class Widget:
    def __init__(self, name):
        self.name = name
        bus.on("click", self.handle_click)  # bus holds strong ref to self.handle_click

    def handle_click(self, data):
        print(f"{self.name} clicked: {data}")

# Widget instances can never be GC'd as long as bus exists
# Each Widget's bound method holds a reference to the Widget instance
# bus._listeners["click"] → [Widget.handle_click] → Widget
w1 = Widget("button1")
del w1  # does NOT free w1 - bus still holds a reference via bound method

# Fix: use weakref for listener storage
import inspect
import weakref

class EventBus:
    def __init__(self):
        self._listeners: dict[str, list[weakref.ref]] = {}

    def on(self, event_type, listener):
        def remove_dead(dead_ref):
            self._remove_dead(event_type, dead_ref)
        # A plain weakref.ref to a bound method dies at once, because the bound-method
        # object is a temporary. WeakMethod follows the lifetime of the instance instead.
        if inspect.ismethod(listener):
            ref = weakref.WeakMethod(listener, remove_dead)
        else:
            ref = weakref.ref(listener, remove_dead)
        self._listeners.setdefault(event_type, []).append(ref)

    def _remove_dead(self, event_type, ref):
        if event_type in self._listeners:
            self._listeners[event_type] = [
                r for r in self._listeners[event_type] if r is not ref
            ]

    def emit(self, event_type, data):
        for ref in list(self._listeners.get(event_type, [])):
            listener = ref()
            if listener is not None:
                listener(data)
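One detail matters when storing listeners weakly: a bound method such as widget.handle_click is a fresh temporary object on every attribute access, so a plain weakref.ref to it is dead almost immediately on CPython. weakref.WeakMethod exists for this case. The sketch below uses a hypothetical Widget class to contrast the two.

```python
import weakref


class Widget:
    def handle_click(self, data):
        return f"clicked: {data}"


w = Widget()

# Plain weak reference to a bound method: the temporary method object is
# freed as soon as this statement finishes
plain = weakref.ref(w.handle_click)
plain_alive = plain() is not None

# WeakMethod stays valid for as long as the Widget instance is alive
method_ref = weakref.WeakMethod(w.handle_click)
result = method_ref()("ok")

del w
dead = method_ref() is None  # the instance is gone, so the reference is too

print(plain_alive, result, dead)
```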
Pattern 3: Django QuerySet Caching
# Dangerous in class scope or module scope
class ReportGenerator:
    # The QuerySet object is created once, at class definition time.
    # It is lazy, so no query runs yet. The first iteration evaluates it and stores
    # every row in the QuerySet's result cache, which then lives as long as the class.
    all_products = Product.objects.filter(active=True)

    def generate(self):
        # First call loads and caches all rows; later calls reuse the (possibly stale) cache
        return [format_row(p) for p in self.all_products]

# Fix: never keep an evaluated QuerySet on a class attribute; build it where it is used
class ReportGenerator:
    def generate(self):
        # Fresh QuerySet each time - evaluated, used, discarded
        products = Product.objects.filter(active=True)
        return [format_row(p) for p in products]  # result cache is freed after the function returns
Pattern 4: Large Objects in Closures
def make_processor(large_df):
    """large_df: a pandas DataFrame with millions of rows."""
    def process(key):
        return large_df.loc[key]  # closure captures the entire DataFrame
    return process  # large_df is held in process.__closure__

processor = make_processor(load_huge_dataframe())
# large_df cannot be GC'd - processor.__closure__[0].cell_contents IS the DataFrame

# Fix: extract only what you need before closing over it
def make_processor(large_df):
    lookup = large_df.set_index("key")["value"].to_dict()  # small dict
    large_df = None  # explicitly clear the reference before closure is returned
    def process(key):
        return lookup.get(key)  # captures small dict, not the DataFrame
    return process
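What a closure actually retains can be inspected without pandas. This sketch replaces the DataFrame with a list of dictionaries (the rows data and its field names are invented for illustration) and reads the captured variables from __code__.co_freevars and __closure__.

```python
def make_processor(rows):
    # Build the small lookup first; only this name is used by the inner function
    lookup = {row["key"]: row["value"] for row in rows}

    def process(key):
        return lookup.get(key)

    return process


rows = [{"key": f"k{i}", "value": i, "payload": "x" * 1000} for i in range(100)]
process = make_processor(rows)

captured = process.__code__.co_freevars  # names the closure keeps alive
cell_types = [type(cell.cell_contents).__name__ for cell in process.__closure__]

print(captured, cell_types, process("k7"))
```

Only variables that the inner function references become closure cells, so rows is not retained by process once make_processor returns.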
Part 9 - tracemalloc Preview
The tracemalloc module (covered fully in Lesson 07) tracks memory allocations at the Python level, enabling precise leak diagnosis:
import tracemalloc
import gc
# Start tracing memory allocations
tracemalloc.start()
# Take a snapshot before the suspect operation
snapshot1 = tracemalloc.take_snapshot()
# Run the suspect code
for _ in range(1000):
    create_some_objects()  # might leak
gc.collect()  # force GC so only true leaks remain
# Take a snapshot after
snapshot2 = tracemalloc.take_snapshot()
# Compare - what grew?
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Top memory growth:")
for stat in top_stats[:10]:
    print(f"  {stat}")
tracemalloc.stop()
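The output ranks source lines by how much still-live memory they allocated between the two snapshots. That tells you where the leaked objects were created, which is usually the right place to start looking for whatever keeps them referenced.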
Common Mistakes
Mistake 1 - Not Collecting Before Checking gc.garbage
# Wrong: gc.garbage may not be populated until you collect
import gc
print(gc.garbage) # [] - but there may be cyclic garbage waiting
# Right: force collection first
gc.collect()
if gc.garbage:
    print(f"Uncollectable objects: {len(gc.garbage)}")
Mistake 2 - Disabling GC Without Understanding Object Graphs
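# Wrong: disable GC without verifying no cycles exist
gc.disable()
run_web_server()  # Django ORM creates cycles - memory grows unboundedly
# Right: profile first, then decide
import gc
import tracemalloc
tracemalloc.start()
gc.disable()
run_for_100_requests()
before, _ = tracemalloc.get_traced_memory()
unreachable = gc.collect()  # number of unreachable objects found in cycles
after, _ = tracemalloc.get_traced_memory()
print(f"gc.collect() found {unreachable} objects and released {before - after} bytes")
# If gc.collect() releases a significant amount of memory, the workload needs the GC
tracemalloc.stop()
gc.enable()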
Mistake 3 - Using gc.collect() as a Performance Fix
# Wrong: calling gc.collect() in a hot loop "just in case"
for record in dataset:
    process(record)
    gc.collect()  # a full collection traverses every tracked object, on every iteration
# Right: let the GC work automatically; call gc.collect() only when necessary
for record in dataset:
    process(record)
gc.collect()  # once, at the end, if needed
Mistake 4 - Forgetting to Register tp_traverse in C Extensions
When writing a C extension that contains a container (a C struct that holds PyObject* pointers), failing to implement tp_traverse means the GC cannot see the objects it holds. They appear to have no references and are never reached by the cycle detector - memory leak:
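// Wrong: no tp_traverse - GC can't see the contained objects
static PyTypeObject MyType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "MyContainer",
    // tp_traverse = NULL  <- GC cannot traverse this type!
};

// Right: implement tp_traverse, plus tp_clear so the GC can break cycles
static int my_traverse(MyObject *self, visitproc visit, void *arg) {
    Py_VISIT(self->contained_obj);  // tell GC about every held reference
    return 0;
}

static int my_clear(MyObject *self) {
    Py_CLEAR(self->contained_obj);  // drop held references when asked to break a cycle
    return 0;
}

static PyTypeObject MyType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "MyContainer",
    .tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC,  // must also set this flag
    .tp_traverse = (traverseproc)my_traverse,
    .tp_clear = (inquiry)my_clear,
};
// Instances must also be allocated with PyObject_GC_New and untracked in tp_dealloc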
Graded Practice Challenges
Level 1 - Predict the Output
Question 1: What does this print?
import gc
print(gc.get_threshold())
Show Answer
Output: (700, 10, 10)
These are CPython's default collection thresholds. Gen 0 is collected when the count of tracked objects (allocations minus deallocations since last collection) exceeds 700. Gen 1 is collected every 10th gen 0 collection. Gen 2 every 10th gen 1 collection.
Question 2: What does this print?
import gc
class WithDel:
    def __del__(self):
        print("deleted")
a = WithDel()
b = WithDel()
a.ref = b
b.ref = a
del a, b
print("before collect")
gc.collect()
print("after collect")
Show Answer
Output (Python 3.4+):
before collect
deleted
deleted
after collect
In Python 3.4+ (PEP 442), __del__ is called safely even for objects in cycles. The GC identifies the unreachable cycle, calls __del__ on each object, then frees them. "deleted" appears twice, both between "before collect" and "after collect" - the deletions happen during gc.collect(), not before it.
Note: the order of the two "deleted" lines is not guaranteed.
Question 3: Which of these objects does the cyclic GC track?
import gc
a = 42
b = "hello"
c = [1, 2, 3]
d = (1, 2, 3)
e = {"x": 1}
f = {1, 2, 3}
# Which of a, b, c, d, e, f are tracked by the GC?
Show Answer
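Tracked: c (list), f (set)
Not tracked: a (int), b (str)
Version-dependent: d (tuple) and e (dict), both of which hold only atomic values
Integers and strings are immutable scalars - they cannot form cycles. Lists and sets are always tracked. Tuples and dicts are tracked only while they might hold other containers: a tuple of ints/strs may be tracked when first created but is untracked once a collection has examined it, and the gc module documentation's own example shows gc.is_tracked({"a": 1}) returning False, because a dict with only atomic keys and values cannot be part of a cycle. Store a list inside either one and it is tracked again.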
You can verify: gc.is_tracked(c) returns True, gc.is_tracked(a) returns False.
Question 4: What does this print?
import gc
gc.disable()
class Node:
    def __init__(self):
        self.next = None
a = Node()
b = Node()
a.next = b
b.next = a
del a, b
print(gc.get_count()) # ?
count = gc.collect()
print(count)
print(gc.get_count()) # ?
Show Answer
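Output (shape only; the tuples vary between runs and versions):
(N, c1, c2)
4
(0, 0, 0)
gc.get_count() does not report how many objects sit in each generation. Its first element counts tracked allocations minus deallocations since the last gen 0 collection, so with the GC disabled it keeps growing and reflects everything allocated so far, not only the objects in this cycle. gc.collect() then runs a full collection and returns the number of unreachable objects it found: the two Node instances and their two __dict__ objects on CPython 3.10 and earlier, typically just the two instances (so 2) on 3.11 and later. A full collection resets the counters, so the final tuple is at or near (0, 0, 0).
The first tuple depends on how much was allocated before the test ran.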
Question 5: Why does this code have a memory leak, and how do you fix it?
class EventEmitter:
    def __init__(self):
        self._listeners = []

    def add_listener(self, fn):
        self._listeners.append(fn)

    def emit(self, data):
        for fn in self._listeners:
            fn(data)

emitter = EventEmitter()
for _ in range(10000):
    class Handler:
        def on_event(self, data): pass
    h = Handler()
    emitter.add_listener(h.on_event)  # bound method holds strong ref to h
Show Answer
The leak: h.on_event is a bound method. Bound methods hold a strong reference to self (the Handler instance). emitter._listeners holds strong references to all 10,000 bound methods. Each bound method keeps its Handler instance alive. Even when the loop variables go out of scope, the 10,000 Handler instances are retained via the listener list.
Fix using weakref:
import weakref

class EventEmitter:
    def __init__(self):
        self._listeners = []

    def add_listener(self, fn):
        # For bound methods: use WeakMethod
        self._listeners.append(weakref.WeakMethod(fn))

    def emit(self, data):
        live = []
        for ref in self._listeners:
            fn = ref()
            if fn is not None:
                fn(data)
                live.append(ref)
        self._listeners = live  # prune dead refs

emitter = EventEmitter()
for _ in range(10000):
    class Handler:
        def on_event(self, data): pass
    h = Handler()
    emitter.add_listener(h.on_event)
# Each Handler is freed as soon as 'h' is rebound on the next iteration;
# the dead references left in _listeners are pruned by the next emit()
weakref.WeakMethod creates a weak reference to a bound method - the method's self object can be GC'd when no other strong references exist.
Level 2 - Debug Challenge
Find and explain all memory issues:
import gc

# Issue 1: GC disabled globally
gc.disable()

class RequestContext:
    _active = {}

    def __init__(self, request_id):
        self.request_id = request_id
        RequestContext._active[request_id] = self  # strong reference in class dict

    def cleanup(self):
        del RequestContext._active[self.request_id]

# Issue 2: Cycle in callback
class DataPipeline:
    def __init__(self):
        self.transformers = []
        self.on_complete = None

    def add_transformer(self, fn):
        self.transformers.append(fn)

pipeline = DataPipeline()

def completion_handler(result):
    pipeline.on_complete = None  # references pipeline - creates cycle

pipeline.on_complete = completion_handler  # pipeline → on_complete → handler → pipeline

# Issue 3: Leaking __del__ objects in Python < 3.4 context
class LegacyResource:
    def __del__(self):
        pass  # has __del__

a = LegacyResource()
b = LegacyResource()
a.ref = b
b.ref = a  # cycle + __del__ - uncollectable in Python < 3.4
del a, b
Show Solution
Issue 1 - GC disabled with accumulating cyclic garbage:
RequestContext._active is a class-level strong-reference dict. If cleanup() is not called (e.g., on exceptions), instances accumulate forever. With GC disabled, any cycles they form are never collected.
Fix:
import gc
import weakref

gc.enable()  # re-enable GC

class RequestContext:
    _active = weakref.WeakValueDictionary()  # instances freed when no other refs exist

    def __init__(self, request_id):
        self.request_id = request_id
        RequestContext._active[request_id] = self
Issue 2 - Cycle via callback:
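pipeline → on_complete → completion_handler → pipeline. If this code ran inside a function, completion_handler would capture pipeline in a closure cell, and the resulting cycle would keep both objects alive until the cyclic GC runs (which never happens here, because Issue 1 disabled it). At module level, as written, the handler reaches pipeline through its module globals instead, so the pair stays alive for as long as the name pipeline remains bound in the module.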
Fix:
import weakref
pipeline_ref = weakref.ref(pipeline)
def completion_handler(result):
    p = pipeline_ref()
    if p is not None:
        p.on_complete = None  # no cycle - weakref doesn't count
pipeline.on_complete = completion_handler
Issue 3 - __del__ in cycle (Python < 3.4 concern):
In Python < 3.4, LegacyResource instances in cycles with __del__ are uncollectable (gc.garbage fills up). In Python 3.4+, PEP 442 handles this correctly.
Fix for cross-version compatibility:
import weakref

class LegacyResource:
    def __init__(self):
        # Use weakref.finalize instead of __del__
        self._finalizer = weakref.finalize(self, self.__class__._cleanup)

    @staticmethod
    def _cleanup():
        pass  # cleanup logic here - not a method, so no strong ref to self
Level 3 - Design Challenge
Design an ObjectTracker utility that:
- Tracks all live instances of a given class using weak references
- Provides count() - number of currently alive instances
- Provides all() - list of all currently alive instances
- Provides diagnose() - for each alive instance, show its gc.get_referrers() summary
- Does NOT prevent GC of tracked instances
- Works as a class decorator
Show Reference Solution
import gc
import weakref
from typing import Type, TypeVar

T = TypeVar("T")

class ObjectTracker:
    """
    Class decorator that tracks live instances without preventing GC.

    Usage:
        @ObjectTracker
        class MyService:
            def __init__(self, name):
                self.name = name

        s1 = MyService("auth")
        s2 = MyService("db")
        print(MyService.tracker.count())  # 2
        print(MyService.tracker.all())    # the two live MyService instances
        del s1
        print(MyService.tracker.count())  # 1
    """

    def __init__(self, cls: Type[T]):
        self._cls = cls
        self._instances: list[weakref.ref] = []
        self._original_init = cls.__init__

        # Patch __init__ to register each new instance
        tracker = self  # capture self for the closure

        def patched_init(instance, *args, **kwargs):
            tracker._original_init(instance, *args, **kwargs)
            ref = weakref.ref(instance, tracker._on_finalize)
            tracker._instances.append(ref)

        cls.__init__ = patched_init
        cls.tracker = self  # attach tracker to the class
        # The decorated name is bound to this tracker object, not to the class,
        # so expose .tracker here too; otherwise Service.tracker raises AttributeError
        self.tracker = self

        # Preserve the class identity (name, module, docstring)
        self.__name__ = cls.__name__
        self.__qualname__ = cls.__qualname__
        self.__doc__ = cls.__doc__
        self.__module__ = cls.__module__

    def __call__(self, *args, **kwargs):
        """Allow ObjectTracker to be used as the class itself."""
        return self._cls(*args, **kwargs)

    def _on_finalize(self, ref: weakref.ref) -> None:
        """Called by weakref when an instance is GC'd - prune dead refs."""
        self._instances = [r for r in self._instances if r() is not None]

    def _live_instances(self) -> list:
        """Return all instances that are still alive."""
        live = []
        for ref in self._instances:
            obj = ref()
            if obj is not None:
                live.append(obj)
        return live

    def count(self) -> int:
        """Number of currently live instances."""
        return len(self._live_instances())

    def all(self) -> list:
        """All currently live instances."""
        return self._live_instances()

    def diagnose(self) -> None:
        """Print referrer summary for each live instance."""
        instances = self._live_instances()
        print(f"\nObjectTracker.diagnose() - {self.__name__}: {len(instances)} live instances")
        for i, obj in enumerate(instances):
            referrers = gc.get_referrers(obj)
            # Filter out the diagnostic machinery itself
            referrers = [
                r for r in referrers
                if r is not instances and r is not self._instances
            ]
            print(f"\n  Instance {i}: {obj!r}")
            for r in referrers[:5]:  # show at most 5 referrers
                print(f"    Held by: {type(r).__name__}", end="")
                if isinstance(r, dict):
                    keys = [k for k, v in r.items() if v is obj]
                    if keys:
                        print(f" (keys: {keys})", end="")
                elif isinstance(r, list):
                    print(f" (len={len(r)})", end="")
                elif isinstance(r, type):
                    print(f" (class {r})", end="")
                print()
# Usage
@ObjectTracker
class Service:
    def __init__(self, name: str):
        self.name = name

    def __repr__(self):
        return f"Service({self.name!r})"

s1 = Service("auth")
s2 = Service("db")
s3 = Service("cache")
print(Service.tracker.count())  # 3
print(Service.tracker.all())    # [Service('auth'), Service('db'), Service('cache')]
del s2
import gc
gc.collect()
print(Service.tracker.count())  # 2
Service.tracker.diagnose()
# Shows referrers for the 2 live Service instances
Design decisions:
- Uses weakref.ref with a finalize callback to prune dead refs automatically - no manual cleanup needed
- __call__ is implemented so @ObjectTracker does not break Service(...) instantiation syntax
- diagnose() filters its own bookkeeping lists out of the gc.get_referrers() output - otherwise the temporary list of live instances would show up as a referrer of every instance, which is expected and unhelpful
- _on_finalize is a single cleanup callback shared by every weak reference - each call rebuilds the list in O(n), but pruning keeps the list compact
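One trade-off of the reference solution is that the decorated name is bound to the tracker object rather than to the class, so isinstance() checks against that name stop working. As a point of comparison, here is a minimal sketch of a different technique: a function decorator backed by weakref.WeakSet that returns the class itself. The names track_instances, live_instances, and Worker are hypothetical, and the sketch assumes instances are hashable (a class that defines __eq__ without __hash__ would not work) and that creation order does not matter.

```python
import functools
import weakref

def track_instances(cls):
    """Hypothetical decorator: tracks live instances and returns the class unchanged."""
    live = weakref.WeakSet()  # holds weak references only; dead entries drop out on their own
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        live.add(self)  # register after successful initialisation

    cls.__init__ = patched_init
    cls.live_instances = live
    return cls

@track_instances
class Worker:
    def __init__(self, name):
        self.name = name

w1, w2 = Worker("a"), Worker("b")
print(len(Worker.live_instances))  # 2
del w1
print(len(Worker.live_instances))  # 1 on CPython, where refcounting frees w1 immediately
print(isinstance(w2, Worker))      # True: Worker is still the real class
```

The WeakSet removes the need for a pruning callback, at the cost of losing the ordered list that all() returns in the reference solution.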
Key Takeaways
- CPython's cyclic garbage collector exists because reference counting cannot collect reference cycles - objects that reference each other in a loop will never have ob_refcnt reach zero
- The GC is generational: three generations (0, 1, 2) with thresholds (700, 10, 10) by default - most objects die in gen 0, long-lived objects migrate to gen 2
- The cycle detection algorithm copies refcounts, subtracts internal references via tp_traverse, and identifies objects with zero remaining external references - only container objects (lists, dicts, sets, instances) are tracked
- gc.collect() returns the count of all objects freed including __dict__ instances - this is why collecting two Node objects in a cycle returns 4, not 2
- gc.disable() is legitimate for performance-critical batch workloads with no cycles, but causes unbounded memory growth if your code creates cyclic garbage
- gc.freeze() (Python 3.7+) protects long-lived gen 2 objects from being traversed in forked workers, preventing copy-on-write page faults - used by Instagram, Gunicorn
- PEP 442 (Python 3.4+) made __del__ safe for objects in cycles - before 3.4, such objects were uncollectable and accumulated in gc.garbage
- gc.get_referrers(obj) is the primary tool for diagnosing unexpected memory retention - it shows every GC-tracked object that holds a reference to obj
- Common memory leak patterns: class-level containers, event listeners not removed, evaluated QuerySets cached on long-lived objects, large objects captured in closures
- Atomic immutable objects (int, str) cannot form cycles and are never tracked by the cyclic GC - reference counting alone handles them with zero GC overhead; a tuple that contains only such objects cannot form a cycle either, and the GC can stop tracking it
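Several of these takeaways can be spot-checked from a REPL. The following is a minimal sketch using only documented gc functions; the Node class is a throwaway example, and because threshold values and the exact collect() count can differ between CPython versions, the sketch prints the thresholds without asserting them and checks only a lower bound for the count.

```python
import gc

# Collection thresholds for generations 0, 1 and 2 (version-dependent, so just inspect them)
print(gc.get_threshold())

# Atomic immutables are not tracked; containers that can hold references are
print(gc.is_tracked(42))          # False
print(gc.is_tracked("text"))      # False
print(gc.is_tracked([]))          # True
print(gc.is_tracked({"k": []}))   # True

class Node:
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other, b.other = b, a   # reference cycle that refcounting alone cannot free
del a, b
found = gc.collect()      # number of unreachable objects found by this collection
print(found >= 2)         # True: at least the two Node instances
```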
What's Next
Lesson 07 covers tracemalloc and memory profiling - the complete toolkit for finding memory leaks in production Python. You will learn to take allocation snapshots, diff them to identify growing allocations, trace leaks to their source line, and interpret tracemalloc output to fix real memory issues in Django, FastAPI, and data pipeline applications.
