Disassembly with dis - Reading CPython Bytecode
Reading time: ~40 minutes | Level: Intermediate → Engineering
Before reading further, predict what dis.dis(mystery) will print, and what mystery(3) and mystery(-2) will return:
import dis
def mystery(x):
    return x * 2 if x > 0 else -x
dis.dis(mystery)
Write out the opcode sequence you expect to see, in order.
Many readers get at least one detail wrong without having read bytecode before. The actual dis.dis output (Python 3.12; offsets and raw argument values differ between versions):
  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (x)
              4 LOAD_CONST               1 (0)
              6 COMPARE_OP              68 (>)
             10 POP_JUMP_IF_FALSE        5 (to 22)
             12 LOAD_FAST                0 (x)
             14 LOAD_CONST               2 (2)
             16 BINARY_OP                5 (*)
             20 RETURN_VALUE
        >>   22 LOAD_FAST                0 (x)
             24 UNARY_NEGATIVE
             26 RETURN_VALUE
mystery(3) returns 6. mystery(-2) returns 2.
Three things are worth noticing here. First: the condition x > 0 is compiled before x * 2, even though x * 2 comes first in the source - bytecode follows evaluation order, not reading order. Second: only one branch ever runs. When the condition is false, POP_JUMP_IF_FALSE skips straight to the else branch at the jump target, so x * 2 is never computed for negative inputs. Third: there are two RETURN_VALUE instructions - CPython generates one per branch here rather than one shared exit point.
Once you understand why, you understand something about how CPython's compiler works. That is what this lesson builds.
What You Will Learn
- The dis.dis(), dis.disassemble(), and dis.get_instructions() API
- How to read disassembly output: offset, line number, opcode name, argument, comment
- The key opcodes and what they do on the value stack
- Value stack evolution step by step through a real example
- How equivalent Python patterns compile differently (or identically)
- Practical performance insights from bytecode comparison
- The dis.Bytecode object for structured programmatic access
Prerequisites
- Lesson 01: CPython Architecture (the eval loop, the value stack)
- Lesson 02: Bytecode Inspection (the code object and its attributes)
Part 1 - The dis Module API
dis.dis() - Human-Readable Disassembly
dis.dis() disassembles a function, method, class, module, string of source, or bytes-like object and prints the result:
import dis
def add(a, b):
    return a + b
dis.dis(add)
Output:
  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE
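When you want to inspect or compare disassembly in a script rather than read it on the terminal, the output can be captured as a string. The sketch below relies on the file parameter of dis.dis(); the helper name disassembly_text is hypothetical, chosen only for this illustration.

```python
import dis
import io

def add(a, b):
    return a + b

def disassembly_text(func):
    """Return the text that dis.dis(func) would normally print."""
    buffer = io.StringIO()
    dis.dis(func, file=buffer)  # redirect the listing into the buffer
    return buffer.getvalue()

text = disassembly_text(add)
print(text)
```

The exact listing depends on the interpreter version, so tools should search the text for opcode names rather than compare it against a fixed string.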
Reading the Output Format
Each line has up to five fields:
  3           2 LOAD_FAST                0 (a)
  ^           ^ ^                        ^ ^
  |           | |                        | +-- human-readable comment (variable name, const value)
  |           | |                        +---- opcode argument (index or value)
  |           | +----------------------------- opcode name
  |           +------------------------------- bytecode offset (in bytes from start of co_code)
  +------------------------------------------- source line number (only shown at first instruction per line)
There is also an optional >> prefix indicating a jump target, for example:
        >>   22 LOAD_FAST                0 (x)
        ^^   ^^
        ||   ++--- bytecode offset
        ++-------- ">>" marks this instruction as a jump target
No line number is shown here because the instruction belongs to the same source line as the previous one.
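The offset is the byte position of this instruction within co_code. Since Python 3.6, every instruction is 2 bytes wide (opcode byte + argument byte), so offsets are always even. From Python 3.11 onwards, some instructions are followed by inline CACHE entries that dis hides by default, which is why the offset can advance by more than 2 between consecutive lines (for example, after BINARY_OP).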
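To see the offset arithmetic for yourself, the following sketch lists each visible instruction together with the distance to the next one. The helper name offset_steps is hypothetical; the gaps it reports depend on the interpreter version, so no particular numbers are assumed.

```python
import dis

def add(a, b):
    return a + b

def offset_steps(func):
    """Return (offset, opname, gap_to_next_offset) for each instruction except the last."""
    instrs = list(dis.get_instructions(func))
    steps = []
    for current, following in zip(instrs, instrs[1:]):
        steps.append((current.offset, current.opname, following.offset - current.offset))
    return steps

for offset, opname, gap in offset_steps(add):
    print(f"{offset:4d} {opname:<20s} next instruction is {gap} bytes later")
```

A gap larger than 2 indicates that hidden cache entries follow the instruction; on Python 3.11+ passing show_caches=True to dis.dis() makes them visible.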
dis.get_instructions() - Structured Access
dis.get_instructions() returns an iterator of dis.Instruction named tuples - useful when writing tools that process bytecode programmatically:
import dis

def greet(name):
    return f"Hello, {name}"

for instr in dis.get_instructions(greet):
    print(
        f"{instr.offset:4d} {instr.opname:<20s} "
        f"arg={instr.argval!r:20} "
        f"line={instr.starts_line}"
    )
dis.Instruction fields:
- opname: the instruction name as a string
- opcode: the numeric opcode
- arg: the raw argument integer
- argval: the resolved argument (e.g., the actual variable name instead of the index)
- argrepr: human-readable representation
- offset: byte offset in co_code
- starts_line: source line number if this instruction starts a new line, else None
- is_jump_target: True if this instruction is a jump target
dis.Bytecode - Object-Oriented Interface
dis.Bytecode wraps a function, method, code object, or source string and provides iterable and printable access to its bytecode:
import dis

def compute(x, y):
    return x ** 2 + y ** 2

bc = dis.Bytecode(compute)
print(bc.info())  # summary of the code object
print(bc.dis())   # same as dis.dis() but returned as a string
for instr in bc:
    if instr.opname == "BINARY_OP":
        print(f"Binary operation: {instr.argrepr} at offset {instr.offset}")
Part 2 - Key Opcodes Explained
Load and Store Opcodes
These opcodes move values between the frame's storage and the value stack:
| Opcode | What it does | Speed |
|---|---|---|
| LOAD_FAST | Push a local variable (from co_varnames) onto the stack | Fast - direct array index |
| STORE_FAST | Pop from stack; store in local variable array | Fast - direct array index |
| LOAD_GLOBAL | Push a global name (look up in f_globals then f_builtins) | Slower - dict lookup |
| STORE_GLOBAL | Pop from stack; store in f_globals | Slower - dict store |
| LOAD_CONST | Push a constant (from co_consts) | Fast - direct array index |
| LOAD_DEREF | Push a value from a cell (closure variable) | Medium - cell dereference |
| STORE_DEREF | Pop from stack; store in a cell | Medium - cell dereference |
| LOAD_ATTR | Pop object; push getattr(object, name) | Slowest - attribute lookup |
:::tip LOAD_FAST Is Faster Than LOAD_GLOBAL
LOAD_FAST is an indexed access into the frame's local variable array - essentially fastlocals[i]. LOAD_GLOBAL involves a dictionary hash lookup in f_globals, then potentially another in f_builtins.
For extremely hot loops that access a global function many thousands of times, pulling it into a local variable makes a measurable difference:
import math

def hot_loop_global(data):
    return [math.sqrt(x) for x in data]  # LOAD_GLOBAL math, LOAD_ATTR sqrt each iteration

def hot_loop_local(data):
    _sqrt = math.sqrt                    # one LOAD_GLOBAL + LOAD_ATTR, then STORE_FAST
    return [_sqrt(x) for x in data]      # LOAD_FAST _sqrt each iteration (LOAD_DEREF before Python 3.12)

# Profile before optimising - only do this in measured hot paths
This is a micro-optimisation. Apply it only after profiling confirms it is the bottleneck.
:::
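Whether the local-variable trick pays off is an empirical question. The sketch below times both styles with timeit; the function names sqrt_via_global and sqrt_via_local and the data size are illustrative assumptions, and the measured difference will vary by machine and Python version.

```python
import math
import timeit

def sqrt_via_global(data):
    result = []
    for x in data:
        result.append(math.sqrt(x))  # global lookup + attribute lookup per item
    return result

def sqrt_via_local(data):
    sqrt = math.sqrt                 # looked up once, then read as a local
    result = []
    for x in data:
        result.append(sqrt(x))
    return result

data = list(range(10_000))
t_global = timeit.timeit(lambda: sqrt_via_global(data), number=50)
t_local = timeit.timeit(lambda: sqrt_via_local(data), number=50)
print(f"global: {t_global:.4f}s  local: {t_local:.4f}s")
```

Run it several times before drawing conclusions; on interpreters with the specialising adaptive interpreter the gap may be smaller than the bytecode alone suggests.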
Function Call Opcodes
In Python 3.10 and earlier:
- CALL_FUNCTION n - call a function with n positional arguments
- CALL_FUNCTION_KW n - call with keyword arguments
- CALL_FUNCTION_EX - call with *args and **kwargs unpacking
In Python 3.11+:
- PUSH_NULL - push a NULL marker for the call protocol
- CALL n - unified call instruction replacing CALL_FUNCTION and CALL_FUNCTION_KW (CALL_FUNCTION_EX remains for *args and **kwargs unpacking)
- PRECALL n - setup before CALL (3.11 only, removed in 3.12)
import dis

def caller():
    return len([1, 2, 3])

dis.dis(caller)
# Python 3.12 output:
#   LOAD_GLOBAL   1 (NULL + len)   -- pushes the NULL marker and len
#   BUILD_LIST    0                -- builds []
#   LOAD_CONST    1 ((1, 2, 3))
#   LIST_EXTEND   1                -- extends the list with the constant tuple
#   CALL          1                -- calls len([1, 2, 3])
#   RETURN_VALUE
Build Opcodes
These create new collection objects:
| Opcode | What it builds |
|---|---|
| BUILD_LIST n | Pop n items from stack; push a list |
| BUILD_TUPLE n | Pop n items from stack; push a tuple |
| BUILD_SET n | Pop n items from stack; push a set |
| BUILD_MAP n | Pop 2n items (alternating key, value); push a dict |
| BUILD_STRING n | Pop n strings; concatenate; push result |
| FORMAT_VALUE | Format a value for an f-string (with optional format spec) |
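import dis

def build_examples():
    a = [1, 2, 3]
    b = (4, 5)
    c = {"x": 1}

dis.dis(build_examples)
# Python 3.12 (abridged):
#   BUILD_LIST    0                -- pushes an empty list
#   LOAD_CONST    1 ((1, 2, 3))    -- all-constant elements are stored as one tuple
#   LIST_EXTEND   1                -- extends the list in place: [1, 2, 3]
#   STORE_FAST    0 (a)
#   LOAD_CONST    2 ((4, 5))       -- a constant tuple needs no BUILD_TUPLE
#   STORE_FAST    1 (b)
#   LOAD_CONST    3 ('x')
#   LOAD_CONST    4 (1)
#   BUILD_MAP     1                -- pops one key/value pair, pushes {"x": 1}
#   STORE_FAST    2 (c)
#   ...
# BUILD_LIST n with n > 0 appears when the elements are not all constants, e.g. [p, q, r].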
Iteration Opcodes
import dis

def sum_squares(items):
    total = 0
    for x in items:
        total += x * x
    return total

dis.dis(sum_squares)
Key iteration opcodes:
- GET_ITER - calls iter() on the top-of-stack object; pushes the iterator
- FOR_ITER n - calls next() on the iterator; if the iterator is exhausted (StopIteration), jumps forward past the loop body to the target that dis displays; otherwise pushes the value and continues
- JUMP_BACKWARD - at the end of the loop body, an unconditional jump returns to FOR_ITER
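The loop structure described above can be pulled out of the instruction stream programmatically. This is a minimal sketch: the helper name loop_skeleton is hypothetical, and it treats any jump whose target lies before the jump itself as a loop back-edge (which also covers JUMP_ABSOLUTE on older interpreters).

```python
import dis

def sum_squares(items):
    total = 0
    for x in items:
        total += x * x
    return total

def loop_skeleton(func):
    """Return (offset, opname, jump target or None) for GET_ITER, FOR_ITER and backward jumps."""
    skeleton = []
    for instr in dis.get_instructions(func):
        is_backward_jump = (
            "JUMP" in instr.opname
            and isinstance(instr.argval, int)
            and instr.argval < instr.offset
        )
        if instr.opname in ("GET_ITER", "FOR_ITER") or is_backward_jump:
            target = None if instr.opname == "GET_ITER" else instr.argval
            skeleton.append((instr.offset, instr.opname, target))
    return skeleton

for offset, opname, target in loop_skeleton(sum_squares):
    print(f"{offset:4d} {opname:<16s} target={target}")
```

The FOR_ITER target is the first instruction after the loop, and the backward jump's target is the FOR_ITER itself.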
Arithmetic and Comparison Opcodes
Since Python 3.11, arithmetic uses a single BINARY_OP instruction with an argument encoding the operation:
| BINARY_OP argument | Operation |
|---|---|
| 0 | + |
| 1 | & |
| 2 | // |
| 3 | << |
| 4 | @ (matmul) |
| 5 | * |
| 6 | % |
| 7 | \| |
| 8 | ** |
| 9 | >> |
| 10 | - |
| 11 | / |
| 12 | ^ |
| 13 | += (in-place) |
| ... | (in-place variants) |
COMPARE_OP handles the rich comparison operators (<, <=, ==, !=, >, >=). Since Python 3.9, in and not in compile to CONTAINS_OP, and is and is not compile to IS_OP.
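A quick way to check which opcode a given operator uses on your interpreter is to compile a small expression and look for the comparison instruction. This sketch assumes Python 3.9 or later; the helper name comparison_opcode is hypothetical.

```python
import dis

def comparison_opcode(expression):
    """Compile an expression such as 'a < b' and return the opcode that performs the comparison."""
    comparison_ops = ("COMPARE_OP", "CONTAINS_OP", "IS_OP")
    code = compile(expression, "<example>", "eval")
    for instr in dis.get_instructions(code):
        if instr.opname in comparison_ops:
            return instr.opname
    return None

for operator in ("<", "==", "in", "not in", "is", "is not"):
    print(f"a {operator} b -> {comparison_opcode(f'a {operator} b')}")
```

The expressions are only compiled, never evaluated, so the names a and b do not need to exist.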
Jump Opcodes
| Opcode | Behaviour |
|---|---|
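| POP_JUMP_IF_FALSE n | Pop; if falsy, jump to the target (dis shows the resolved offset as "to N") |
| POP_JUMP_IF_TRUE n | Pop; if truthy, jump to the target |
| JUMP_FORWARD n | Unconditional jump forward by a relative offset n |
| JUMP_BACKWARD n | Unconditional jump backward (for loops) |
| JUMP_IF_FALSE_OR_POP | Short-circuit and (Python 3.11 and earlier; removed in 3.12) |
| JUMP_IF_TRUE_OR_POP | Short-circuit or (Python 3.11 and earlier; removed in 3.12) |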
Part 3 - Value Stack Evolution
Step-by-Step Through a + b
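The value stack is a LIFO stack of PyObject * pointers within the frame. Here is how LOAD_FAST a → LOAD_FAST b → BINARY_OP + evolves the stack for the call add(2, 3):
LOAD_FAST    0 (a)    # push a=2             stack: [2]
LOAD_FAST    1 (b)    # push b=3             stack: [2, 3]
BINARY_OP    0 (+)    # pop 3,2; push 5      stack: [5]
RETURN_VALUE          # pop 5; return 5      stack: []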
Every opcode either pushes, pops, or both. The compiler statically computes the maximum stack depth (co_stacksize) so CPython can allocate the right amount of space in the frame.
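The push/pop bookkeeping can be reproduced with dis.stack_effect(), which reports the net stack change of a single instruction. The sketch below is deliberately limited to straight-line code (no jumps), where instructions run in listing order; the helper name stack_depths is hypothetical.

```python
import dis

def add(a, b):
    return a + b

def stack_depths(func):
    """Return (opname, depth_after) pairs; only valid for bytecode without jumps."""
    depth = 0
    trace = []
    for instr in dis.get_instructions(func):
        # instr.arg is None for opcodes that take no argument, as stack_effect expects
        depth += dis.stack_effect(instr.opcode, instr.arg)
        trace.append((instr.opname, depth))
    return trace

trace = stack_depths(add)
for opname, depth in trace:
    print(f"{opname:<20s} depth after = {depth}")
print("co_stacksize:", add.__code__.co_stacksize)
```

The largest depth seen in the trace should never exceed co_stacksize, and the depth returns to zero once the return instruction has popped its value.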
Following the Mystery Function
Let's trace mystery(3) step by step:
def mystery(x):
    return x * 2 if x > 0 else -x
dis output (Python 3.12), annotated with the stack after each instruction:
RESUME              0
LOAD_FAST           0 (x)   # push x=3                          stack: [3]
LOAD_CONST          1 (0)   # push 0                            stack: [3, 0]
COMPARE_OP         68 (>)   # pop 0,3; 3>0=True; push True      stack: [True]
POP_JUMP_IF_FALSE  to 22    # pop True; truthy so do NOT jump   stack: []
LOAD_FAST           0 (x)   # push x=3                          stack: [3]
LOAD_CONST          2 (2)   # push 2                            stack: [3, 2]
BINARY_OP           5 (*)   # pop 2,3; push 6                   stack: [6]
RETURN_VALUE                # pop 6; return 6
Now trace mystery(-2):
RESUME              0
LOAD_FAST           0 (x)   # push x=-2                         stack: [-2]
LOAD_CONST          1 (0)   # push 0                            stack: [-2, 0]
COMPARE_OP         68 (>)   # pop 0,-2; -2>0=False; push False  stack: [False]
POP_JUMP_IF_FALSE  to 22    # pop False; falsy so JUMP to 22    stack: []
# execution continues at the jump target; x * 2 is skipped entirely
>> 22:
LOAD_FAST           0 (x)   # push x=-2                         stack: [-2]
UNARY_NEGATIVE              # pop -2; push 2                    stack: [2]
RETURN_VALUE                # pop 2; return 2
Notice: the stack is empty at the moment of the jump. The condition is evaluated first, POP_JUMP_IF_FALSE consumes the result, and only then does exactly one branch run and leave its value on the stack for RETURN_VALUE. CPython never computes both branches of a conditional expression, and no stale value is left behind: the compiler ensures that every path reaching a given instruction arrives with the same stack depth, which is what makes it possible to compute co_stacksize statically.
:::note Opcodes Changed Significantly in Python 3.11+
Python 3.11 introduced the "specialising adaptive interpreter": after a function is called enough times, CPython replaces generic opcodes with specialised ones. LOAD_GLOBAL might become LOAD_GLOBAL_MODULE (bypassing the builtins lookup). BINARY_OP for two integers might become BINARY_OP_ADD_INT. These specialisations are invisible to your Python code but make it faster. By default, dis.dis() shows the original (non-specialised) opcodes; pass adaptive=True to see the specialised forms. Python 3.12 extended this further with more specialised opcodes.
:::
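To observe specialisation directly, compare the generic opcode names with the ones reported when adaptive=True is passed. This is a sketch under the assumption of Python 3.11 or later (on older versions it falls back to the generic listing); the names add_ints and opnames are hypothetical, and which specialised opcodes appear, if any, depends on the interpreter.

```python
import dis
import sys

def add_ints(a, b):
    return a + b

def opnames(func, adaptive=False):
    """List opcode names; adaptive=True asks dis for specialised forms (Python 3.11+)."""
    if adaptive and sys.version_info >= (3, 11):
        return [i.opname for i in dis.get_instructions(func, adaptive=True)]
    return [i.opname for i in dis.get_instructions(func)]

# Warm the function up so the interpreter has a chance to specialise it.
for _ in range(1000):
    add_ints(1, 2)

generic = opnames(add_ints)
specialised = opnames(add_ints, adaptive=True)
for before, after in zip(generic, specialised):
    print(f"{before:<20s} {after}")
```

Both listings have the same length because specialisation rewrites instructions in place rather than adding or removing them.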
Part 4 - Comparing Equivalent Python Patterns
x = x + 1 vs x += 1
import dis

def plus_assign(x):
    x = x + 1
    return x

def inplace(x):
    x += 1
    return x

print("=== x = x + 1 ===")
dis.dis(plus_assign)
print("=== x += 1 ===")
dis.dis(inplace)
Expected output pattern:
=== x = x + 1 ===
LOAD_FAST 0 (x)
LOAD_CONST 1 (1)
BINARY_OP 0 (+) # creates a new object
STORE_FAST 0 (x)
LOAD_FAST 0 (x)
RETURN_VALUE
=== x += 1 ===
LOAD_FAST 0 (x)
LOAD_CONST 1 (1)
BINARY_OP 13 (+=) # in-place if supported by the type
STORE_FAST 0 (x)
LOAD_FAST 0 (x)
RETURN_VALUE
For integers, += and + produce the same result because integers are immutable - there is no in-place operation. For mutable types like lists, += calls __iadd__ which modifies in place and is faster:
a = [1, 2, 3]
b = a
a += [4] # calls a.__iadd__([4]) - modifies a in place; b also sees the change
print(b) # [1, 2, 3, 4] - same object
a = [1, 2, 3]
b = a
a = a + [4] # creates a new list; a now points to new object; b unchanged
print(b) # [1, 2, 3]
List Comprehension vs for Loop
import dis

def list_comp(items):
    return [x * 2 for x in items]

def for_loop(items):
    result = []
    for x in items:
        result.append(x * 2)
    return result

dis.dis(list_comp)
The key difference depends on the Python version. In Python 3.11 and earlier, list_comp shows a MAKE_FUNCTION call - the comprehension compiles to a hidden code object that runs in its own scope. The outer function creates this inner function and calls it with the iterator produced by GET_ITER, and the inner code object uses LIST_APPEND to build the list. From Python 3.12 (PEP 709), the comprehension is inlined into the enclosing function: there is no MAKE_FUNCTION, and the BUILD_LIST / FOR_ITER / LIST_APPEND loop appears directly in list_comp's own bytecode.
The for_loop version uses explicit LOAD_ATTR (to get result.append) and CALL on every iteration.
In practice, list comprehensions are usually faster than equivalent for loops, for two related reasons:
- The LIST_APPEND opcode appends directly in C, with no method call
- The loop body performs no per-iteration LOAD_ATTR lookup of append
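Because a comprehension may live in a nested code object or be inlined into the enclosing function, depending on the Python version, a robust check has to look in both places. The sketch below walks co_consts recursively; the helper name all_opnames is hypothetical.

```python
import dis
import types

def list_comp(items):
    return [x * 2 for x in items]

def all_opnames(code):
    """Collect opcode names from a code object and any code objects nested in its constants."""
    names = [instr.opname for instr in dis.get_instructions(code)]
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names.extend(all_opnames(const))
    return names

has_nested_code = any(
    isinstance(const, types.CodeType) for const in list_comp.__code__.co_consts
)
print("comprehension compiled to a nested code object:", has_nested_code)
print("LIST_APPEND present:", "LIST_APPEND" in all_opnames(list_comp.__code__))
```

Whichever form your interpreter uses, LIST_APPEND should show up somewhere in the collected names.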
f-string vs str() vs .format()
import dis

name = "world"

def use_fstring():
    return f"Hello, {name}"

def use_str():
    return "Hello, " + str(name)

def use_format():
    return "Hello, {}".format(name)

dis.dis(use_fstring)
dis.dis(use_str)
dis.dis(use_format)
F-strings compile to:
- LOAD_CONST "Hello, " - load the literal part
- LOAD_GLOBAL name (or LOAD_FAST if local)
- FORMAT_VALUE - calls __format__ on the value
- BUILD_STRING n - concatenates n string pieces
"Hello, " + str(name) compiles to:
- LOAD_CONST "Hello, "
- LOAD_GLOBAL str - load the str type
- LOAD_GLOBAL name
- CALL 1 - call str(name)
- BINARY_OP + - concatenate the two strings
"...".format(name) compiles to:
- LOAD_CONST "Hello, {}" - load the format string
- LOAD_ATTR format - attribute lookup on the string
- LOAD_GLOBAL name
- CALL 1
F-strings are usually fastest for simple substitutions because FORMAT_VALUE + BUILD_STRING avoids both the name lookup for str or format and the overhead of a full CALL instruction. For complex cases (multiple conversions, nested expressions), the difference is negligible.
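The bytecode comparison predicts a ranking, but only a measurement confirms it on a given interpreter. This sketch checks that the three approaches produce the same string and then times them with timeit; the iteration count is an arbitrary assumption and the absolute numbers will differ between machines.

```python
import timeit

name = "world"

def use_fstring():
    return f"Hello, {name}"

def use_str():
    return "Hello, " + str(name)

def use_format():
    return "Hello, {}".format(name)

candidates = (use_fstring, use_str, use_format)
results = {func.__name__: func() for func in candidates}
timings = {func.__name__: timeit.timeit(func, number=200_000) for func in candidates}

# Print fastest first
for func_name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{func_name:<12s} {seconds:.4f}s")
```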
:::warning dis Output Varies Across Python Versions - Do Not Hardcode Opcode Numbers
Opcode names and numbers change between Python minor versions. Python 3.11 renamed and reorganised many opcodes. Python 3.12 added new specialised opcodes.
# DO NOT do this - will break on different Python versions:
assert instr.opcode == 90 # hardcoded opcode number
# DO this - use the name:
assert instr.opname == "LOAD_FAST"
# Or use the dis module's own mapping:
import dis
opcode_number = dis.opmap["LOAD_FAST"] # safe - looks up current version's mapping
:::
for Loop vs Generator Expression in sum()
import dis
data = range(10)
def sum_loop():
    total = 0
    for x in data:
        total += x
    return total

def sum_genexpr():
    return sum(x for x in data)

def sum_direct():
    return sum(data)
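sum_direct is the fastest - it passes the iterable directly to the C implementation of sum, so no Python bytecode runs per item. sum_genexpr gets no comparable shortcut: the call still creates a generator object (the only special treatment is syntactic - a sole generator-expression argument needs no extra parentheses), and every item resumes the generator's frame and executes its bytecode. It is therefore typically no faster than sum_loop, and often slower. Both sum_loop and sum_genexpr execute Python bytecode for every iteration, so measure on your own interpreter before choosing between them.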
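The bytecode makes the extra work in the generator-expression version visible: the function has to build a function object for the generator before calling sum. A minimal check, with the list names genexpr_ops and direct_ops chosen only for this illustration:

```python
import dis

data = range(10)

def sum_genexpr():
    return sum(x for x in data)

def sum_direct():
    return sum(data)

genexpr_ops = [instr.opname for instr in dis.get_instructions(sum_genexpr)]
direct_ops = [instr.opname for instr in dis.get_instructions(sum_direct)]

print("MAKE_FUNCTION in sum_genexpr:", "MAKE_FUNCTION" in genexpr_ops)
print("MAKE_FUNCTION in sum_direct: ", "MAKE_FUNCTION" in direct_ops)
```

Both functions return the same total; only the amount of Python-level machinery between the data and sum differs.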
Part 5 - Practical Uses of dis
Understanding Why Locals Beat Globals in Hot Loops
import dis
import math

def global_access():
    result = 0
    for i in range(1000):
        result += math.sqrt(i)  # LOAD_GLOBAL + LOAD_ATTR on every iteration
    return result

def local_access():
    sqrt = math.sqrt            # one LOAD_GLOBAL + LOAD_ATTR, then STORE_FAST
    result = 0
    for i in range(1000):
        result += sqrt(i)       # LOAD_FAST on every iteration
    return result

# Verify with dis:
dis.dis(global_access)
# Inner loop body includes: LOAD_GLOBAL math, LOAD_ATTR sqrt
dis.dis(local_access)
# Inner loop body includes: LOAD_FAST sqrt (much cheaper)
Spotting Unnecessary Attribute Lookups
import dis

class Processor:
    def __init__(self):
        self.count = 0

    def process_slow(self, items):
        for item in items:
            self.count += 1     # LOAD_FAST self, LOAD_ATTR count - every iteration
            self.count += item

    def process_fast(self, items):
        count = self.count      # one attribute load
        for item in items:
            count += 1          # LOAD_FAST count - no attribute lookup
            count += item
        self.count = count      # one attribute store

# dis shows the difference clearly:
dis.dis(Processor.process_slow)  # LOAD_ATTR count appears in the loop
dis.dis(Processor.process_fast)  # LOAD_ATTR only appears outside the loop
Using dis to Verify a Refactoring
import dis

# Before: multiple separate attribute lookups
def before(obj):
    obj.x = obj.x + 1
    obj.y = obj.y + 1
    obj.z = obj.z + 1

# After: reading attributes once
def after(obj):
    x, y, z = obj.x, obj.y, obj.z
    obj.x = x + 1
    obj.y = y + 1
    obj.z = z + 1

# Use dis to count LOAD_ATTR and STORE_ATTR instructions:
def count_opname(func, name):
    return sum(1 for i in dis.get_instructions(func) if i.opname == name)

print(f"before LOAD_ATTR: {count_opname(before, 'LOAD_ATTR')}")    # 3
print(f"after LOAD_ATTR: {count_opname(after, 'LOAD_ATTR')}")      # 3 (same - loads still needed)
print(f"before STORE_ATTR: {count_opname(before, 'STORE_ATTR')}")  # 3
print(f"after STORE_ATTR: {count_opname(after, 'STORE_ATTR')}")    # 3 (same)
# In this case, dis confirms the refactoring didn't reduce attribute ops -
# profile before optimising
:::danger Reading Bytecode Does Not Mean You Should Optimise at the Bytecode Level
Understanding bytecode is a diagnostic tool, not an optimisation guide. The correct workflow is:
- Profile first - use cProfile, line_profiler, or py-spy to find the actual bottleneck
- Identify the hot path - where does the program spend its time?
- Use dis to understand - confirm your mental model of what the hot path is doing
- Optimise at the Python level - use better algorithms, data structures, or libraries (NumPy, etc.)
- Only reach for C extensions as a last resort - Cython, cffi, or ctypes
Micro-optimising bytecode for non-hot paths is premature optimisation. A function called once at startup is not worth byte-counting. A function called 10 million times in the inner loop is.
:::
Part 6 - dis.Bytecode for Programmatic Analysis
For building tools that analyse bytecode, dis.Bytecode provides a clean programmatic interface:
import dis
from collections import Counter

def analyse_function(func):
    """Report opcodes used, sorted by frequency."""
    bc = dis.Bytecode(func)
    opcode_counts = Counter(instr.opname for instr in bc)
    print(f"Function: {func.__name__}")
    print(f"Total instructions: {sum(opcode_counts.values())}")
    print("Opcode frequency:")
    for opname, count in opcode_counts.most_common():
        print(f"  {opname:<25s} {count}")
    print()

def complex_example(data):
    result = []
    for item in data:
        if item > 0:
            result.append(item * 2)
        elif item < 0:
            result.append(-item)
    return result

analyse_function(complex_example)
Finding All Jump Targets
import dis

def find_branches(func):
    """Find all jump instructions (conditional and unconditional) in a function."""
    branches = []
    for instr in dis.get_instructions(func):
        if "JUMP" in instr.opname or instr.opname in ("FOR_ITER",):
            branches.append({
                "offset": instr.offset,
                "opname": instr.opname,
                "target": instr.argval,  # resolved target offset
                "line": instr.starts_line,
            })
    return branches

def example(x, items):
    if x > 0:
        for item in items:
            if item:
                return item
    return None

for branch in find_branches(example):
    print(branch)
Key Takeaways
- dis.dis(func) prints human-readable disassembly; dis.get_instructions(func) returns structured Instruction objects; dis.Bytecode(func) provides an object-oriented interface
- Each disassembly line shows: source line number (when it changes), bytecode offset, opcode name, argument, and a human-readable comment
- The >> prefix marks a jump target - an instruction that some other instruction jumps to
- LOAD_FAST (local variable) is an array index operation - significantly faster than LOAD_GLOBAL (dict lookup) or LOAD_ATTR (attribute lookup)
- The value stack is a LIFO stack; opcodes push, pop, or both; co_stacksize is the statically computed maximum depth
- x += 1 compiles to BINARY_OP += (in-place attempt); x = x + 1 compiles to BINARY_OP + (new object); for integers (immutable) the result is identical; for lists (mutable) += mutates in place and is faster
- List comprehensions build their result with LIST_APPEND (in a separate hidden code object up to Python 3.11, inlined from Python 3.12); this is faster than explicit for + append because it avoids per-iteration attribute lookup
- F-strings use FORMAT_VALUE + BUILD_STRING - usually faster than str() or .format() for simple substitutions
- Always use instr.opname (string) rather than instr.opcode (number) when writing tools - opcode numbers change between Python versions
- Profile before optimising - bytecode inspection is a diagnostic tool, not a guide to premature micro-optimisation
Graded Practice Challenges
Level 1 - Predict the Output
Question 1: How many LOAD_FAST instructions does this function contain?
import dis

def process(a, b, c):
    x = a + b
    y = x * c
    return x + y

count = sum(1 for i in dis.get_instructions(process) if i.opname == "LOAD_FAST")
print(count)
Show Answer
Output: 6
Trace the uses of local variables statement by statement:
- x = a + b: LOAD_FAST a, LOAD_FAST b (2 loads)
- y = x * c: LOAD_FAST x, LOAD_FAST c (2 loads)
- return x + y: LOAD_FAST x, LOAD_FAST y (2 loads)
Total: 6 LOAD_FAST instructions.
(This is the count in CPython 3.10-3.12. Later versions may fuse adjacent loads into combined instructions such as LOAD_FAST_LOAD_FAST, which changes the count.)
Question 2: What is the difference in the dis output between these two functions?
import dis

def f1():
    x = [1, 2, 3]
    return x

def f2():
    return [1, 2, 3]

dis.dis(f1)
dis.dis(f2)
Show Answer
f1 has a STORE_FAST x and then a LOAD_FAST x before RETURN_VALUE - one extra round-trip through the local variable array. f2 has no STORE_FAST/LOAD_FAST at all - the freshly built list goes straight to RETURN_VALUE.
In CPython 3.12 the compiler does not eliminate the redundant store-and-load in f1: it performs only limited peephole optimisations and is not an optimising compiler in general. f2 is therefore strictly shorter at the bytecode level, although the runtime difference is tiny.
Question 3: What does this print?
import dis

def short_circuit(a, b):
    return a or b

instructions = list(dis.get_instructions(short_circuit))
jump_instrs = [i for i in instructions if "JUMP" in i.opname]
print(len(jump_instrs))
print(jump_instrs[0].opname)
Show Answer
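Output (Python 3.12):
1
POP_JUMP_IF_TRUE
In Python 3.12 the or operator compiles to COPY 1 followed by POP_JUMP_IF_TRUE: the left operand is duplicated and the copy is tested. If it is truthy, the jump is taken with the original still on the stack as the result; if it is falsy, POP_TOP discards it and the right operand is evaluated. There is exactly one jump instruction. On Python 3.11 and earlier the same logic is a single JUMP_IF_TRUE_OR_POP instruction, so the second line printed there is JUMP_IF_TRUE_OR_POP.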
Question 4: True or False - a list comprehension and its equivalent for loop always produce the same bytecode?
def list_comp():
    return [x for x in range(5)]

def for_loop():
    result = []
    for x in range(5):
        result.append(x)
    return result
Show Answer
False. They are semantically equivalent but produce different bytecode. The list comprehension builds its result with LIST_APPEND - inside a nested code object (a hidden <listcomp> function created with MAKE_FUNCTION and then called) on Python 3.11 and earlier, or inlined into the enclosing function from Python 3.12. The for loop uses LOAD_ATTR to get result.append, then CALL on each iteration. They are different instruction sequences, and the comprehension version is generally faster due to the LIST_APPEND opcode.
Question 5: What does this print?
import dis

def f(x):
    if x:
        return 1
    return 2

has_two_returns = sum(
    1 for i in dis.get_instructions(f) if i.opname == "RETURN_VALUE"
)
print(has_two_returns)
Show Answer
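Output: 2 on Python 3.11 and earlier; 0 on Python 3.12.
CPython emits a separate return instruction for each return path through the function rather than merging them into a single exit point. Up to Python 3.11, each path is LOAD_CONST followed by RETURN_VALUE, giving two RETURN_VALUE instructions. Python 3.12 added RETURN_CONST, which returns a constant directly, so both return 1 and return 2 compile to RETURN_CONST and the count of RETURN_VALUE drops to zero. This is a concrete reminder that bytecode tools should not assume one version's opcode set.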
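A version-tolerant way to count return paths is to accept every opcode name that performs a return. The sketch below assumes the two names RETURN_VALUE and RETURN_CONST cover the interpreters you care about; the names RETURN_OPNAMES and count_returns are hypothetical.

```python
import dis

def f(x):
    if x:
        return 1
    return 2

# Opcode names that return from a function; which one appears depends on the version.
RETURN_OPNAMES = {"RETURN_VALUE", "RETURN_CONST"}

def count_returns(func):
    """Count return instructions regardless of which return opcode the interpreter emits."""
    return sum(
        1 for instr in dis.get_instructions(func) if instr.opname in RETURN_OPNAMES
    )

print(count_returns(f))
```

Matching on a set of names keeps the tool working when an opcode is added or removed between releases.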
Level 2 - Debug Challenge
A developer uses dis to try to confirm that a performance optimisation worked. Find the flaw in their reasoning:
import dis

# Original version
def process_original(items):
    result = []
    for item in items:
        result.append(item * 2)
    return result

# "Optimised" version - developer claims it avoids attribute lookup
def process_optimised(items):
    result = []
    append = result.append  # cache the bound method
    for item in items:
        append(item * 2)
    return result

# Developer's analysis:
orig_attrs = sum(1 for i in dis.get_instructions(process_original) if i.opname == "LOAD_ATTR")
opt_attrs = sum(1 for i in dis.get_instructions(process_optimised) if i.opname == "LOAD_ATTR")
print(f"Original LOAD_ATTR count: {orig_attrs}")   # prints 1 (for .append in loop)
print(f"Optimised LOAD_ATTR count: {opt_attrs}")   # prints 1 (for .append in setup)
print("Optimisation saved:", orig_attrs - opt_attrs, "attribute lookups per call")
# prints "Optimisation saved: 0 attribute lookups per call"
# Developer concludes: "no improvement - not worth it"
Show Solution
The flaw: The developer is counting LOAD_ATTR instructions in the function definition, not in the loop body. The dis output shows the bytecode statically - it does not account for how many times each instruction executes at runtime.
In process_original, LOAD_ATTR append is inside the for loop - it executes once per item in items. In process_optimised, LOAD_ATTR append is outside the loop - it executes exactly once regardless of how many items there are.
Correct analysis - count instructions per loop iteration, not per function:
import dis

def instructions_in_loop(func):
    """Roughly identify which instructions are inside a for loop."""
    instrs = list(dis.get_instructions(func))
    # Find FOR_ITER (start of loop) and JUMP_BACKWARD (end of loop)
    for_iter_offsets = [i.offset for i in instrs if i.opname == "FOR_ITER"]
    jump_back_offsets = [i.offset for i in instrs if i.opname == "JUMP_BACKWARD"]
    if not for_iter_offsets or not jump_back_offsets:
        return []
    loop_start = for_iter_offsets[0]
    loop_end = jump_back_offsets[-1]
    return [
        i for i in instrs
        if loop_start < i.offset <= loop_end
    ]

orig_loop = instructions_in_loop(process_original)
opt_loop = instructions_in_loop(process_optimised)
orig_load_attr = sum(1 for i in orig_loop if i.opname == "LOAD_ATTR")
opt_load_attr = sum(1 for i in opt_loop if i.opname == "LOAD_ATTR")
print(f"Original LOAD_ATTR per iteration: {orig_load_attr}")   # 1
print(f"Optimised LOAD_ATTR per iteration: {opt_load_attr}")   # 0
print(f"Optimisation saves {orig_load_attr - opt_load_attr} LOAD_ATTR per iteration")
# Optimisation saves 1 LOAD_ATTR per iteration - meaningful at large scale
The optimisation is real. For 1 million items, it saves 1 million LOAD_ATTR operations. The original analysis was wrong because it counted total instructions per function call, not per loop iteration.
Level 3 - Design Challenge
Design a BytecodeProfiler class that:
- Accepts any Python function
- Analyses the bytecode to classify instructions by category (loads, stores, calls, builds, jumps, returns, arithmetic)
- Identifies which instructions are inside loop bodies vs outside loops
- Produces a cost estimate by weighting instruction categories (e.g., LOAD_ATTR is more expensive than LOAD_FAST)
- Reports a human-readable summary with a hotspot warning if expensive instructions are inside loops
# Target usage:
def slow_func(items):
    result = []
    for item in items:
        result.append(item.strip().upper())
    return result

profiler = BytecodeProfiler(slow_func)
profiler.report()
# Function: slow_func
# Total instructions: N
# Instructions in loop body: M
# Estimated cost per iteration: K units
# HOTSPOT WARNING: LOAD_ATTR found inside loop body (3 occurrences)
# Consider caching: str.strip, str.upper, list.append
Show Reference Solution
import dis
from collections import Counter, defaultdict

# Relative cost weights per opcode (rough heuristics, not measurements)
INSTRUCTION_COSTS = {
    "LOAD_FAST": 1,
    "STORE_FAST": 1,
    "LOAD_CONST": 1,
    "LOAD_GLOBAL": 3,
    "STORE_GLOBAL": 3,
    "LOAD_ATTR": 5,
    "STORE_ATTR": 5,
    "LOAD_DEREF": 2,
    "STORE_DEREF": 2,
    "CALL": 10,
    "BINARY_OP": 2,
    "COMPARE_OP": 2,
    "BUILD_LIST": 2,
    "BUILD_TUPLE": 2,
    "BUILD_MAP": 3,
    "FOR_ITER": 3,
    "GET_ITER": 2,
    "JUMP_BACKWARD": 1,
    "POP_JUMP_IF_FALSE": 1,
    "POP_JUMP_IF_TRUE": 1,
    "RETURN_VALUE": 1,
}
EXPENSIVE_IN_LOOP = {"LOAD_ATTR", "STORE_ATTR", "LOAD_GLOBAL", "CALL"}

class BytecodeProfiler:
    def __init__(self, func):
        self._func = func
        self._instructions = list(dis.get_instructions(func))

    def _find_loop_range(self):
        """Return (start_offset, end_offset) spanning the first FOR_ITER to the last JUMP_BACKWARD, or None."""
        for_iter = next(
            (i for i in self._instructions if i.opname == "FOR_ITER"), None
        )
        jump_back = next(
            (i for i in reversed(self._instructions) if i.opname == "JUMP_BACKWARD"),
            None,
        )
        if for_iter and jump_back:
            return (for_iter.offset, jump_back.offset)
        return None

    def _classify(self, instr):
        name = instr.opname
        if name.startswith("LOAD"):
            return "load"
        if name.startswith("STORE"):
            return "store"
        if name in ("CALL", "CALL_FUNCTION", "CALL_FUNCTION_KW"):
            return "call"
        if name.startswith("BUILD"):
            return "build"
        if "JUMP" in name or name in ("FOR_ITER", "GET_ITER"):
            return "control"
        if name in ("BINARY_OP", "COMPARE_OP", "UNARY_NEGATIVE", "UNARY_NOT"):
            return "arithmetic"
        if name == "RETURN_VALUE":
            return "return"
        return "other"

    def _cost(self, instr):
        # Unknown opcodes get a default weight of 2
        return INSTRUCTION_COSTS.get(instr.opname, 2)

    def report(self):
        loop_range = self._find_loop_range()
        loop_instrs = []
        outer_instrs = []
        for instr in self._instructions:
            if loop_range and loop_range[0] < instr.offset <= loop_range[1]:
                loop_instrs.append(instr)
            else:
                outer_instrs.append(instr)
        total = len(self._instructions)
        loop_count = len(loop_instrs)
        loop_cost = sum(self._cost(i) for i in loop_instrs)
        categories = defaultdict(int)
        for instr in self._instructions:
            categories[self._classify(instr)] += 1
        print(f"Function: {self._func.__name__}")
        print(f"Total instructions: {total}")
        print(f"Instructions in loop body: {loop_count}")
        print(f"Estimated cost per loop iteration: {loop_cost} units")
        print()
        print("Instruction categories:")
        for cat, count in sorted(categories.items(), key=lambda x: -x[1]):
            print(f"  {cat:<12s} {count}")
        # Hotspot warnings
        if loop_instrs:
            print()
            hotspots = [
                i for i in loop_instrs if i.opname in EXPENSIVE_IN_LOOP
            ]
            if hotspots:
                hotspot_counts = Counter(i.opname for i in hotspots)
                print("HOTSPOT WARNING: Expensive operations inside loop body:")
                for opname, count in hotspot_counts.most_common():
                    attrs = [
                        i.argrepr for i in hotspots
                        if i.opname == opname
                    ]
                    # sorted() keeps the report order stable between runs
                    print(f"  {opname} ({count} occurrences): {', '.join(sorted(set(attrs)))}")
                print("  Consider caching attribute lookups and globals outside the loop.")
            else:
                print("No hotspot warnings - loop body looks clean.")

# Demo:
def slow_func(items):
    result = []
    for item in items:
        result.append(item.strip().upper())
    return result

profiler = BytecodeProfiler(slow_func)
profiler.report()
Key design decisions:
- _find_loop_range identifies the loop body by finding FOR_ITER (loop header) and JUMP_BACKWARD (loop tail) - instructions between these offsets are inside the loop
- The cost weights in INSTRUCTION_COSTS are rough relative heuristics, not profiled measurements - the tool is for directional guidance, not precise benchmarking
- EXPENSIVE_IN_LOOP flags LOAD_ATTR, STORE_ATTR, LOAD_GLOBAL, and CALL as worth investigating when found in loop bodies
- The tool complements - it does not replace - a real profiler like cProfile or py-spy
What's Next
Lesson 04 covers The GIL Explained - what CPython's Global Interpreter Lock actually is, why it exists, what it protects, how it interacts with threads and I/O, when it matters in practice, and what Python 3.12+ is doing to weaken it. You have now seen the eval loop at the bytecode level. The GIL is what controls which thread gets to run that eval loop at any given moment.
