Disassembly with dis - Reading CPython Bytecode

Reading time: ~40 minutes | Level: Intermediate → Engineering

Before reading further, predict what dis.dis(mystery) will print, and what mystery(3) and mystery(-2) will return:

import dis

def mystery(x):
    return x * 2 if x > 0 else -x

dis.dis(mystery)

Write out the opcode sequence you expect to see, in order.

Almost no one gets this right without having read bytecode before. The actual dis.dis output on Python 3.12 (offsets and argument values differ in other versions):

  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (x)
              4 LOAD_CONST               1 (0)
              6 COMPARE_OP              68 (>)
             10 POP_JUMP_IF_FALSE        5 (to 22)
             12 LOAD_FAST                0 (x)
             14 LOAD_CONST               2 (2)
             16 BINARY_OP                5 (*)
             20 RETURN_VALUE
        >>   22 LOAD_FAST                0 (x)
             24 UNARY_NEGATIVE
             26 RETURN_VALUE

mystery(3) returns 6. mystery(-2) returns 2.

Three things are worth noticing here. First: the bytecode order is the reverse of the source order. You read x * 2 first, but the compiler emits the x > 0 test first and the "true" value second. Second: only one branch ever runs. POP_JUMP_IF_FALSE either falls through into x * 2 or jumps over it to offset 22, so x * 2 is never computed when the condition is false. Third: there are two RETURN_VALUE instructions - CPython generates one per branch rather than one shared exit point.

Once you understand why, you understand something about how CPython's compiler works. That is what this lesson builds.
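One way to check the evaluation order of a conditional expression without reading any bytecode is to give each part a visible side effect. The sketch below uses three hypothetical helpers (condition, true_value, false_value, none of which appear elsewhere in this lesson) that record when they are called. It relies only on the language rule that the condition is evaluated first and exactly one branch runs.

```python
calls = []

def condition(result):
    calls.append("condition")
    return result

def true_value():
    calls.append("true branch")
    return "T"

def false_value():
    calls.append("false branch")
    return "F"

def pick(flag):
    """Evaluate a conditional expression and report which parts ran, in order."""
    calls.clear()
    value = true_value() if condition(flag) else false_value()
    return value, list(calls)

print(pick(True))   # ('T', ['condition', 'true branch'])
print(pick(False))  # ('F', ['condition', 'false branch'])
```

The branch that is not selected never appears in the log, which matches a compiled form where a conditional jump skips over it.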

What You Will Learn

  • The dis.dis(), dis.disassemble(), and dis.get_instructions() API
  • How to read disassembly output: offset, line number, opcode name, argument, comment
  • The key opcodes and what they do on the value stack
  • Value stack evolution step by step through a real example
  • How equivalent Python patterns compile differently (or identically)
  • Practical performance insights from bytecode comparison
  • The dis.Bytecode object for structured programmatic access

Prerequisites

  • Lesson 01: CPython Architecture (the eval loop, the value stack)
  • Lesson 02: Bytecode Inspection (the code object and its attributes)

Part 1 - The dis Module API

dis.dis() - Human-Readable Disassembly

dis.dis() disassembles a function, method, class, module, string of source, or bytes-like object and prints the result:

import dis

def add(a, b):
    return a + b

dis.dis(add)

Output (Python 3.12):

  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

Reading the Output Format

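Each line has up to six fields: the five shown here, plus an optional jump-target marker:

  3           2 LOAD_FAST                0 (a)
  ^           ^ ^                        ^ ^
  |           | |                        | +-- human-readable comment (variable name, constant value)
  |           | |                        +---- opcode argument (index or value)
  |           | +----------------------------- opcode name
  |           +------------------------------- bytecode offset (in bytes from the start of co_code)
  +------------------------------------------- source line number (shown only at the first instruction of each line)

The optional >> prefix marks a jump target:

        >>   22 LOAD_FAST                0 (x)
        ^^   ^^
        ||   ++-- bytecode offset
        ++------- ">>" marks this instruction as a jump target (no line number: same source line as the previous instruction)

The offset is the byte position of this instruction within co_code. Since Python 3.6, every code unit is exactly 2 bytes (opcode byte + argument byte), so offsets are always even. They do not always increase by exactly 2 in the printed listing, though: since Python 3.11, some instructions (BINARY_OP, COMPARE_OP, LOAD_GLOBAL, LOAD_ATTR, CALL, and others) are followed by inline CACHE entries that dis hides by default, so the next visible offset can be 4 or more bytes later. Arguments larger than 255 are encoded with a preceding EXTENDED_ARG instruction.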
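To see how visible offsets relate to the raw size of co_code, the sketch below compares the number of 2-byte code units in co_code with the number of instructions dis reports. The helper name visible_offsets is illustrative. On Python 3.11 and later the difference is mostly hidden inline cache entries; on older versions it is expected to be zero for a function this small.

```python
import dis

def add(a, b):
    return a + b

def visible_offsets(func):
    """Return the byte offsets of the instructions dis shows by default."""
    return [instr.offset for instr in dis.get_instructions(func)]

offsets = visible_offsets(add)
print("visible offsets:", offsets)

# Each code unit is 2 bytes wide, so every offset is even.
print("all even:", all(offset % 2 == 0 for offset in offsets))

# Code units that exist in co_code but are not listed as instructions.
code_units = len(add.__code__.co_code) // 2
hidden_units = code_units - len(offsets)
print(f"{code_units} code units, {len(offsets)} visible, {hidden_units} hidden")
```

Counting from co_code rather than asking dis for cache entries keeps the sketch independent of how a given Python version exposes them.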

dis.get_instructions() - Structured Access

dis.get_instructions() returns an iterator of dis.Instruction named tuples - useful when writing tools that process bytecode programmatically:

import dis

def greet(name):
    return f"Hello, {name}"

for instr in dis.get_instructions(greet):
    print(
        f"{instr.offset:4d} {instr.opname:<20s} "
        f"arg={instr.argval!r:20} "
        f"line={instr.starts_line}"
    )

dis.Instruction fields:

  • opname: the instruction name as a string
  • opcode: the numeric opcode
  • arg: the raw argument integer
  • argval: the resolved argument (e.g., the actual variable name instead of the index)
  • argrepr: human-readable representation
  • offset: byte offset in co_code
  • starts_line: source line number if this instruction starts a new line, else None
  • is_jump_target: True if this instruction is a jump target
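As a small illustration of these fields, the sketch below groups instruction offsets by opname and collects the offsets flagged as jump targets. The clamp function and both helper names are made up for this example, and only fields listed above are used.

```python
import dis
from collections import defaultdict

def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def offsets_by_opname(func):
    """Map each opname to the list of offsets where it occurs."""
    table = defaultdict(list)
    for instr in dis.get_instructions(func):
        table[instr.opname].append(instr.offset)
    return dict(table)

def jump_target_offsets(func):
    """Offsets that some other instruction jumps to (the '>>' lines)."""
    return [i.offset for i in dis.get_instructions(func) if i.is_jump_target]

for opname, where in sorted(offsets_by_opname(clamp).items()):
    print(f"{opname:<20s} {where}")
print("jump targets:", jump_target_offsets(clamp))
```

The exact opnames and offsets printed depend on the Python version, which is why the sketch prints them rather than assuming them.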

dis.Bytecode - Object-Oriented Interface

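dis.Bytecode wraps a function, method, generator, code object, or string of source code and provides iterable and printable access to its bytecode (it is not indexable; use list(bc) if you need positional access):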

import dis

def compute(x, y):
    return x ** 2 + y ** 2

bc = dis.Bytecode(compute)
print(bc.info())  # summary of the code object
print(bc.dis())   # same as dis.dis() but returned as a string

for instr in bc:
    if instr.opname == "BINARY_OP":
        print(f"Binary operation: {instr.argrepr} at offset {instr.offset}")

Part 2 - Key Opcodes Explained

Load and Store Opcodes

These opcodes move values between the frame's storage and the value stack:

| Opcode | What it does | Speed |
| --- | --- | --- |
| LOAD_FAST | Push a local variable (from co_varnames) onto the stack | Fast - direct array index |
| STORE_FAST | Pop from stack; store in local variable array | Fast - direct array index |
| LOAD_GLOBAL | Push a global name (look up in f_globals then f_builtins) | Slower - dict lookup |
| STORE_GLOBAL | Pop from stack; store in f_globals | Slower - dict store |
| LOAD_CONST | Push a constant (from co_consts) | Fast - direct array index |
| LOAD_DEREF | Push a value from a cell (closure variable) | Medium - cell dereference |
| STORE_DEREF | Pop from stack; store in a cell | Medium - cell dereference |
| LOAD_ATTR | Pop object; push getattr(object, name) | Slowest - attribute lookup |

:::tip LOAD_FAST Is Faster Than LOAD_GLOBAL
LOAD_FAST is an indexed access into the frame's local variable array - essentially fastlocals[i]. LOAD_GLOBAL involves a dictionary hash lookup in f_globals, then potentially another in f_builtins.

For extremely hot loops that access a global function many thousands of times, pulling it into a local variable can make a measurable difference:

import math

def hot_loop_global(data):
    return [math.sqrt(x) for x in data]  # LOAD_GLOBAL math, LOAD_ATTR sqrt each iteration

def hot_loop_local(data):
    _sqrt = math.sqrt  # one LOAD_GLOBAL + LOAD_ATTR, then STORE_FAST
    return [_sqrt(x) for x in data]  # LOAD_FAST _sqrt each iteration (Python 3.12)

# Profile before optimising - only do this in measured hot paths

This is a micro-optimisation. Apply it only after profiling confirms it is the bottleneck.
:::
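Whether the local alias pays off is something to measure, not assume. The sketch below times both variants with timeit; the input size and repeat counts are arbitrary choices, and the numbers will differ between machines and Python versions (the adaptive interpreter in 3.11+ narrows the gap).

```python
import math
import timeit

def hot_loop_global(data):
    return [math.sqrt(x) for x in data]

def hot_loop_local(data):
    _sqrt = math.sqrt
    return [_sqrt(x) for x in data]

data = list(range(10_000))

# Both variants must compute the same thing before timing means anything.
same_result = hot_loop_global(data) == hot_loop_local(data)
print("same result:", same_result)

# Take the best of a few repeats to reduce noise from other processes.
t_global = min(timeit.repeat(lambda: hot_loop_global(data), number=20, repeat=3))
t_local = min(timeit.repeat(lambda: hot_loop_local(data), number=20, repeat=3))
print(f"global lookup: {t_global:.4f}s  local alias: {t_local:.4f}s")
```

If the two timings are within noise of each other on your interpreter, the extra local variable is not worth the loss in readability.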

Function Call Opcodes

In Python 3.10 and earlier:

  • CALL_FUNCTION n - call a function with n positional arguments
  • CALL_FUNCTION_KW n - call with keyword arguments
  • CALL_FUNCTION_EX - call with *args and **kwargs unpacking
  • LOAD_METHOD / CALL_METHOD n - optimised method calls

In Python 3.11+:

  • PUSH_NULL - push a NULL marker for the call protocol
  • CALL n - unified call instruction replacing CALL_FUNCTION, CALL_FUNCTION_KW, and CALL_METHOD
  • CALL_FUNCTION_EX - still used for calls with *args and **kwargs unpacking
  • PRECALL n - setup before CALL (3.11 only, removed in 3.12)

import dis

def caller():
    return len([1, 2, 3])

dis.dis(caller)
# Python 3.12 output (abridged):
#   LOAD_GLOBAL   1 (NULL + len)  -- pushes a NULL marker and len
#   BUILD_LIST    0               -- builds []
#   LOAD_CONST    1 ((1, 2, 3))
#   LIST_EXTEND   1               -- extends the list from the constant tuple
#   CALL          1               -- calls len([1, 2, 3])
#   RETURN_VALUE

Build Opcodes

These create new collection objects:

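| Opcode | What it builds |
| --- | --- |
| BUILD_LIST n | Pop n items from stack; push a list |
| BUILD_TUPLE n | Pop n items from stack; push a tuple |
| BUILD_SET n | Pop n items from stack; push a set |
| BUILD_MAP n | Pop 2n items (alternating key, value); push a dict |
| BUILD_STRING n | Pop n strings; concatenate; push result |
| FORMAT_VALUE | Format a value for an f-string (with optional format spec) |

import dis

def build_examples():
    a = [1, 2, 3]
    b = (4, 5)
    c = {"x": 1}

dis.dis(build_examples)
# Python 3.12 output (abridged, constant indices omitted):
#   BUILD_LIST    0
#   LOAD_CONST    ((1, 2, 3))
#   LIST_EXTEND   1            -- constant list: build [], then extend from a constant tuple
#   STORE_FAST    0 (a)
#   LOAD_CONST    ((4, 5))     -- a constant tuple is a single constant, no BUILD_TUPLE
#   STORE_FAST    1 (b)
#   LOAD_CONST    ('x')
#   LOAD_CONST    (1)
#   BUILD_MAP     1            -- pops one key/value pair, pushes {'x': 1}
#   STORE_FAST    2 (c)
#   ...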

Iteration Opcodes

import dis

def sum_squares(items):
    total = 0
    for x in items:
        total += x * x
    return total

dis.dis(sum_squares)

Key iteration opcodes:

  • GET_ITER - calls iter() on the top-of-stack object; pushes the iterator
  • FOR_ITER delta - calls next() on the iterator; if the iterator is exhausted, jump forward past the loop body (the argument is a relative delta; dis prints the resolved target as "to N"); else push the value and continue
  • The loop body executes; at the end, an unconditional JUMP_BACKWARD returns to FOR_ITER

Arithmetic and Comparison Opcodes

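Since Python 3.11, arithmetic uses a single BINARY_OP instruction with an argument encoding the operation:

| BINARY_OP argument | Operation |
| --- | --- |
| 0 | + |
| 1 | & |
| 2 | // |
| 3 | << |
| 4 | @ (matmul) |
| 5 | * |
| 6 | \| (bitwise or) |
| 7 | ** |
| 8 | % |
| 9 | >> |
| 10 | - |
| 11 | / |
| 12 | ^ |
| 13 | += (in-place) |
| ... | (in-place variants) |

COMPARE_OP handles the rich comparison operators (<, <=, ==, !=, >, >=). Since Python 3.9, membership tests (in, not in) compile to CONTAINS_OP and identity tests (is, is not) compile to IS_OP.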
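Which opcode a given comparison compiles to can be checked directly. The sketch below assumes Python 3.9 or later, where membership and identity tests have their own opcodes (CONTAINS_OP and IS_OP) separate from COMPARE_OP. The three tiny functions and the opnames helper are illustrative names only.

```python
import dis

def opnames(func):
    """List the opnames of a function's instructions, in order."""
    return [instr.opname for instr in dis.get_instructions(func)]

def uses_lt(a, b):
    return a < b

def uses_in(a, b):
    return a in b

def uses_is(a, b):
    return a is b

for func in (uses_lt, uses_in, uses_is):
    # Keep only the comparison-like opcodes for a compact report.
    found = [n for n in opnames(func) if n in ("COMPARE_OP", "CONTAINS_OP", "IS_OP")]
    print(f"{func.__name__:<8s} {found}")
```

The negated forms (not in, is not) use the same opcodes with a different argument, which you can confirm by adding two more functions to the loop.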

Jump Opcodes

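| Opcode | Behaviour |
| --- | --- |
| POP_JUMP_IF_FALSE | Pop; if falsy, jump to the target (a relative delta since 3.11; dis prints the resolved offset) |
| POP_JUMP_IF_TRUE | Pop; if truthy, jump to the target |
| JUMP_FORWARD | Unconditional jump forward |
| JUMP_BACKWARD | Unconditional jump backward (end of a loop body; 3.11+) |
| JUMP_IF_FALSE_OR_POP | Short-circuit and (3.11 and earlier; 3.12 emits COPY, POP_JUMP_IF_FALSE, POP_TOP instead) |
| JUMP_IF_TRUE_OR_POP | Short-circuit or (3.11 and earlier; 3.12 emits COPY, POP_JUMP_IF_TRUE, POP_TOP instead) |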

Part 3 - Value Stack Evolution

Step-by-Step Through a + b

The value stack is a LIFO stack of PyObject * pointers within the frame. Here is how LOAD_FAST a → LOAD_FAST b → BINARY_OP + evolves the stack for add(3, 4):

LOAD_FAST a      # push a                 stack: [3]
LOAD_FAST b      # push b                 stack: [3, 4]
BINARY_OP +      # pop 4 and 3; push 7    stack: [7]
RETURN_VALUE     # pop 7; return it       stack: []

Most opcodes push, pop, or both. The compiler statically computes the maximum stack depth (co_stacksize) so CPython can allocate the right amount of space in the frame.
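The push and pop behaviour can be mimicked with a toy interpreter. The sketch below is not how CPython's eval loop is implemented; it handles only three made-up instruction names (BINARY_ADD stands in for BINARY_OP with the + argument) and tracks the deepest the stack ever gets, which is the quantity co_stacksize bounds.

```python
def run_toy(program, local_vars):
    """Interpret a tiny instruction subset; return (result, max_stack_depth)."""
    stack = []
    max_depth = 0
    for opname, arg in program:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])      # push a local variable
        elif opname == "BINARY_ADD":
            right = stack.pop()                # operands come off in reverse order
            left = stack.pop()
            stack.append(left + right)         # push the result
        elif opname == "RETURN_VALUE":
            return stack.pop(), max_depth      # pop the result and stop
        else:
            raise ValueError(f"unsupported instruction: {opname}")
        max_depth = max(max_depth, len(stack))
    raise RuntimeError("program ended without RETURN_VALUE")

program = [
    ("LOAD_FAST", "a"),
    ("LOAD_FAST", "b"),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]

result, depth = run_toy(program, {"a": 3, "b": 4})
print(result, depth)  # 7 2
```

The maximum depth of 2 is reached right after the second load, before the addition collapses both operands into one result.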

Following the Mystery Function

Let's trace mystery(3) step by step:

def mystery(x):
    return x * 2 if x > 0 else -x

dis output (Python 3.12):
RESUME 0
LOAD_FAST 0 (x)            # push x=3                          stack: [3]
LOAD_CONST 1 (0)           # push 0                            stack: [3, 0]
COMPARE_OP 68 (>)          # pop 0,3; 3>0=True; push True      stack: [True]
POP_JUMP_IF_FALSE to 22    # pop True; True so do NOT jump     stack: []
LOAD_FAST 0 (x)            # push x=3                          stack: [3]
LOAD_CONST 2 (2)           # push 2                            stack: [3, 2]
BINARY_OP 5 (*)            # pop 2,3; push 6                   stack: [6]
RETURN_VALUE               # pop 6; return 6

Now trace mystery(-2):

RESUME 0
LOAD_FAST 0 (x)            # push x=-2                         stack: [-2]
LOAD_CONST 1 (0)           # push 0                            stack: [-2, 0]
COMPARE_OP 68 (>)          # pop 0,-2; -2>0=False; push False  stack: [False]
POP_JUMP_IF_FALSE to 22    # pop False; False so JUMP to 22    stack: []
>> 22:
LOAD_FAST 0 (x)            # push x=-2                         stack: [-2]
UNARY_NEGATIVE             # pop -2; push 2                    stack: [2]
RETURN_VALUE               # pop 2; return 2

Notice: the stack is empty when either branch starts. POP_JUMP_IF_FALSE consumes the comparison result, and x * 2 is never computed on the else path. This is how CPython's compiler handles conditional expressions: it emits the condition first, then a conditional jump, then the "true" value, and places the "false" value at the jump target. Each path leaves exactly one value on the stack, which RETURN_VALUE pops and returns.

:::note Opcodes Changed Significantly in Python 3.11+
Python 3.11 introduced the "specialising adaptive interpreter": after an instruction has executed enough times, CPython replaces generic opcodes with specialised ones. LOAD_GLOBAL might become LOAD_GLOBAL_MODULE (bypassing the builtins lookup). BINARY_OP for two integers might become BINARY_OP_ADD_INT. These specialisations are invisible to your Python code but make it faster. By default dis.dis() shows the original (non-specialised) opcodes; pass adaptive=True to see the specialised forms. Python 3.12 extended this further with more specialised opcodes.
:::
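The specialised forms can be inspected by warming a function up and asking dis for the adaptive view. The sketch below is guarded by a version check because the adaptive keyword only exists on Python 3.11 and later; the warm-up count of 1000 calls is an arbitrary choice, and which specialised names appear (if any) depends on the interpreter version.

```python
import dis
import sys

def add_ints(a, b):
    return a + b

# Warm the function up so the interpreter has a chance to specialise it.
for _ in range(1000):
    add_ints(1, 2)

def opnames(func, adaptive=False):
    """Return opnames; request the specialised view only where it is supported."""
    if sys.version_info >= (3, 11):
        return [i.opname for i in dis.get_instructions(func, adaptive=adaptive)]
    return [i.opname for i in dis.get_instructions(func)]

generic = opnames(add_ints)
specialised = opnames(add_ints, adaptive=True)

# Print the two views side by side; rows differ where specialisation happened.
for before, after in zip(generic, specialised):
    marker = "" if before == after else "  <-- specialised"
    print(f"{before:<20s} {after}{marker}")
```

Specialisation rewrites instructions in place, so the two listings line up row by row.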

Part 4 - Comparing Equivalent Python Patterns

x = x + 1 vs x += 1

import dis

def plus_assign(x):
    x = x + 1
    return x

def inplace(x):
    x += 1
    return x

print("=== x = x + 1 ===")
dis.dis(plus_assign)
print("=== x += 1 ===")
dis.dis(inplace)

Expected output pattern:

=== x = x + 1 ===
LOAD_FAST 0 (x)
LOAD_CONST 1 (1)
BINARY_OP 0 (+) # creates a new object
STORE_FAST 0 (x)
LOAD_FAST 0 (x)
RETURN_VALUE

=== x += 1 ===
LOAD_FAST 0 (x)
LOAD_CONST 1 (1)
BINARY_OP 13 (+=) # in-place if supported by the type
STORE_FAST 0 (x)
LOAD_FAST 0 (x)
RETURN_VALUE

For integers, += and + produce the same result because integers are immutable - there is no in-place operation. For mutable types like lists, += calls __iadd__ which modifies in place and is faster:

a = [1, 2, 3]
b = a
a += [4] # calls a.__iadd__([4]) - modifies a in place; b also sees the change
print(b) # [1, 2, 3, 4] - same object

a = [1, 2, 3]
b = a
a = a + [4] # creates a new list; a now points to new object; b unchanged
print(b) # [1, 2, 3]
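The same dispatch rule applies to user-defined classes. In the sketch below, Tally and FrozenTally are hypothetical classes written for this illustration: the first defines __iadd__ and is mutated in place, the second defines only __add__, so += falls back to building a new object.

```python
class Tally:
    """Mutable: += is handled by __iadd__ and keeps the same object."""
    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        return Tally(self.items + list(other))

    def __iadd__(self, other):
        self.items.extend(other)
        return self


class FrozenTally:
    """No __iadd__: += falls back to __add__ and rebinds the name."""
    def __init__(self, items):
        self.items = tuple(items)

    def __add__(self, other):
        return FrozenTally(self.items + tuple(other))


t = Tally([1])
t_alias = t
t += [2]
print(t is t_alias, t.items)   # True [1, 2]

f = FrozenTally([1])
f_alias = f
f += [2]
print(f is f_alias, f.items)   # False (1, 2)
```

Both statements compile to the same in-place BINARY_OP; the difference in behaviour comes entirely from which special methods the type provides.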

List Comprehension vs for Loop

import dis

def list_comp(items):
    return [x * 2 for x in items]

def for_loop(items):
    result = []
    for x in items:
        result.append(x * 2)
    return result

dis.dis(list_comp)

The key difference depends on the Python version. Up to Python 3.11, list_comp shows a MAKE_FUNCTION instruction: the comprehension compiles to a hidden <listcomp> code object that runs in its own scope, and the outer function creates this inner function and calls it with GET_ITER and a call instruction. Since Python 3.12 (PEP 709), the comprehension is inlined into the enclosing function: there is no MAKE_FUNCTION or call, and the loop (FOR_ITER, LIST_APPEND, JUMP_BACKWARD) appears directly in list_comp's own bytecode. In both cases the list is built with LIST_APPEND.

The for_loop version looks up result.append (LOAD_ATTR in 3.12, LOAD_METHOD in earlier versions) and executes a call on every iteration.

In practice, list comprehensions are usually faster than equivalent for loops for two reasons:

  1. The LIST_APPEND opcode appends directly in C, bypassing Python attribute lookup
  2. The comprehension body has no per-iteration method lookup or call overhead for append
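Because the comprehension's instructions may live either in the function itself or in a nested code object, a version-independent check has to look in both places. The sketch below walks co_consts recursively; all_opnames and nested_code_names are illustrative helper names.

```python
import dis
import types

def all_opnames(code):
    """Collect opnames from a code object and every code object nested in it."""
    names = {instr.opname for instr in dis.get_instructions(code)}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= all_opnames(const)
    return names

def nested_code_names(func):
    """Names of code objects stored directly in the function's constants."""
    return [c.co_name for c in func.__code__.co_consts
            if isinstance(c, types.CodeType)]

def list_comp(items):
    return [x * 2 for x in items]

def for_loop(items):
    result = []
    for x in items:
        result.append(x * 2)
    return result

# A '<listcomp>' entry here means the comprehension was not inlined.
print("nested code objects:", nested_code_names(list_comp))
print("list_comp uses LIST_APPEND:", "LIST_APPEND" in all_opnames(list_comp.__code__))
print("for_loop uses LIST_APPEND:", "LIST_APPEND" in all_opnames(for_loop.__code__))
```

On an interpreter that inlines comprehensions the nested list is empty, yet LIST_APPEND is still found, which is the part of the comparison that holds across versions.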

f-string vs str() vs .format()

import dis

name = "world"

def use_fstring():
    return f"Hello, {name}"

def use_str():
    return "Hello, " + str(name)

def use_format():
    return "Hello, {}".format(name)

dis.dis(use_fstring)
dis.dis(use_str)
dis.dis(use_format)

F-strings compile to:

  • LOAD_CONST "Hello, " - load the literal part
  • LOAD_GLOBAL name (or LOAD_FAST if local)
  • FORMAT_VALUE - formats the value, equivalent to format(value, spec)
  • BUILD_STRING n - concatenates n string pieces

"Hello, " + str(name) compiles to:

  • LOAD_CONST "Hello, "
  • LOAD_GLOBAL str - load the str type
  • LOAD_GLOBAL name
  • CALL 1 - call str(name)
  • BINARY_OP + - concatenate the two strings

"...".format(name) compiles to:

  • LOAD_CONST "Hello, {}" - load the format string
  • LOAD_ATTR format - attribute lookup on the string
  • LOAD_GLOBAL name
  • CALL 1

F-strings are usually fastest for simple substitutions because FORMAT_VALUE + BUILD_STRING avoids both a name or attribute lookup and a full CALL instruction. For complex cases (multiple conversions, nested expressions), the difference is often negligible; measure if it matters.

:::warning dis Output Varies Across Python Versions - Do Not Hardcode Opcode Numbers
Opcode names and numbers change between Python minor versions. Python 3.11 renamed and reorganised many opcodes. Python 3.12 added new specialised opcodes.

# DO NOT do this - will break on different Python versions:
assert instr.opcode == 90 # hardcoded opcode number

# DO this - use the name:
assert instr.opname == "LOAD_FAST"

# Or use the dis module's own mapping:
import dis
opcode_number = dis.opmap["LOAD_FAST"] # safe - looks up current version's mapping

:::

for Loop vs Generator Expression in sum()

import dis

data = range(10)

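def sum_loop():
    total = 0
    for x in data:
        total += x
    return total

def sum_genexpr():
    return sum(x for x in data)

def sum_direct():
    return sum(data)

sum_direct is the fastest - it passes the iterable directly to the C implementation of sum, so no Python bytecode runs per item. sum_genexpr is not equivalent to it: omitting the extra parentheses around a sole generator-expression argument is only a syntactic convenience, and CPython still creates a generator object whose frame sum resumes once per item. It therefore executes Python bytecode for every element, just as sum_loop does, and is typically much slower than sum_direct. Whether sum_genexpr or sum_loop wins depends on the Python version and the data, so measure rather than assume.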
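Two things are easy to check here: whether a function carries a separate generator code object, and how the three variants compare on your own interpreter. The sketch below does both; has_genexpr is an illustrative helper, the call count passed to timeit is arbitrary, and no particular timings are implied.

```python
import timeit
import types

data = range(10)

def sum_loop():
    total = 0
    for x in data:
        total += x
    return total

def sum_genexpr():
    return sum(x for x in data)

def sum_direct():
    return sum(data)

def has_genexpr(func):
    """True if the function's constants contain a '<genexpr>' code object."""
    return any(
        isinstance(const, types.CodeType) and const.co_name == "<genexpr>"
        for const in func.__code__.co_consts
    )

for func in (sum_loop, sum_genexpr, sum_direct):
    elapsed = timeit.timeit(func, number=50_000)
    print(f"{func.__name__:<12s} result={func()} "
          f"genexpr={has_genexpr(func)} time={elapsed:.3f}s")
```

The genexpr column shows that only sum_genexpr compiles a separate generator body, which is the per-item Python code that sum_direct avoids.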

Part 5 - Practical Uses of dis

Understanding Why Locals Beat Globals in Hot Loops

import dis
import math

def global_access():
    result = 0
    for i in range(1000):
        result += math.sqrt(i)  # LOAD_GLOBAL + LOAD_ATTR on every iteration
    return result

def local_access():
    sqrt = math.sqrt  # one LOAD_GLOBAL + LOAD_ATTR, then STORE_FAST
    result = 0
    for i in range(1000):
        result += sqrt(i)  # LOAD_FAST on every iteration
    return result

# Verify with dis:
dis.dis(global_access)
# Inner loop body includes: LOAD_GLOBAL math, LOAD_ATTR sqrt

dis.dis(local_access)
# Inner loop body includes: LOAD_FAST sqrt (much cheaper)

Spotting Unnecessary Attribute Lookups

import dis

class Processor:
    def __init__(self):
        self.count = 0

    def process_slow(self, items):
        for item in items:
            self.count += 1  # LOAD_FAST self, LOAD_ATTR count - every iteration
            self.count += item

    def process_fast(self, items):
        count = self.count  # one attribute load
        for item in items:
            count += 1  # LOAD_FAST count - no attribute lookup
            count += item
        self.count = count  # one attribute store

# dis shows the difference clearly:
dis.dis(Processor.process_slow) # LOAD_ATTR count appears in the loop
dis.dis(Processor.process_fast) # LOAD_ATTR only appears outside the loop

Using dis to Verify a Refactoring

import dis

# Before: multiple separate attribute lookups
def before(obj):
    obj.x = obj.x + 1
    obj.y = obj.y + 1
    obj.z = obj.z + 1

# After: reading attributes once
def after(obj):
    x, y, z = obj.x, obj.y, obj.z
    obj.x = x + 1
    obj.y = y + 1
    obj.z = z + 1

# Use dis to count LOAD_ATTR and STORE_ATTR instructions:
def count_opname(func, name):
    return sum(1 for i in dis.get_instructions(func) if i.opname == name)

print(f"before LOAD_ATTR: {count_opname(before, 'LOAD_ATTR')}") # 3
print(f"after LOAD_ATTR: {count_opname(after, 'LOAD_ATTR')}") # 3 (same - loads still needed)
print(f"before STORE_ATTR: {count_opname(before, 'STORE_ATTR')}") # 3
print(f"after STORE_ATTR: {count_opname(after, 'STORE_ATTR')}") # 3 (same)
# In this case, dis confirms the refactoring didn't reduce attribute ops -
# profile before optimising

:::danger Reading Bytecode Does Not Mean You Should Optimise at the Bytecode Level
Understanding bytecode is a diagnostic tool, not an optimisation guide. The correct workflow is:

  1. Profile first - use cProfile, line_profiler, or py-spy to find the actual bottleneck
  2. Identify the hot path - where does the program spend its time?
  3. Use dis to understand - confirm your mental model of what the hot path is doing
  4. Optimise at the Python level - use better algorithms, data structures, or libraries (NumPy, etc.)
  5. Only reach for C extensions as a last resort - Cython, cffi, or ctypes

Micro-optimising bytecode for non-hot paths is premature optimisation. A function called once at startup is not worth byte-counting. A function called 10 million times in the inner loop is.
:::
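The first step of that workflow can be sketched with the standard library alone. The example below profiles a made-up workload (slow_part, fast_part, and workload are hypothetical names) with cProfile and prints the five entries with the highest cumulative time; the sizes are arbitrary and the timings will vary by machine.

```python
import cProfile
import io
import pstats

def slow_part(n):
    return sum(i * i for i in range(n))

def fast_part(n):
    return n + 1

def workload():
    slow_part(200_000)
    fast_part(10)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the statistics into a string so they can be printed or inspected.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

Only once a report like this points at a specific function does it make sense to disassemble that function and look at its loop body.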

Part 6 - dis.Bytecode for Programmatic Analysis

For building tools that analyse bytecode, dis.Bytecode provides a clean programmatic interface:

import dis
from collections import Counter

def analyse_function(func):
    """Report opcodes used, sorted by frequency."""
    bc = dis.Bytecode(func)
    opcode_counts = Counter(instr.opname for instr in bc)

    print(f"Function: {func.__name__}")
    print(f"Total instructions: {sum(opcode_counts.values())}")
    print("Opcode frequency:")
    for opname, count in opcode_counts.most_common():
        print(f"  {opname:<25s} {count}")
    print()

def complex_example(data):
    result = []
    for item in data:
        if item > 0:
            result.append(item * 2)
        elif item < 0:
            result.append(-item)
    return result

analyse_function(complex_example)

Finding All Jump Targets

import dis

def find_branches(func):
    """Find all jump instructions (conditional and unconditional) and FOR_ITER."""
    branches = []
    for instr in dis.get_instructions(func):
        if "JUMP" in instr.opname or instr.opname in ("FOR_ITER",):
            branches.append({
                "offset": instr.offset,
                "opname": instr.opname,
                "target": instr.argval,  # resolved target offset
                "line": instr.starts_line,
            })
    return branches

def example(x, items):
    if x > 0:
        for item in items:
            if item:
                return item
    return None

for branch in find_branches(example):
    print(branch)
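Matching on the substring "JUMP" works, but the dis module also publishes which opcodes take a jump argument. The sketch below is an alternative that uses dis.hasjrel and dis.hasjabs instead of name matching; jump_instructions is an illustrative helper name, and the set of opcodes it finds depends on the Python version.

```python
import dis

# Opcodes whose argument is a jump target, as declared by the dis module itself.
JUMP_OPCODES = set(dis.hasjrel) | set(dis.hasjabs)

def jump_instructions(func):
    """Return (offset, opname, target_offset) for every jumping instruction."""
    return [
        (instr.offset, instr.opname, instr.argval)
        for instr in dis.get_instructions(func)
        if instr.opcode in JUMP_OPCODES
    ]

def example(x, items):
    if x > 0:
        for item in items:
            if item:
                return item
    return None

for offset, opname, target in jump_instructions(example):
    print(f"{offset:4d} {opname:<20s} -> {target}")
```

This approach avoids missing a jumping opcode whose name does not contain "JUMP", such as FOR_ITER, without listing such names by hand.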

Key Takeaways

  • dis.dis(func) prints human-readable disassembly; dis.get_instructions(func) returns structured Instruction objects; dis.Bytecode(func) provides an object-oriented interface
  • Each disassembly line shows: source line number (when it changes), bytecode offset, opcode name, argument, and a human-readable comment
  • The >> prefix marks a jump target - an offset that some other instruction jumps to
  • LOAD_FAST (local variable) is an array index operation - significantly faster than LOAD_GLOBAL (dict lookup) or LOAD_ATTR (attribute lookup)
  • The value stack is a LIFO stack; opcodes push, pop, or both; co_stacksize is the statically computed maximum depth
  • x += 1 compiles to BINARY_OP += (in-place attempt); x = x + 1 compiles to BINARY_OP + (new object); for integers (immutable) the result is identical; for lists (mutable) += is faster
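  • List comprehensions build their result with LIST_APPEND (in a hidden nested code object up to Python 3.11, inlined into the enclosing function since 3.12); this is usually faster than explicit for + append because it avoids per-iteration attribute lookup and call overhead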
  • F-strings use FORMAT_VALUE + BUILD_STRING - faster than str() or .format() for simple substitutions
  • Always use instr.opname (string) rather than instr.opcode (number) when writing tools - opcode numbers change between Python versions
  • Profile before optimising - bytecode inspection is a diagnostic tool, not a guide to premature micro-optimisation

Graded Practice Challenges

Level 1 - Predict the Output

Question 1: How many LOAD_FAST instructions does this function contain?

import dis

def process(a, b, c):
    x = a + b
    y = x * c
    return x + y

count = sum(1 for i in dis.get_instructions(process) if i.opname == "LOAD_FAST")
print(count)
Show Answer

Output: 6

Count the reads of local variables, statement by statement:

  • x = a + b: LOAD_FAST a, LOAD_FAST b (2 loads)
  • y = x * c: LOAD_FAST x, LOAD_FAST c (2 loads)
  • return x + y: LOAD_FAST x, LOAD_FAST y (2 loads)

Total: 6 LOAD_FAST instructions.

(This is the count on CPython 3.10-3.12. Other versions can differ; for example, Python 3.13 can fuse two adjacent loads into a single LOAD_FAST_LOAD_FAST instruction, which changes the name-based count.)

Question 2: What is the difference in the dis output between these two functions?

import dis

def f1():
    x = [1, 2, 3]
    return x

def f2():
    return [1, 2, 3]

dis.dis(f1)
dis.dis(f2)
Show Answer

f1 has a STORE_FAST x and then a LOAD_FAST x before RETURN_VALUE - one extra round-trip through the local variable slot. f2 has no STORE_FAST/LOAD_FAST at all - the freshly built list is returned directly.

On Python 3.12, CPython's compiler does not eliminate the redundant store-and-load in f1; it performs only limited peephole optimisation. f2 is therefore two instructions shorter at the bytecode level, although the runtime difference is negligible in practice.

Question 3: What does this print?

import dis

def short_circuit(a, b):
    return a or b

instructions = list(dis.get_instructions(short_circuit))
jump_instrs = [i for i in instructions if "JUMP" in i.opname]
print(len(jump_instrs))
print(jump_instrs[0].opname)
Show Answer

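Output on Python 3.11 and earlier:

1
JUMP_IF_TRUE_OR_POP

Output on Python 3.12:

1
POP_JUMP_IF_TRUE

There is exactly one jump instruction in both cases. Up to Python 3.11, the or operator compiles to JUMP_IF_TRUE_OR_POP: if the left operand is truthy, jump (keeping it on the stack as the result); if falsy, pop it and evaluate the right operand. Python 3.12 removed that opcode and emits the equivalent sequence COPY, POP_JUMP_IF_TRUE, POP_TOP instead, so the only opname containing "JUMP" is POP_JUMP_IF_TRUE.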

Question 4: True or False - a list comprehension and its equivalent for loop always produce the same bytecode?

def list_comp():
    return [x for x in range(5)]

def for_loop():
    result = []
    for x in range(5):
        result.append(x)
    return result
Show Answer

False. They are semantically equivalent but produce different bytecode. Up to Python 3.11, the list comprehension compiles to a nested code object (a hidden <listcomp> function) that is created with MAKE_FUNCTION and then called; since Python 3.12 it is inlined into the enclosing function. Either way it builds the list with LIST_APPEND. The for loop looks up the append method and executes a call on each iteration. They are different instruction sequences, and the comprehension version is generally faster because LIST_APPEND avoids the lookup and the call.

Question 5: What does this print?

import dis

def f(x):
    if x:
        return 1
    return 2

has_two_returns = sum(
    1 for i in dis.get_instructions(f) if i.opname == "RETURN_VALUE"
)
print(has_two_returns)
Show Answer

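Output: 2 on Python 3.11 and earlier; 0 on Python 3.12.

CPython generates a separate return instruction for each return path through the function rather than merging them into a single exit point. On Python 3.11 and earlier, both are RETURN_VALUE instructions, so the count is 2. Python 3.12 added RETURN_CONST, which returns a constant directly; because both statements here return constants, 3.12 emits two RETURN_CONST instructions and no RETURN_VALUE, so a count that matches only the name RETURN_VALUE prints 0. This is another reason not to tie tools to one version's opcode names.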
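A count that tolerates more than one spelling of "return" is more robust across versions. The sketch below counts both RETURN_VALUE and RETURN_CONST; the set RETURN_OPNAMES and the helper count_returns are illustrative, and the set would need extending if a future version introduces another return opcode.

```python
import dis

# Opnames that end a normal function call by returning a value.
RETURN_OPNAMES = {"RETURN_VALUE", "RETURN_CONST"}

def f(x):
    if x:
        return 1
    return 2

def count_returns(func):
    """Count return instructions regardless of which return opcode is used."""
    return sum(
        1 for instr in dis.get_instructions(func)
        if instr.opname in RETURN_OPNAMES
    )

print(count_returns(f))  # 2
```

RETURN_GENERATOR is deliberately left out of the set: it appears at the start of generator functions and does not correspond to a return statement.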

Level 2 - Debug Challenge

A developer uses dis to try to confirm that a performance optimisation worked. Find the flaw in their reasoning:

import dis

# Original version
def process_original(items):
    result = []
    for item in items:
        result.append(item * 2)
    return result

# "Optimised" version - developer claims it avoids attribute lookup
def process_optimised(items):
    result = []
    append = result.append  # cache the bound method
    for item in items:
        append(item * 2)
    return result

# Developer's analysis (counts shown are for Python 3.12; earlier versions
# use LOAD_METHOD for the in-loop call, so the first count differs there):
orig_attrs = sum(1 for i in dis.get_instructions(process_original) if i.opname == "LOAD_ATTR")
opt_attrs = sum(1 for i in dis.get_instructions(process_optimised) if i.opname == "LOAD_ATTR")

print(f"Original LOAD_ATTR count: {orig_attrs}")   # prints 1 (for .append in loop)
print(f"Optimised LOAD_ATTR count: {opt_attrs}")   # prints 1 (for .append in setup)
print("Optimisation saved:", orig_attrs - opt_attrs, "attribute lookups per call")
# prints "Optimisation saved: 0 attribute lookups per call"
# Developer concludes: "no improvement - not worth it"
Show Solution

The flaw: The developer is counting LOAD_ATTR instructions in the function definition, not in the loop body. The dis output shows the bytecode statically - it does not account for how many times each instruction executes at runtime.

In process_original, LOAD_ATTR append is inside the for loop - it executes once per item in items. In process_optimised, LOAD_ATTR append is outside the loop - it executes exactly once regardless of how many items there are.

Correct analysis - count instructions per loop iteration, not per function:

import dis

def instructions_in_loop(func):
    """Roughly identify which instructions are inside a for loop."""
    instrs = list(dis.get_instructions(func))
    # Find FOR_ITER (start of loop) and JUMP_BACKWARD (end of loop)
    for_iter_offsets = [i.offset for i in instrs if i.opname == "FOR_ITER"]
    jump_back_offsets = [i.offset for i in instrs if i.opname == "JUMP_BACKWARD"]

    if not for_iter_offsets or not jump_back_offsets:
        return []

    loop_start = for_iter_offsets[0]
    loop_end = jump_back_offsets[-1]

    return [
        i for i in instrs
        if loop_start < i.offset <= loop_end
    ]

orig_loop = instructions_in_loop(process_original)
opt_loop = instructions_in_loop(process_optimised)

orig_load_attr = sum(1 for i in orig_loop if i.opname == "LOAD_ATTR")
opt_load_attr = sum(1 for i in opt_loop if i.opname == "LOAD_ATTR")

print(f"Original LOAD_ATTR per iteration: {orig_load_attr}") # 1
print(f"Optimised LOAD_ATTR per iteration: {opt_load_attr}") # 0
print(f"Optimisation saves {orig_load_attr - opt_load_attr} LOAD_ATTR per iteration")
# Optimisation saves 1 LOAD_ATTR per iteration - meaningful at large scale

The optimisation is real. For 1 million items, it saves 1 million LOAD_ATTR operations. The original analysis was wrong because it counted total instructions per function call, not per loop iteration.

Level 3 - Design Challenge

Design a BytecodeProfiler class that:

  1. Accepts any Python function
  2. Analyses the bytecode to classify instructions by category (loads, stores, calls, builds, jumps, returns, arithmetic)
  3. Identifies which instructions are inside loop bodies vs outside loops
  4. Produces a cost estimate by weighting instruction categories (e.g., LOAD_ATTR is more expensive than LOAD_FAST)
  5. Reports a human-readable summary with a hotspot warning if expensive instructions are inside loops
# Target usage:
def slow_func(items):
    result = []
    for item in items:
        result.append(item.strip().upper())
    return result

profiler = BytecodeProfiler(slow_func)
profiler.report()
# Function: slow_func
# Total instructions: N
# Instructions in loop body: M
# Estimated cost per iteration: K units
# HOTSPOT WARNING: LOAD_ATTR found inside loop body (3 occurrences)
# Consider caching: str.strip, str.upper, list.append
Show Reference Solution
import dis
from collections import Counter, defaultdict


# Relative cost weights per opcode (rough heuristics, not measurements)
INSTRUCTION_COSTS = {
    "LOAD_FAST": 1,
    "STORE_FAST": 1,
    "LOAD_CONST": 1,
    "LOAD_GLOBAL": 3,
    "STORE_GLOBAL": 3,
    "LOAD_ATTR": 5,
    "STORE_ATTR": 5,
    "LOAD_DEREF": 2,
    "STORE_DEREF": 2,
    "CALL": 10,
    "BINARY_OP": 2,
    "COMPARE_OP": 2,
    "BUILD_LIST": 2,
    "BUILD_TUPLE": 2,
    "BUILD_MAP": 3,
    "FOR_ITER": 3,
    "GET_ITER": 2,
    "JUMP_BACKWARD": 1,
    "POP_JUMP_IF_FALSE": 1,
    "POP_JUMP_IF_TRUE": 1,
    "RETURN_VALUE": 1,
}

EXPENSIVE_IN_LOOP = {"LOAD_ATTR", "STORE_ATTR", "LOAD_GLOBAL", "CALL"}


class BytecodeProfiler:
    def __init__(self, func):
        self._func = func
        self._instructions = list(dis.get_instructions(func))

    def _find_loop_range(self):
        """Return (start_offset, end_offset) for the first for loop, or None."""
        for_iter = next(
            (i for i in self._instructions if i.opname == "FOR_ITER"), None
        )
        jump_back = next(
            (i for i in reversed(self._instructions) if i.opname == "JUMP_BACKWARD"),
            None,
        )
        if for_iter and jump_back:
            return (for_iter.offset, jump_back.offset)
        return None

    def _classify(self, instr):
        name = instr.opname
        if name.startswith("LOAD"):
            return "load"
        if name.startswith("STORE"):
            return "store"
        if name in ("CALL", "CALL_FUNCTION", "CALL_FUNCTION_KW"):
            return "call"
        if name.startswith("BUILD"):
            return "build"
        if "JUMP" in name or name in ("FOR_ITER", "GET_ITER"):
            return "control"
        if name in ("BINARY_OP", "COMPARE_OP", "UNARY_NEGATIVE", "UNARY_NOT"):
            return "arithmetic"
        if name == "RETURN_VALUE":
            return "return"
        return "other"

    def _cost(self, instr):
        # Opcodes without an explicit weight get a default cost of 2
        return INSTRUCTION_COSTS.get(instr.opname, 2)

    def report(self):
        loop_range = self._find_loop_range()
        loop_instrs = []
        outer_instrs = []

        for instr in self._instructions:
            if loop_range and loop_range[0] < instr.offset <= loop_range[1]:
                loop_instrs.append(instr)
            else:
                outer_instrs.append(instr)

        total = len(self._instructions)
        loop_count = len(loop_instrs)
        loop_cost = sum(self._cost(i) for i in loop_instrs)

        categories = defaultdict(int)
        for instr in self._instructions:
            categories[self._classify(instr)] += 1

        print(f"Function: {self._func.__name__}")
        print(f"Total instructions: {total}")
        print(f"Instructions in loop body: {loop_count}")
        print(f"Estimated cost per loop iteration: {loop_cost} units")
        print()
        print("Instruction categories:")
        for cat, count in sorted(categories.items(), key=lambda x: -x[1]):
            print(f"  {cat:<12s} {count}")

        # Hotspot warnings
        if loop_instrs:
            print()
            hotspots = [
                i for i in loop_instrs if i.opname in EXPENSIVE_IN_LOOP
            ]
            if hotspots:
                hotspot_counts = Counter(i.opname for i in hotspots)
                print("HOTSPOT WARNING: Expensive operations inside loop body:")
                for opname, count in hotspot_counts.most_common():
                    attrs = [
                        i.argrepr for i in hotspots
                        if i.opname == opname
                    ]
                    # sorted() keeps the report stable between runs
                    print(f"  {opname} ({count} occurrences): {', '.join(sorted(set(attrs)))}")
                print("  Consider caching attribute lookups and globals outside the loop.")
            else:
                print("No hotspot warnings - loop body looks clean.")


# Demo:
def slow_func(items):
    result = []
    for item in items:
        result.append(item.strip().upper())
    return result

profiler = BytecodeProfiler(slow_func)
profiler.report()

Key design decisions:

  • _find_loop_range identifies the loop body by finding FOR_ITER (loop header) and JUMP_BACKWARD (loop tail) - instructions between these offsets are inside the loop
  • The cost weights in INSTRUCTION_COSTS are rough relative heuristics, not profiled measurements - the tool is for directional guidance, not precise benchmarking
  • EXPENSIVE_IN_LOOP flags LOAD_ATTR, STORE_ATTR, LOAD_GLOBAL, and CALL as worth investigating when found in loop bodies
  • The tool complements - it does not replace - a real profiler like cProfile or py-spy

What's Next

Lesson 04 covers The GIL Explained - what CPython's Global Interpreter Lock actually is, why it exists, what it protects, how it interacts with threads and I/O, when it matters in practice, and what Python 3.12+ is doing to weaken it. You have now seen the eval loop at the bytecode level. The GIL is what controls which thread gets to run that eval loop at any given moment.

© 2026 EngineersOfAI. All rights reserved.