
List Comprehensions - Pythonic Iteration and Performance

Reading time: ~20 minutes | Level: Foundation → Engineering

Here is a question that exposes a common misconception about list comprehensions.

# Python 3 - variable scope inside comprehension
x = "outer"
result = [x for x in range(5)]
print(x) # What prints here?

If you expected 4, you have the Python 2 mental model. In Python 3, comprehensions have their own scope.

The actual output:

outer

The x inside the comprehension does not leak into the enclosing scope. This was a notorious Python 2 bug - fixed in Python 3 by giving comprehensions their own isolated scope.

Understanding list comprehensions at the bytecode level reveals why they are not just syntactic sugar, and why they perform differently from equivalent for loops.

What You Will Learn

  • How list comprehensions compile to a different bytecode than for loops - the LIST_APPEND opcode vs a per-iteration method call
  • Performance benchmarks: list comprehensions vs for loops vs map() - with real timeit numbers
  • How nested comprehensions flatten matrix structures: [x for row in matrix for x in row]
  • Conditional filtering syntax and when to use if-else vs trailing if
  • Generator expressions vs list comprehensions: lazy vs eager evaluation, and when each is right
  • When NOT to use comprehensions - the readability threshold rule
  • Dict comprehensions {k: v for k, v in items} and set comprehensions {x for x in data}
  • Variable scoping in comprehensions and how Python 3 fixed the Python 2 scope leak
  • Memory comparison: sys.getsizeof on list comprehension vs generator vs loop result
  • Real-world patterns: data transformation pipelines, filtering, flattening

Prerequisites

  • Python variables and name binding
  • Basic Python for loop syntax
  • Familiarity with functions and conditionals
  • Optional: basic understanding of Big-O notation

Part 1 - What a List Comprehension Really Is

The Beginner Mental Model

Most beginners see a list comprehension as "a shorter for loop":

# "Equivalent" forms - are they really?
result_loop = []
for x in range(10):
    result_loop.append(x * x)

result_comp = [x * x for x in range(10)]

They produce the same output - but they are not the same at the bytecode level.

What CPython Actually Does

Python compiles both forms to bytecode. Let's inspect what the bytecode looks like:

import dis

# Disassemble the for loop version
def using_loop():
    result = []
    for x in range(10):
        result.append(x * x)
    return result

# Disassemble the comprehension version
def using_comprehension():
    return [x * x for x in range(10)]

print("=== FOR LOOP ===")
dis.dis(using_loop)

print("\n=== LIST COMPREHENSION ===")
dis.dis(using_comprehension)

The for loop version produces bytecode that:

  1. Looks up result.append on every iteration (attribute lookup)
  2. Calls the bound method on every iteration (CALL_METHOD in Python 3.7-3.10, CALL in 3.11+)
  3. Discards the return value (which is None)

The list comprehension version produces bytecode that uses LIST_APPEND - a dedicated CPython opcode that directly appends to the list being built, without the attribute lookup or method call overhead.

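FOR LOOP - per iteration cost (opcode names vary by CPython version):
LOAD_FAST (load 'result')
LOAD_ATTR (look up 'append'; LOAD_METHOD before Python 3.12)
LOAD_FAST (load 'x')
LOAD_FAST (load 'x' again)
BINARY_OP (multiply; BINARY_MULTIPLY before Python 3.11)
CALL (invoke append; CALL_METHOD in Python 3.7-3.10)
POP_TOP (discard None return)

LIST COMPREHENSION - per iteration cost:
LOAD_FAST (load 'x')
LOAD_FAST (load 'x' again)
BINARY_OP (multiply)
LIST_APPEND (direct C-level append - no attribute lookup, no method call)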

This is why comprehensions are consistently faster than equivalent for loops - not dramatically, but measurably.
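Rather than reading the disassembly by eye, you can check for the opcode programmatically. The following is a minimal sketch, assuming CPython. The helper name opnames is hypothetical, and it walks nested code objects because versions before 3.12 compile the comprehension body as a separate code object:

```python
import dis
import types


def opnames(code):
    """Collect opcode names from a code object and any code objects nested in it."""
    names = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= opnames(const)
    return names


def using_loop():
    result = []
    for x in range(10):
        result.append(x * x)
    return result


def using_comprehension():
    return [x * x for x in range(10)]


loop_ops = opnames(using_loop.__code__)
comp_ops = opnames(using_comprehension.__code__)

print("LIST_APPEND" in comp_ops)  # True on CPython
print("LIST_APPEND" in loop_ops)  # False: the loop goes through a method call
```

The other opcode names differ between releases, so comparing sets like this is more portable than matching a full listing.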

Part 2 - Performance: Benchmarks with timeit

Let's measure this with real numbers instead of intuition.

import timeit

N = 1_000_000

# Method 1: For loop
def loop_version():
    result = []
    for x in range(N):
        result.append(x * x)
    return result

# Method 2: List comprehension
def comp_version():
    return [x * x for x in range(N)]

# Method 3: map() with lambda
def map_lambda_version():
    return list(map(lambda x: x * x, range(N)))

# Method 4: map() with built-in (no lambda overhead)
def map_builtin_version():
    return list(map(pow, range(N), [2] * N))

# Benchmark each - 5 runs, report the average time per run
runs = 5
t_loop = timeit.timeit(loop_version, number=runs) / runs
t_comp = timeit.timeit(comp_version, number=runs) / runs
t_map_lambda = timeit.timeit(map_lambda_version, number=runs) / runs

print(f"For loop: {t_loop:.4f}s")
print(f"List comprehension:{t_comp:.4f}s")
print(f"map(lambda): {t_map_lambda:.4f}s")

Typical output (Python 3.11, M2 Mac):

For loop: 0.0891s
List comprehension:0.0612s
map(lambda): 0.0784s

Key findings:

  • List comprehension is approximately 25-35% faster than an equivalent for loop
  • map() with a lambda is often slower than list comprehension because the lambda call adds overhead
  • map() with a built-in function (like str, abs, int) can be faster than comprehensions because there is no Python-level function call at all

# map() with a built-in - potentially fastest for simple transforms
words = ["hello", "WORLD", "Python"]
lower = list(map(str.lower, words))
# No lambda, no Python function call overhead - str.lower is called at C level

:::tip When map() Wins
map(str.lower, words) often outperforms [w.lower() for w in words] because str.lower is a C-level method called directly without a Python stack frame. For simple single-method transforms, map with a built-in method is often the fastest option, but measure on your own interpreter and data.
:::
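To check the map() claim on your own machine, a small harness along these lines may help. It is a sketch, not a rigorous benchmark: the word list size and repeat counts are arbitrary assumptions, and it takes the minimum over several rounds since that is usually the least noisy estimate:

```python
import timeit

words = ["hello", "WORLD", "Python"] * 1_000


def with_map():
    return list(map(str.lower, words))


def with_comprehension():
    return [w.lower() for w in words]


# Both forms must agree before timing means anything
assert with_map() == with_comprehension()

# repeat() returns one total per round; min() is the least noisy estimate
t_map = min(timeit.repeat(with_map, number=200, repeat=5))
t_comp = min(timeit.repeat(with_comprehension, number=200, repeat=5))

print(f"map(str.lower): {t_map:.4f}s")
print(f"comprehension:  {t_comp:.4f}s")
```

The ratio between the two numbers depends on the Python version and hardware, so treat any single run as indicative only.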

Part 3 - Nested Comprehensions: Flattening and Building Matrices

Building a Matrix (Outer Comprehension = Rows)

# Build a 3x4 matrix
matrix = [[row * col for col in range(4)] for row in range(3)]
print(matrix)
# [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 4, 6]]

In this nested form, the inner comprehension is the expression of the outer one, so the outer for clause is written last but runs first:

[[expression for inner_var in inner_iter] for outer_var in outer_iter]
 ↑ inner comprehension: builds one row    ↑ outer loop (runs first)

Flattening a Matrix (Left to Right = Row-Major Order)

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Flatten - left to right means: outer for first, then inner for
flat = [x for row in matrix for x in row]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8, 9]

The reading order matches the equivalent nested loop:

# Equivalent nested loop - read in same order
flat = []
for row in matrix:        # "for row in matrix" - appears first in comp
    for x in row:         # "for x in row" - appears second in comp
        flat.append(x)    # "x" - the expression, appears at the start

:::warning Nested Comprehension Reading Order
The expression comes first, then the for clauses in order. This trips up many engineers: [x for row in matrix for x in row] - read it as "give me x, iterate over matrix to get each row, iterate over that row to get x". The outer loop comes first in the for-clause sequence.
:::
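A quick way to convince yourself of the clause order is a case where the second for clause depends on the variable bound by the first. This only works if the first clause is the outer loop. A small illustrative sketch (the names pairs and pairs_loop are made up for this example):

```python
# The second for clause uses i, which the first for clause binds
pairs = [(i, j) for i in range(4) for j in range(i)]

# Equivalent nested loop, clauses in the same order
pairs_loop = []
for i in range(4):
    for j in range(i):
        pairs_loop.append((i, j))

print(pairs)  # [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
```

Swapping the two for clauses in the comprehension raises a NameError when i is not already defined, because range(i) is then evaluated before i is bound.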

Flattening Irregular Nested Lists

# Real-world: flatten a list of lists of varying length
data = [[1, 2], [3], [4, 5, 6], [7, 8]]
flat = [item for sublist in data for item in sublist]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8]

# Equivalent with itertools - flattens exactly one level, often more readable
from itertools import chain
flat = list(chain.from_iterable(data))
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8]
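Both forms above flatten exactly one level. For lists nested to arbitrary depth, a comprehension no longer fits and a small recursive generator is one option. A sketch, where flatten_deep is a hypothetical helper that treats only list instances as containers:

```python
from itertools import chain


def flatten_deep(items):
    """Yield leaf values from lists nested to any depth (hypothetical helper)."""
    for item in items:
        if isinstance(item, list):
            yield from flatten_deep(item)
        else:
            yield item


nested = [1, [2, [3, 4]], [[5], 6]]
print(list(flatten_deep(nested)))  # [1, 2, 3, 4, 5, 6]

# chain.from_iterable stops after one level: inner lists survive
one_level = list(chain.from_iterable([[1, [2]], [3]]))
print(one_level)  # [1, [2], 3]
```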

Part 4 - Conditional Filtering

Trailing if (Filtering)

The trailing if filters which items are included. No else clause.

data = range(-5, 6)

# Keep only positives
positives = [x for x in data if x > 0]
print(positives) # [1, 2, 3, 4, 5]

# Multiple conditions - both must be true
divisible_by_2_and_3 = [x for x in range(30) if x % 2 == 0 if x % 3 == 0]
print(divisible_by_2_and_3) # [0, 6, 12, 18, 24]
# Note: multiple if clauses are AND - same as: if x % 2 == 0 and x % 3 == 0

Conditional Expression in Output (Ternary)

The ternary form value_if_true if condition else value_if_false appears in the expression position (at the start), not as a filter:

# Transform: positive stays, negative becomes 0
clamped = [x if x > 0 else 0 for x in range(-3, 4)]
print(clamped) # [0, 0, 0, 0, 1, 2, 3]

# Both forms combined: filter AND transform
# Keep only non-zero, then square them
result = [x * x for x in range(-5, 6) if x != 0]
print(result) # [25, 16, 9, 4, 1, 1, 4, 9, 16, 25]

Comprehension Anatomy:

[ expression   for var in iterable   if condition ]
  ↑ what to    ↑ loop var, source    ↑ optional filter
    produce                            (no else allowed here)

vs ternary in expression:

[ a if cond else b   for var in iterable ]
  ↑ conditional value  ↑ loop (no trailing if filter)

:::note Filter vs Transform
[x for x in data if condition] - trailing if FILTERS (removes items). [x if condition else y for x in data] - ternary TRANSFORMS (changes value but keeps all items). These are fundamentally different. Do not confuse them.
:::
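The two forms can also appear in the same comprehension: the trailing if decides which items survive, and the ternary decides what each survivor becomes. A short sketch with made-up labels:

```python
data = range(-3, 4)  # -3, -2, -1, 0, 1, 2, 3

# Trailing if only: the result can be shorter than the input
filtered = [x for x in data if x > 0]

# Ternary only: the result has the same length as the input
transformed = [x if x > 0 else 0 for x in data]

# Both: drop zero first, then label whatever remains
labels = ["pos" if x > 0 else "neg" for x in data if x != 0]

print(len(filtered), len(transformed), len(labels))  # 3 7 6
print(labels)  # ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
```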

Part 5 - Generator Expressions vs List Comprehensions

The Core Difference: Eager vs Lazy

import sys

# List comprehension - EAGER: builds entire list in memory NOW
list_comp = [x * x for x in range(1_000_000)]

# Generator expression - LAZY: builds nothing now, yields on demand
gen_expr = (x * x for x in range(1_000_000)) # Note: parentheses, not brackets

print(f"List comp size: {sys.getsizeof(list_comp):,} bytes")
print(f"Generator size: {sys.getsizeof(gen_expr):,} bytes")

Output:

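List comp size: 8,448,728 bytes (~8 MB of pointers for 1M items)
Generator size: 200 bytes (just the generator object itself)

The generator object stays at roughly 200 bytes (the exact figure depends on the Python version) regardless of how large the range is. The list uses memory proportional to n, and sys.getsizeof counts only its pointer array, not the integer objects it references.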
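The "regardless of size" claim is easy to verify without relying on a specific byte count. A minimal sketch, assuming CPython, that compares generator objects over a tiny and a huge range:

```python
import sys

small_gen = (x * x for x in range(10))
large_gen = (x * x for x in range(10_000_000))  # nothing is computed yet

small_list = [x * x for x in range(10)]
large_list = [x * x for x in range(100_000)]

# Generator objects are the same size whatever they will eventually yield
print(sys.getsizeof(small_gen) == sys.getsizeof(large_gen))  # True

# Lists grow with the number of elements they hold
print(sys.getsizeof(small_list) < sys.getsizeof(large_list))  # True
```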

When to Use Each

# Use LIST COMPREHENSION when:
# 1. You need to iterate multiple times
# 2. You need random access (indexing)
# 3. You need len()
# 4. You need to pass to a function expecting a list

names = ["alice", "bob", "charlie"]
upper_names = [n.upper() for n in names] # Will be iterated multiple times
print(upper_names[0]) # Random access - need a list
print(len(upper_names)) # Need len() - generators don't have len()

# Use GENERATOR EXPRESSION when:
# 1. Single-pass iteration (sum, max, min, any, all)
# 2. Very large datasets (avoids memory spike)
# 3. Chaining transformations in a pipeline
# 4. Passing to functions that accept iterables

data = range(10_000_000)

# Generator is perfect here - sum iterates once, discards each value
total = sum(x * x for x in data) # No intermediate list created

# max, min, any, all - all work with generators
has_large = any(x > 9_999_990 for x in data) # Stops at the first True (here, x = 9_999_991)

:::tip Short-Circuiting with any() and all()
any(gen) stops as soon as it finds a True value. all(gen) stops at the first False. When using generators, this means the full sequence may never be exhausted. A list comprehension would evaluate everything first - wasting work.
:::
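To make the saved work visible, you can count how many elements are actually tested. The sketch below uses a hypothetical helper, count_checks, that records every call to the predicate:

```python
def count_checks(eager):
    """Return how many elements were tested before any() produced its answer."""
    seen = []

    def is_big(x):
        seen.append(x)  # record every value the predicate is asked about
        return x > 3

    if eager:
        any([is_big(x) for x in range(10)])  # list is fully built first
    else:
        any(is_big(x) for x in range(10))    # stops at the first True
    return len(seen)


print(count_checks(eager=False))  # 5  (tested 0, 1, 2, 3, 4)
print(count_checks(eager=True))   # 10 (tested every element)
```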

Generator Expressions in Pipelines

# Multi-stage pipeline - each stage is a generator
# No stage materializes until the final consumer

raw_data = range(1_000_000)

# Stage 1: Keep multiples of 7 (in real code: read from file or DB)
stage1 = (x for x in raw_data if x % 7 == 0)

# Stage 2: Transform
stage2 = (x * x for x in stage1)

# Stage 3: Additional filter
stage3 = (x for x in stage2 if x < 10_000)

# Only NOW does data flow through the pipeline
result = list(stage3) # Pulls data through all three stages

# Intermediate memory at any moment: O(1) - just the current item being processed
# (only the final list(stage3) result is materialized)
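Because each stage only produces values on demand, the same pipeline shape works even on an unbounded source, as long as the consumer stops asking. A sketch using itertools.count and itertools.islice (the function name is made up for illustration):

```python
from itertools import count, islice


def first_squares_of_multiples(n):
    """Return the first n squares of multiples of 7, computed lazily."""
    numbers = count()                               # 0, 1, 2, ... never ends
    multiples = (x for x in numbers if x % 7 == 0)  # lazy filter
    squares = (x * x for x in multiples)            # lazy transform
    return list(islice(squares, n))                 # pull exactly n items


print(first_squares_of_multiples(5))  # [0, 49, 196, 441, 784]
```

Replacing any stage with a list comprehension would hang here, since it would try to consume the infinite source.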

Part 6 - Dict Comprehensions and Set Comprehensions

Dict Comprehensions

# Basic: {key_expr: value_expr for var in iterable}
squares_dict = {x: x * x for x in range(6)}
print(squares_dict)
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Invert a dictionary (swap keys and values)
original = {"a": 1, "b": 2, "c": 3}
inverted = {v: k for k, v in original.items()}
print(inverted) # {1: 'a', 2: 'b', 3: 'c'}

# Filter while building
users = [
    {"name": "Alice", "active": True, "score": 85},
    {"name": "Bob", "active": False, "score": 92},
    {"name": "Carol", "active": True, "score": 78},
]
active_scores = {u["name"]: u["score"] for u in users if u["active"]}
print(active_scores) # {'Alice': 85, 'Carol': 78}

# Normalize keys: lowercase all keys from an external API response
api_response = {"UserID": 42, "UserName": "alice", "Score": 99}
normalized = {k.lower(): v for k, v in api_response.items()}
print(normalized) # {'userid': 42, 'username': 'alice', 'score': 99}
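One caveat with the inversion pattern: it silently drops entries when values are not unique, because later keys overwrite earlier ones. A short sketch of the problem and one possible grouping alternative (the sample data is made up):

```python
original = {"a": 1, "b": 1, "c": 2}

# Plain inversion: "a" is lost because "b" maps to the same value
inverted = {v: k for k, v in original.items()}
print(inverted)  # {1: 'b', 2: 'c'}

# Grouping keeps every key by collecting them in a list per value
grouped = {}
for key, value in original.items():
    grouped.setdefault(value, []).append(key)
print(grouped)  # {1: ['a', 'b'], 2: ['c']}
```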

Set Comprehensions

# Basic: {expr for var in iterable} - note: curly braces, no colon
unique_lengths = {len(word) for word in ["cat", "dog", "elephant", "ant", "bear"]}
print(unique_lengths) # {3, 8, 4} (order not guaranteed)

# Extract unique domains from email list
emails = [
]
domains = {email.split("@")[1] for email in emails}
print(domains) # {'gmail.com', 'yahoo.com', 'outlook.com'}

# Set comprehension for deduplication with transform
numbers = [1, -2, 3, -3, 2, -1, 4, -4]
unique_abs = {abs(n) for n in numbers}
print(unique_abs) # {1, 2, 3, 4}

:::note Empty Dict vs Empty Set
{} creates an empty dict, not an empty set. Use set() to create an empty set. But {x for x in items} (with an expression) correctly creates a set comprehension.
:::
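The distinction in the note above can be checked directly:

```python
empty_braces = {}
empty_set = set()
set_comp = {x for x in []}        # a set comprehension over no items
dict_comp = {x: x for x in []}    # a dict comprehension over no items

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set
print(type(set_comp).__name__)      # set
print(type(dict_comp).__name__)     # dict
```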

Part 7 - Variable Scoping in Comprehensions

Python 3 Fixed the Python 2 Scope Leak

In Python 2, the loop variable in a list comprehension leaked into the enclosing scope:

# Python 2 behavior (DO NOT RELY ON THIS):
# x = "outer"
# result = [x for x in range(5)]
# print(x) # Would print: 4 (leaked!)

# Python 3 behavior (correct):
x = "outer"
result = [x for x in range(5)]
print(x) # "outer" - comprehension variable does NOT leak

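Comprehensions in Python 3 create their own scope. Before Python 3.12, CPython implemented this as a nested function in bytecode; since 3.12 (PEP 709) list, dict, and set comprehensions are inlined into the enclosing function, but the loop variable is still isolated from the enclosing scope.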

# Proof: the outer x is unchanged
x = 100
squares = [x * x for x in range(5)] # x is local to comprehension
print(x) # 100 - unchanged
print(squares) # [0, 1, 4, 9, 16]

# But the comprehension CAN close over outer variables
base = 10
result = [base + x for x in range(5)] # 'base' is captured from outer scope
print(result) # [10, 11, 12, 13, 14]

:::warning Generator Expressions and Scope
Generator expressions also have their own scope. The iterable of the outermost for clause is evaluated immediately in the enclosing scope, but everything else (nested fors, conditions, expressions) is evaluated lazily inside the generator's scope.
:::

# The outermost iterable is evaluated immediately
data = [1, 2, 3]
gen = (x * 2 for x in data) # 'data' is captured NOW
data = [10, 20, 30] # Rebinding data has no effect
print(list(gen)) # [2, 4, 6] - used original data
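The other half of that rule, lazy evaluation of everything except the outermost iterable, can surprise in the opposite direction. In this sketch (function and variable names are made up), the filter condition sees the value the variable has when the generator is consumed, not when it was created:

```python
def lazy_condition_demo():
    threshold = 2
    gen = (x for x in [1, 2, 3, 4] if x > threshold)  # condition not run yet
    threshold = 3                                     # rebind before consuming
    return list(gen)                                  # condition now sees 3


print(lazy_condition_demo())  # [4]
```

If the intent is to freeze the condition at creation time, materializing with a list comprehension at that point is the simplest fix.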

Part 8 - When NOT to Use Comprehensions

The Readability Threshold

Comprehensions express "produce a collection by transforming/filtering an iterable." When the transformation logic is complex, a comprehension becomes harder to read than a loop.

# GOOD: Simple, reads like English
evens = [x for x in data if x % 2 == 0]
names = [user.name for user in users]

# ACCEPTABLE: Slight complexity but still clear
active_emails = [u.email.lower() for u in users if u.is_active and u.email]

# BAD: Complexity has exceeded the readability threshold
result = [
    process(item)
    for sublist in nested_data
    for item in sublist
    if item.status == "valid"
    if item.score > threshold
    if not item.is_excluded
]

For that last example, a loop with meaningful variable names is clearer:

# Better as a loop - the complexity is exposed and manageable
result = []
for sublist in nested_data:
    for item in sublist:
        if item.status != "valid":
            continue
        if item.score <= threshold:
            continue
        if item.is_excluded:
            continue
        result.append(process(item))
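A middle ground between the two versions is to move the conditions into a named predicate, which keeps the comprehension short while giving the rule a name. This is a sketch under assumptions: the Item dataclass and the is_eligible helper are hypothetical stand-ins for whatever the real records look like, and item.score stands in for process(item):

```python
from dataclasses import dataclass


@dataclass
class Item:  # hypothetical record type for illustration
    status: str
    score: int
    is_excluded: bool = False


def is_eligible(item, threshold):
    """All three filter rules in one named place."""
    return item.status == "valid" and item.score > threshold and not item.is_excluded


def eligible_scores(nested_data, threshold):
    return [
        item.score
        for sublist in nested_data
        for item in sublist
        if is_eligible(item, threshold)
    ]


nested_data = [
    [Item("valid", 90), Item("invalid", 95)],
    [Item("valid", 40), Item("valid", 80, is_excluded=True), Item("valid", 70)],
]
print(eligible_scores(nested_data, threshold=50))  # [90, 70]
```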

Never Use Comprehensions for Side Effects

# WRONG: Using comprehension for side effects
[print(x) for x in range(5)] # Creates [None, None, None, None, None]
[db.save(record) for record in records] # Wasteful list creation

# RIGHT: Use a loop when the purpose is side effects
for x in range(5):
    print(x)

for record in records:
    db.save(record)

:::danger Side Effects in Comprehensions
Using a list comprehension solely for its side effects (print, save, mutate) creates a throwaway list of None values. This wastes memory and communicates the wrong intent. Comprehensions are for building collections. Loops are for executing sequences of actions.
:::

Part 9 - Memory Deep Dive: sys.getsizeof Comparison

import sys

N = 100_000

# Method 1: List comprehension
list_comp = [x * x for x in range(N)]
size_list = sys.getsizeof(list_comp)
# sys.getsizeof counts only the list's pointer array, not the int objects it references

# Method 2: Generator (does NOT build anything)
gen_expr = (x * x for x in range(N))
size_gen = sys.getsizeof(gen_expr)

# Method 3: Traditional loop result
loop_result = []
for x in range(N):
    loop_result.append(x * x)
size_loop = sys.getsizeof(loop_result)

print(f"List comprehension: {size_list:,} bytes")
print(f"Generator: {size_gen:,} bytes")
print(f"Loop result: {size_loop:,} bytes")

Output:

List comprehension: 800,984 bytes
Generator: 200 bytes
Loop result: 800,984 bytes

Key observations:

  • List comprehension and loop result use the same memory - they both produce a list
  • Generator uses constant memory (200 bytes) regardless of N
  • The list stores pointers (8 bytes each), so N pointers = 8N bytes + overhead

| Structure | Memory (N=100) | Notes |
| --- | --- | --- |
| List comprehension result | ~856+ bytes | 56 B struct + 100 pointers × 8 B + integer objects on heap |
| Generator expression | ~200 bytes | Frame + state only - constant regardless of N |
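Since sys.getsizeof reports only the list's own pointer array, a rough estimate of the total footprint has to add the element objects as well. The helper below is a hypothetical sketch; it over-counts if the same object appears more than once, and it does not recurse into nested containers:

```python
import sys


def shallow_and_deep_size(values):
    """Return (list-only size, list plus each element's own size) in bytes."""
    shallow = sys.getsizeof(values)
    deep = shallow + sum(sys.getsizeof(v) for v in values)
    return shallow, deep


# Squares of 1000..1999 are all distinct ints well above the small-int cache
squares = [x * x for x in range(1_000, 2_000)]
shallow, deep = shallow_and_deep_size(squares)

print(f"list only:       {shallow:,} bytes")
print(f"list + elements: {deep:,} bytes")
```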

Part 10 - Real-World Data Transformation Patterns

Pattern 1: ETL Pipeline (Extract-Transform-Load)

# Raw API response data
raw_records = [
    {"user_id": "U001", "name": " Alice ", "score": "85", "active": "true"},
    {"user_id": "U002", "name": " Bob ", "score": "72", "active": "false"},
    {"user_id": "U003", "name": " Carol ", "score": "91", "active": "true"},
    {"user_id": "U004", "name": " Dave ", "score": "68", "active": "true"},
]

# Transform: clean, type-cast, and filter in one comprehension
clean_active_users = [
    {
        "user_id": r["user_id"],
        "name": r["name"].strip(),
        "score": int(r["score"]),
    }
    for r in raw_records
    if r["active"] == "true"
]

print(clean_active_users)
# [
# {'user_id': 'U001', 'name': 'Alice', 'score': 85},
# {'user_id': 'U003', 'name': 'Carol', 'score': 91},
# {'user_id': 'U004', 'name': 'Dave', 'score': 68}
# ]

Pattern 2: Building an Index (Dict Comprehension)

# Build a lookup index: id → record
users = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]

user_index = {u["id"]: u for u in users}

# O(1) lookup now instead of O(n) scan
print(user_index[2]) # {'id': 2, 'name': 'Bob'}

Pattern 3: Frequency Count (Dict Comprehension + Counter)

from collections import Counter

words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Using dict comprehension with Counter
freq = {word: count for word, count in Counter(words).items() if count > 1}
print(freq) # {'apple': 3, 'banana': 2}

# Or just keep top N
top_2 = {word: count for word, count in Counter(words).most_common(2)}
print(top_2) # {'apple': 3, 'banana': 2}

Pattern 4: Flattening Nested API Responses

# API returns nested: departments → employees
departments = [
    {"dept": "Engineering", "employees": ["Alice", "Bob", "Carol"]},
    {"dept": "Sales", "employees": ["Dave", "Eve"]},
    {"dept": "HR", "employees": ["Frank"]},
]

# Flatten: all employee names
all_employees = [emp for dept in departments for emp in dept["employees"]]
print(all_employees)
# ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank']

# With metadata: keep dept association
employee_dept = [
    (emp, dept["dept"])
    for dept in departments
    for emp in dept["employees"]
]
print(employee_dept)
# [('Alice', 'Engineering'), ('Bob', 'Engineering'), ...]

Interview Questions

Q1: Why are list comprehensions faster than equivalent for loops in CPython?

Answer: List comprehensions use the LIST_APPEND bytecode opcode, which is a direct C-level list append that avoids the overhead of attribute lookup (result.append), method binding, and the Python function call machinery (CALL_METHOD in Python 3.7-3.10, CALL in 3.11+). A for loop using result.append() must perform these steps on every iteration. Benchmarks typically show comprehensions are 25-35% faster than equivalent loops, though the exact gap depends on the Python version and the work done per element. map() with a built-in (non-lambda) function can sometimes outperform both.

Q2: What is the difference between [x for x in data if x > 0] and [x if x > 0 else 0 for x in data]?

Answer: The first is a filter - items where x > 0 is False are excluded from the output list. The result may be shorter than the input. The second is a conditional expression (ternary) in the output position - every item from data produces an output, but the value is either x (if x > 0) or 0 (otherwise). The result has the same length as the input. The trailing if filters; the ternary if-else in the expression position transforms.

Q3: When should you use a generator expression instead of a list comprehension?

Answer: Use a generator expression when: (1) you only need to iterate through the results once - e.g., sum(x*x for x in data), (2) the dataset is large and you want to avoid materializing all results in memory, (3) you are building a lazy pipeline and want data to flow through stages on demand. Use a list comprehension when: you need to iterate multiple times, access by index, call len(), or pass to a function that requires an actual list. Generators use O(1) memory; list comprehensions use O(n) memory.

Q4: How does variable scoping inside comprehensions work in Python 3?

Answer: In Python 3, comprehensions have their own scope - the loop variable does not leak into the surrounding scope. Before Python 3.12, CPython implemented this by compiling the comprehension as a nested function that is immediately invoked; since 3.12 (PEP 709) list, dict, and set comprehensions are inlined, but the loop variable stays isolated. The comprehension CAN read (close over) variables from the enclosing scope - it just cannot rebind them through the loop variable. Python 2 list comprehensions lacked this isolation, so the loop variable leaked out: x = "outer"; [x for x in range(5)]; print(x) printed 4 in Python 2 and "outer" in Python 3.

Q5: What is the performance difference between map(str.lower, words) and [w.lower() for w in words]?

Answer: map(str.lower, words) is often faster because str.lower is a C-level method descriptor called directly by map's C implementation - no Python function call stack frame is created per element. The list comprehension must create a Python-level call to w.lower() each iteration. The difference matters at scale: for 1M elements, map(str.lower, words) can be 30-50% faster. However, if you need a lambda (e.g., map(lambda w: w.lower()[:5], words)), the lambda overhead often makes it slower than the equivalent comprehension. Use map with built-in methods; use comprehensions when you need Python-level expressions.

Q6: What happens if you use a list comprehension for side effects like [db.save(r) for r in records]?

Answer: This creates a list of return values from db.save(r) - if save() returns None (common for write operations), you get [None, None, None, ...]. This wastes memory allocating a list you immediately discard. It also communicates the wrong intent: comprehensions signal "I am building a collection", not "I am performing actions". The correct form is a plain for loop. Linters such as pylint flag this pattern (for example through the expression-not-assigned check).

Practice Challenges

Beginner - Predict the Output

What does this code print?

x = 50
result = [x for x in range(5)]
print(x)
print(result)

Solution

Output:

50
[0, 1, 2, 3, 4]

In Python 3, the x inside the comprehension is scoped to the comprehension - it does not overwrite the outer x = 50. The outer x remains 50 after the comprehension executes. This is fundamentally different from Python 2 behavior where the loop variable would have leaked and overwritten the outer x.

Intermediate - Refactor for Correctness and Performance

The following code has a correctness issue and a performance issue. Identify and fix both.

def get_unique_active_user_domains(users):
    """users: list of dicts with 'email' and 'active' keys"""
    domains = []
    for user in users:
        if user["active"]:
            domain = user["email"].split("@")[1]
            if domain not in domains:  # <-- issue here
                domains.append(domain)
    return domains

users = [
    {"email": "[email protected]", "active": True},
    {"email": "[email protected]", "active": True},
    {"email": "[email protected]", "active": True},
    {"email": "[email protected]", "active": False},
    {"email": "[email protected]", "active": True},
]

Solution

Issues:

  1. Performance: if domain not in domains is O(n) on a list - as domains grows, this check takes longer. For 10,000 unique domains, this is roughly 10,000 × 10,000 / 2 = 50M comparisons. Use a set for O(1) membership.
  2. Ordering: the original returns domains in first-seen order, and a plain set loses that order. If callers rely on it, pair a seen set with a result list, as in the ordered variants below.

Fixed version using set comprehension:

def get_unique_active_user_domains(users):
    return {
        user["email"].split("@")[1]
        for user in users
        if user["active"]
    }

result = get_unique_active_user_domains(users)
print(result) # {'gmail.com', 'yahoo.com'}

The set comprehension automatically deduplicates and is O(n) total - one pass, O(1) per membership check (handled implicitly by set semantics).

If order matters (insertion-order-preserving unique list):

def get_unique_active_user_domains_ordered(users):
    seen = set()
    return [
        domain
        for user in users
        if user["active"]
        for domain in [user["email"].split("@")[1]]  # walrus trick without walrus
        if domain not in seen and not seen.add(domain)
    ]

Or more clearly with the walrus operator (Python 3.8+):

def get_unique_active_user_domains_ordered(users):
    seen = set()
    return [
        domain
        for user in users
        if user["active"]
        if (domain := user["email"].split("@")[1]) not in seen
        if not seen.add(domain)
    ]

Or simply the most readable loop:

def get_unique_active_user_domains_ordered(users):
    seen = set()
    result = []
    for user in users:
        if not user["active"]:
            continue
        domain = user["email"].split("@")[1]
        if domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result
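Another order-preserving option avoids the seen set entirely by relying on dict key uniqueness and insertion order (guaranteed since Python 3.7). A sketch, with a made-up function name and placeholder example.* addresses:

```python
def unique_active_domains(users):
    """Ordered, de-duplicated domains of active users via dict.fromkeys."""
    domains = (
        user["email"].split("@")[1]
        for user in users
        if user["active"]
    )
    # dict keys are unique and keep first-insertion order
    return list(dict.fromkeys(domains))


sample_users = [  # placeholder data for illustration only
    {"email": "a@example.org", "active": True},
    {"email": "b@example.com", "active": True},
    {"email": "c@example.org", "active": True},
    {"email": "d@example.net", "active": False},
    {"email": "e@example.com", "active": True},
]
print(unique_active_domains(sample_users))  # ['example.org', 'example.com']
```

This keeps the generator expression free of side effects, which the seen.add() tricks above do not.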

Advanced - Build a Multi-Stage Data Pipeline

You are processing log entries from a web server. Each log entry is a string in Apache Combined Log Format. You need to extract a frequency distribution of HTTP status codes, but only for requests that:

  1. Are not from the internal network (IP does not start with 10. or 192.168.)
  2. Took longer than 200ms (field at the end)
  3. Came from a real browser (User-Agent contains "Mozilla")

Design this as a generator-based pipeline and a dict comprehension for the final aggregation.

import re
from collections import Counter

# Sample log lines (simplified)
log_lines = [
    '10.0.0.1 - - [01/Mar/2026] "GET /api" 200 1234 "Mozilla/5.0" 150ms',
    '203.0.113.1 - - [01/Mar/2026] "GET /home" 200 5678 "Mozilla/5.0" 350ms',
    '203.0.113.2 - - [01/Mar/2026] "POST /api" 404 890 "curl/7.68" 80ms',
    '192.168.1.5 - - [01/Mar/2026] "GET /admin" 403 123 "Mozilla/5.0" 500ms',
    '198.51.100.3 - - [01/Mar/2026] "GET /api" 500 456 "Mozilla/5.0" 620ms',
    '203.0.113.3 - - [01/Mar/2026] "GET /api" 200 789 "Mozilla/5.0" 250ms',
    '203.0.113.4 - - [01/Mar/2026] "GET /api" 429 321 "Mozilla/5.0" 210ms',
]

Solution

import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Mar/2026] "GET /api" 200 1234 "Mozilla/5.0" 150ms',
    '203.0.113.1 - - [01/Mar/2026] "GET /home" 200 5678 "Mozilla/5.0" 350ms',
    '203.0.113.2 - - [01/Mar/2026] "POST /api" 404 890 "curl/7.68" 80ms',
    '192.168.1.5 - - [01/Mar/2026] "GET /admin" 403 123 "Mozilla/5.0" 500ms',
    '198.51.100.3 - - [01/Mar/2026] "GET /api" 500 456 "Mozilla/5.0" 620ms',
    '203.0.113.3 - - [01/Mar/2026] "GET /api" 200 789 "Mozilla/5.0" 250ms',
    '203.0.113.4 - - [01/Mar/2026] "GET /api" 429 321 "Mozilla/5.0" 210ms',
]

# The protocol token (e.g. HTTP/1.1) is optional: the simplified samples omit it
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+).*?"\w+ \S+(?: \S+)?" (?P<status>\d{3}) \d+ "(?P<ua>[^"]+)" (?P<ms>\d+)ms$'
)

def parse_log(line):
    """Returns a dict if line matches, else None."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return {
        "ip": m.group("ip"),
        "status": int(m.group("status")),
        "ua": m.group("ua"),
        "ms": int(m.group("ms")),
    }

def is_internal_ip(ip):
    return ip.startswith("10.") or ip.startswith("192.168.")

# Stage 1: Parse - generator, no memory allocation until consumed
parsed = (parse_log(line) for line in log_lines)

# Stage 2: Drop unparseable lines
valid = (record for record in parsed if record is not None)

# Stage 3: Filter by rules - external IP, slow response, browser UA
filtered = (
    record
    for record in valid
    if not is_internal_ip(record["ip"])
    if record["ms"] > 200
    if "Mozilla" in record["ua"]
)

# Stage 4: Consume - build status frequency dict
# All of stages 1-3 execute lazily here in one pass
status_counts = dict(Counter(record["status"] for record in filtered))

print(status_counts)
# {200: 2, 500: 1, 429: 1}

# Dict comprehension form for the final result (percentage breakdown)
total = sum(status_counts.values())
status_pct = {
    status: f"{count/total*100:.1f}%"
    for status, count in sorted(status_counts.items())
}
print(status_pct)
# {200: '50.0%', 429: '25.0%', 500: '25.0%'}

Design analysis:

  • Stages 1-3 are generators - no log line is parsed or filtered until stage 4 pulls data through
  • All filtering and transformation happens in a single pass - O(n) time, O(1) intermediate memory
  • The Counter at stage 4 is the only point where a data structure is materialized
  • The dict comprehension at the end is O(k) where k = number of unique status codes (tiny)
  • is_internal_ip extracted as a function - makes the generator expression readable

Quick Reference

| Form | Syntax | Returns | Eagerness | Use Case |
| --- | --- | --- | --- | --- |
| List comprehension | [expr for x in it] | list | Eager | Need list, multiple iteration |
| Filtered list comp | [expr for x in it if cond] | list | Eager | Filter + transform |
| Ternary in comp | [a if cond else b for x in it] | list | Eager | Transform all, vary output |
| Generator expression | (expr for x in it) | generator | Lazy | Single pass, large data |
| Dict comprehension | {k: v for k, v in it} | dict | Eager | Build lookup tables |
| Set comprehension | {expr for x in it} | set | Eager | Unique values, O(1) lookup |
| Nested flatten | [x for row in m for x in row] | list | Eager | Flatten 2D → 1D |
| Nested build | [[expr for j in cols] for i in rows] | list of lists | Eager | Build matrix |

| Operation | Comprehension | map() | loop |
| --- | --- | --- | --- |
| Simple transform | Fast | Faster (built-in) | Slowest |
| Lambda transform | Fast | Slower (lambda overhead) | Slowest |
| Filter + transform | Fast | Awkward | Clear |
| Side effects | Wrong tool | Wrong tool | Correct tool |
| Memory for N=1M | ~8 MB | ~8 MB (after list()) | ~8 MB |
| Memory lazy | Use generator | map() is lazy | Accumulate manually |

Key Takeaways

  • List comprehensions use the LIST_APPEND opcode - avoiding attribute lookup and method call overhead - making them 25-35% faster than equivalent for loops
  • map() with a built-in function (not a lambda) can outperform comprehensions because no Python function call is needed per element
  • Nested comprehension for-clauses read left-to-right matching equivalent nested loops: [x for row in matrix for x in row] = outer loop first
  • Generator expressions (expr for x in it) use O(1) memory regardless of input size - prefer them for sum, any, all, and large single-pass transformations
  • In Python 3, comprehension loop variables are scoped to the comprehension and do not leak - a critical fix over Python 2
  • Dict comprehensions {k: v for k, v in items} and set comprehensions {x for x in data} are the Pythonic way to build mappings and unique collections
  • Never use a comprehension for side effects - use a for loop; comprehensions are for producing collections
  • When logic exceeds approximately three conditions or two levels of nesting, switch to an explicit for loop for readability
© 2026 EngineersOfAI. All rights reserved.