List Comprehensions - Pythonic Iteration and Performance
Reading time: ~20 minutes | Level: Foundation → Engineering
Here is a question that exposes a common misconception about list comprehensions.
# Python 3 - variable scope inside comprehension
x = "outer"
result = [x for x in range(5)]
print(x) # What prints here?
If you expected 4, you have the Python 2 mental model. In Python 3, comprehensions have their own scope.
The actual output:
outer
The x inside the comprehension does not leak into the enclosing scope. This was a notorious Python 2 bug - fixed in Python 3 by giving comprehensions their own isolated scope.
Understanding list comprehensions at the bytecode level reveals why they are not just syntactic sugar, and why they perform differently from equivalent for loops.
What You Will Learn
- How list comprehensions compile to a different bytecode than for loops - the LIST_APPEND vs CALL_METHOD difference
- Performance benchmarks: list comprehensions vs for loops vs map() - with real timeit numbers
- How nested comprehensions flatten matrix structures: [x for row in matrix for x in row]
- Conditional filtering syntax and when to use if-else vs trailing if
- Generator expressions vs list comprehensions: lazy vs eager evaluation, and when each is right
- When NOT to use comprehensions - the readability threshold rule
- Dict comprehensions {k: v for k, v in items} and set comprehensions {x for x in data}
- Variable scoping in comprehensions and how Python 3 fixed the Python 2 scope leak
- Memory comparison: sys.getsizeof on list comprehension vs generator vs loop result
- Real-world patterns: data transformation pipelines, filtering, flattening
Prerequisites
- Python variables and name binding
- Basic Python for loop syntax
- Familiarity with functions and conditionals
- Optional: basic understanding of Big-O notation
Part 1 - What a List Comprehension Really Is
The Beginner Mental Model
Most beginners see a list comprehension as "a shorter for loop":
# "Equivalent" forms - are they really?
result_loop = []
for x in range(10):
    result_loop.append(x * x)
result_comp = [x * x for x in range(10)]
They produce the same output - but they are not the same at the bytecode level.
What CPython Actually Does
Python compiles both forms to bytecode. Let's inspect what the bytecode looks like:
import dis
# Disassemble the for loop version
def using_loop():
    result = []
    for x in range(10):
        result.append(x * x)
    return result

# Disassemble the comprehension version
def using_comprehension():
    return [x * x for x in range(10)]

print("=== FOR LOOP ===")
dis.dis(using_loop)
print("\n=== LIST COMPREHENSION ===")
dis.dis(using_comprehension)
The for loop version produces bytecode that:
- Looks up result.append on every iteration (attribute lookup)
- Calls CALL_METHOD to invoke append() on every iteration
- Discards the return value (which is None)
The list comprehension version produces bytecode that uses LIST_APPEND - a dedicated CPython opcode that directly appends to the list being built, without the attribute lookup or method call overhead.
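FOR LOOP - per iteration cost (schematic; exact opcode names vary by CPython version):
LOAD_FAST (load 'result')
LOAD_ATTR (look up 'append'; shown as LOAD_METHOD in CPython 3.7-3.11)
LOAD_FAST (load 'x')
BINARY_OP (multiply; BINARY_MULTIPLY before CPython 3.11)
CALL_METHOD (invoke append; shown as CALL in CPython 3.11+)
POP_TOP (discard None return)
LIST COMPREHENSION - per iteration cost:
LOAD_FAST (load 'x')
BINARY_OP (multiply)
LIST_APPEND (direct C-level append - no attribute lookup, no method call)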
This is why comprehensions are consistently faster than equivalent for loops - not dramatically, but measurably.
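One way to check the opcode claim on your own interpreter is to collect opcode names programmatically instead of reading the disassembly by eye. The sketch below is illustrative only: opnames is a hypothetical helper (it is not part of the dis module), and it walks nested code objects because CPython versions before 3.12 compile a comprehension into a separate code object.

```python
import dis
import types


def opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= opnames(const)
    return names


def using_loop():
    result = []
    for x in range(10):
        result.append(x * x)
    return result


def using_comprehension():
    return [x * x for x in range(10)]


loop_ops = opnames(using_loop.__code__)
comp_ops = opnames(using_comprehension.__code__)

print("LIST_APPEND" in comp_ops)  # True on CPython
print("LIST_APPEND" in loop_ops)  # False: the loop calls append() instead
```

The names of the call-related opcodes in loop_ops depend on the interpreter version, so printing sorted(loop_ops) is a quick way to see what your build actually emits.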
Part 2 - Performance: Benchmarks with timeit
Let's measure this with real numbers instead of intuition.
import timeit
N = 1_000_000
# Method 1: For loop
def loop_version():
    result = []
    for x in range(N):
        result.append(x * x)
    return result

# Method 2: List comprehension
def comp_version():
    return [x * x for x in range(N)]

# Method 3: map() with lambda
def map_lambda_version():
    return list(map(lambda x: x * x, range(N)))

# Method 4: map() with built-in (no lambda overhead)
def map_builtin_version():
    return list(map(pow, range(N), [2] * N))

# Benchmark each - mean time over 5 runs
runs = 5
t_loop = timeit.timeit(loop_version, number=runs) / runs
t_comp = timeit.timeit(comp_version, number=runs) / runs
t_map_lambda = timeit.timeit(map_lambda_version, number=runs) / runs
print(f"For loop: {t_loop:.4f}s")
print(f"List comprehension:{t_comp:.4f}s")
print(f"map(lambda): {t_map_lambda:.4f}s")
Typical output (Python 3.11, M2 Mac):
For loop: 0.0891s
List comprehension:0.0612s
map(lambda): 0.0784s
Key findings:
- List comprehension is approximately 25-35% faster than an equivalent for loop
- map() with a lambda is often slower than list comprehension because the lambda call adds overhead
- map() with a built-in function (like str, abs, int) can be faster than comprehensions because there is no Python-level function call at all
# map() with a built-in - potentially fastest for simple transforms
words = ["hello", "WORLD", "Python"]
lower = list(map(str.lower, words))
# No lambda, no Python function call overhead - str.lower is called at C level
:::tip When map() Wins
map(str.lower, words) often outperforms [w.lower() for w in words] because str.lower is a C-level method that map calls directly, with no Python bytecode executed per element. For simple single-method transforms, map with a built-in method is usually the fastest option, though the size of the gap depends on the interpreter version.
:::
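Timing claims like this are easy to check locally. The sketch below uses timeit.repeat and keeps the minimum, which tends to be less sensitive to background noise than a single averaged run. best_of is a hypothetical helper name, and the input size and repeat counts are arbitrary choices, so treat the printed numbers as indicative only.

```python
import timeit

words = ["hello", "WORLD", "Python"] * 1_000


def with_map():
    return list(map(str.lower, words))


def with_comprehension():
    return [w.lower() for w in words]


def best_of(func, repeat=5, number=20):
    """Return the fastest per-call time in seconds across `repeat` trials."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


# Both forms must agree before their speed is worth comparing.
assert with_map() == with_comprehension()

t_map = best_of(with_map)
t_comp = best_of(with_comprehension)
print(f"map(str.lower): {t_map * 1e6:.1f} us per call")
print(f"comprehension:  {t_comp * 1e6:.1f} us per call")
```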
Part 3 - Nested Comprehensions: Flattening and Building Matrices
Building a Matrix (Outer Comprehension = Rows)
# Build a 3x4 matrix
matrix = [[row * col for col in range(4)] for row in range(3)]
print(matrix)
# [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 4, 6]]
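In this nested-build form the outer loop is written last but runs first; the inner comprehension is the expression, evaluated once per outer iteration:
[[expression for inner_var in inner_iter] for outer_var in outer_iter]
↑ inner comprehension (runs once per outer item) ↑ outer loop (written last, runs first)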
Flattening a Matrix (Left to Right = Row-Major Order)
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Flatten - left to right means: outer for first, then inner for
flat = [x for row in matrix for x in row]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
The reading order matches the equivalent nested loop:
# Equivalent nested loop - read in same order
flat = []
for row in matrix:  # "for row in matrix" - appears first in comp
    for x in row:  # "for x in row" - appears second in comp
        flat.append(x)  # "x" - the expression, appears at the start
:::warning Nested Comprehension Reading Order
The expression comes first, then the for clauses in order. This trips up many engineers: [x for row in matrix for x in row] - read it as "give me x, iterate over matrix to get rows, iterate over rows to get x". The outer loop comes first in the for-clause sequence.
:::
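The two nesting styles can be compared side by side. The following is a small sketch on a made-up 2x3 matrix: the flatten form keeps its for clauses in nested-loop order, while the build form (used here for a transpose) puts the outer loop last.

```python
matrix = [[1, 2, 3], [4, 5, 6]]

# Flatten: for-clauses appear in the same order as the nested loops.
flat = [x for row in matrix for x in row]

flat_loop = []
for row in matrix:
    for x in row:
        flat_loop.append(x)

# Build: the inner comprehension is the expression, evaluated once per outer item.
# Here each column index produces one row of the transposed matrix.
transposed = [[row[col] for row in matrix] for col in range(len(matrix[0]))]

print(flat)        # [1, 2, 3, 4, 5, 6]
print(transposed)  # [[1, 4], [2, 5], [3, 6]]
```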
Flattening Irregular Nested Lists
# Real-world: flatten a list of lists of varying length
data = [[1, 2], [3], [4, 5, 6], [7, 8]]
flat = [item for sublist in data for item in sublist]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8]
# Equivalent with itertools - often more readable for deep nesting
from itertools import chain
flat = list(chain.from_iterable(data))
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8]
Part 4 - Conditional Filtering
Trailing if (Filtering)
The trailing if filters which items are included. No else clause.
data = range(-5, 6)
# Keep only positives
positives = [x for x in data if x > 0]
print(positives) # [1, 2, 3, 4, 5]
# Multiple conditions - both must be true
divisible_by_2_and_3 = [x for x in range(30) if x % 2 == 0 if x % 3 == 0]
print(divisible_by_2_and_3) # [0, 6, 12, 18, 24]
# Note: multiple if clauses are AND - same as: if x % 2 == 0 and x % 3 == 0
Conditional Expression in Output (Ternary)
The ternary form value_if_true if condition else value_if_false appears in the expression position (at the start), not as a filter:
# Transform: positive stays, negative becomes 0
clamped = [x if x > 0 else 0 for x in range(-3, 4)]
print(clamped) # [0, 0, 0, 0, 1, 2, 3]
# Both forms combined: filter AND transform
# Keep only non-zero, then square them
result = [x * x for x in range(-5, 6) if x != 0]
print(result) # [25, 16, 9, 4, 1, 1, 4, 9, 16, 25]
Comprehension Anatomy:
[ expression for var in iterable if condition ]
- expression: what to produce
- var: loop variable
- iterable: source
- if condition: optional filter (no else allowed here)
vs ternary in expression:
[ a if cond else b for var in iterable ]
- a if cond else b: conditional value
- for var in iterable: loop (no trailing if filter)
:::note Filter vs Transform
[x for x in data if condition] - trailing if FILTERS (removes items).
[x if condition else y for x in data] - ternary TRANSFORMS (changes value but keeps all items).
These are fundamentally different. Do not confuse them.
:::
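A quick way to internalize the distinction is to apply both forms to the same input and compare output lengths. The data below is invented for illustration.

```python
temps = [-3, 0, 5, -1, 8]

# Trailing if: filter. Output may be shorter than the input.
above_zero = [t for t in temps if t > 0]

# Ternary in the expression: transform. Output length equals input length.
labels = ["warm" if t > 0 else "cold" for t in temps]

# Both at once: drop zeros first, then label what remains.
nonzero_labels = ["warm" if t > 0 else "cold" for t in temps if t != 0]

print(above_zero)      # [5, 8]
print(labels)          # ['cold', 'cold', 'warm', 'cold', 'warm']
print(nonzero_labels)  # ['cold', 'warm', 'cold', 'warm']
```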
Part 5 - Generator Expressions vs List Comprehensions
The Core Difference: Eager vs Lazy
import sys
# List comprehension - EAGER: builds entire list in memory NOW
list_comp = [x * x for x in range(1_000_000)]
# Generator expression - LAZY: builds nothing now, yields on demand
gen_expr = (x * x for x in range(1_000_000)) # Note: parentheses, not brackets
print(f"List comp size: {sys.getsizeof(list_comp):,} bytes")
print(f"Generator size: {sys.getsizeof(gen_expr):,} bytes")
Output:
List comp size: 8,448,728 bytes (~8 MB for 1M integers)
Generator size: 200 bytes (just the generator object itself)
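The generator object stays at roughly 200 bytes regardless of how large the range is (the exact figure varies by CPython version). The list grows in proportion to n, and sys.getsizeof counts only the list's pointer array, not the integer objects it references.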
When to Use Each
# Use LIST COMPREHENSION when:
# 1. You need to iterate multiple times
# 2. You need random access (indexing)
# 3. You need len()
# 4. You need to pass to a function expecting a list
names = ["alice", "bob", "charlie"]
upper_names = [n.upper() for n in names] # Will be iterated multiple times
print(upper_names[0]) # Random access - need a list
print(len(upper_names)) # Need len() - generators don't have len()
# Use GENERATOR EXPRESSION when:
# 1. Single-pass iteration (sum, max, min, any, all)
# 2. Very large datasets (avoids memory spike)
# 3. Chaining transformations in a pipeline
# 4. Passing to functions that accept iterables
data = range(10_000_000)
# Generator is perfect here - sum iterates once, discards each value
total = sum(x * x for x in data) # No intermediate list created
# max, min, any, all - all work with generators
has_large = any(x > 9_999_990 for x in data) # Stops at the first True value and skips the rest
:::tip Short-Circuiting with any() and all()
any(gen) stops as soon as it finds a True value. all(gen) stops at the first False. When using generators, this means the full sequence may never be exhausted. A list comprehension would evaluate everything first - wasting work.
:::
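The difference in work done can be observed directly by counting predicate calls. CountingPredicate below is a hypothetical helper written for this demonstration; it is not a standard-library class.

```python
class CountingPredicate:
    """Callable that records how many times it has been evaluated."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x > self.threshold


data = range(1_000)

lazy = CountingPredicate(2)
any(lazy(x) for x in data)        # generator: stops once x == 3 yields True

eager = CountingPredicate(2)
any([eager(x) for x in data])     # list: every element is evaluated first

print(lazy.calls)   # 4
print(eager.calls)  # 1000
```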
Generator Expressions in Pipelines
# Multi-stage pipeline - each stage is a generator
# No stage materializes until the final consumer
raw_data = range(1_000_000)
# Stage 1: Keep multiples of 7 (in real code: read from file or DB)
stage1 = (x for x in raw_data if x % 7 == 0)
# Stage 2: Transform
stage2 = (x * x for x in stage1)
# Stage 3: Additional filter
stage3 = (x for x in stage2 if x < 10_000)
# Only NOW does data flow through the pipeline
result = list(stage3) # Pulls data through all three stages
# Intermediate memory: O(1) - only the current item is in flight; just the final list is materialized
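Note that a filtering stage never stops the pipeline: with a trailing if, every one of the million inputs is still pulled and tested. When the input is known to be ascending, so that the squares are ascending too, itertools.takewhile can end the iteration at the first failing value. This is a sketch under that ordering assumption; it is not a drop-in replacement for unsorted data.

```python
from itertools import takewhile

raw_data = range(1_000_000)

# Original shape: the last stage is a filter, so every input is examined.
multiples = (x for x in raw_data if x % 7 == 0)
squares = (x * x for x in multiples)
filtered = [x for x in squares if x < 10_000]

# Variant: takewhile stops pulling at the first square >= 10_000.
# Valid only because raw_data is ascending, so the squares are too.
multiples = (x for x in raw_data if x % 7 == 0)
squares = (x * x for x in multiples)
early = list(takewhile(lambda sq: sq < 10_000, squares))

print(early == filtered)  # True
print(len(early))         # 15
```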
Part 6 - Dict Comprehensions and Set Comprehensions
Dict Comprehensions
# Basic: {key_expr: value_expr for var in iterable}
squares_dict = {x: x * x for x in range(6)}
print(squares_dict)
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
# Invert a dictionary (swap keys and values)
original = {"a": 1, "b": 2, "c": 3}
inverted = {v: k for k, v in original.items()}
print(inverted) # {1: 'a', 2: 'b', 3: 'c'}
# Filter while building
users = [
    {"name": "Alice", "active": True, "score": 85},
    {"name": "Bob", "active": False, "score": 92},
    {"name": "Carol", "active": True, "score": 78},
]
active_scores = {u["name"]: u["score"] for u in users if u["active"]}
print(active_scores) # {'Alice': 85, 'Carol': 78}
# Normalize keys: lowercase all keys from an external API response
api_response = {"UserID": 42, "UserName": "alice", "Score": 99}
normalized = {k.lower(): v for k, v in api_response.items()}
print(normalized) # {'userid': 42, 'username': 'alice', 'score': 99}
Set Comprehensions
# Basic: {expr for var in iterable} - note: curly braces, no colon
unique_lengths = {len(word) for word in ["cat", "dog", "elephant", "ant", "bear"]}
print(unique_lengths) # {3, 8, 4} (order not guaranteed)
# Extract unique domains from email list
emails = [
]
domains = {email.split("@")[1] for email in emails}
print(domains) # {'gmail.com', 'yahoo.com', 'outlook.com'}
# Set comprehension for deduplication with transform
numbers = [1, -2, 3, -3, 2, -1, 4, -4]
unique_abs = {abs(n) for n in numbers}
print(unique_abs) # {1, 2, 3, 4}
:::note Empty Dict vs Empty Set
{} creates an empty dict, not an empty set. Use set() to create an empty set. But {x for x in items} (with an expression) correctly creates a set comprehension.
:::
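The brace rules, plus one caveat about inverting dictionaries, can be confirmed in a few lines. This is a minimal sketch on invented data; the caveat is that inverting a dict whose values repeat silently keeps only the last key seen for each value.

```python
empty_braces = {}
empty_set = set()
print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# A colon in the expression makes it a dict comprehension; without one it is a set.
as_dict = {n: n % 3 for n in range(6)}
as_set = {n % 3 for n in range(6)}
print(as_dict)         # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
print(sorted(as_set))  # [0, 1, 2]

# Inverting a dict with repeated values keeps only the last key seen per value.
inverted = {v: k for k, v in as_dict.items()}
print(inverted)        # {0: 3, 1: 4, 2: 5}
```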
Part 7 - Variable Scoping in Comprehensions
Python 3 Fixed the Python 2 Scope Leak
In Python 2, the loop variable in a list comprehension leaked into the enclosing scope:
# Python 2 behavior (DO NOT RELY ON THIS):
# x = "outer"
# result = [x for x in range(5)]
# print(x) # Would print: 4 (leaked!)
# Python 3 behavior (correct):
x = "outer"
result = [x for x in range(5)]
print(x) # "outer" - comprehension variable does NOT leak
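Comprehensions in Python 3 create their own scope. Through CPython 3.11 this is implemented as a nested function in bytecode, so the loop variable is local to that implicit function; CPython 3.12 and later inline comprehensions (PEP 709) but still keep the loop variable isolated.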
# Proof: the outer x is unchanged
x = 100
squares = [x * x for x in range(5)] # x is local to comprehension
print(x) # 100 - unchanged
print(squares) # [0, 1, 4, 9, 16]
# But the comprehension CAN close over outer variables
base = 10
result = [base + x for x in range(5)] # 'base' is captured from outer scope
print(result) # [10, 11, 12, 13, 14]
:::warning Generator Expressions and Scope
Generator expressions also have their own scope. The iterable of the outermost for clause is evaluated immediately in the enclosing scope, but everything else (nested fors, conditions, expressions) is evaluated lazily inside the generator's scope.
:::
# The outermost iterable is evaluated immediately
data = [1, 2, 3]
gen = (x * 2 for x in data) # 'data' is captured NOW
data = [10, 20, 30] # Rebinding data has no effect
print(list(gen)) # [2, 4, 6] - used original data
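The flip side of that rule is that names used anywhere other than the outermost iterable are looked up only when the generator runs. The sketch below contrasts the two cases inside a hypothetical demo function; the variable names are illustrative.

```python
def demo():
    data = [1, 2, 3, 4]
    threshold = 1

    # 'data' is evaluated right here; 'threshold' is looked up only when
    # the generator actually runs.
    gen = (x for x in data if x > threshold)

    data = [10, 20, 30]   # no effect: the generator already holds the old list
    threshold = 2         # takes effect: the condition has not run yet

    return list(gen)


print(demo())  # [3, 4]
```

Rebinding a name used in the condition between creating and consuming a generator is therefore a subtle source of bugs; consuming the generator promptly, or using a list comprehension, avoids it.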
Part 8 - When NOT to Use Comprehensions
The Readability Threshold
Comprehensions express "produce a collection by transforming/filtering an iterable." When the transformation logic is complex, a comprehension becomes harder to read than a loop.
# GOOD: Simple, reads like English
evens = [x for x in data if x % 2 == 0]
names = [user.name for user in users]
# ACCEPTABLE: Slight complexity but still clear
active_emails = [u.email.lower() for u in users if u.is_active and u.email]
# BAD: Complexity has exceeded the readability threshold
result = [
    process(item)
    for sublist in nested_data
    for item in sublist
    if item.status == "valid"
    if item.score > threshold
    if not item.is_excluded
]
For that last example, a loop with meaningful variable names is clearer:
# Better as a loop - the complexity is exposed and manageable
result = []
for sublist in nested_data:
    for item in sublist:
        if item.status != "valid":
            continue
        if item.score <= threshold:
            continue
        if item.is_excluded:
            continue
        result.append(process(item))
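A middle ground, when a comprehension is still wanted, is to move the conditions into a named predicate so the comprehension itself stays short. Everything in this sketch is hypothetical: the Item dataclass, is_eligible, and the body of process are stand-ins, since the original example does not define them.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical record type standing in for the items in the example."""
    status: str
    score: int
    is_excluded: bool = False


def is_eligible(item, threshold):
    """All three filter conditions behind one descriptive name."""
    return item.status == "valid" and item.score > threshold and not item.is_excluded


def process(item):
    """Placeholder transformation; the real one is not shown in the text."""
    return item.score * 2


nested_data = [
    [Item("valid", 90), Item("invalid", 95)],
    [Item("valid", 40), Item("valid", 80, is_excluded=True), Item("valid", 70)],
]
threshold = 50

result = [
    process(item)
    for sublist in nested_data
    for item in sublist
    if is_eligible(item, threshold)
]
print(result)  # [180, 140]
```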
Never Use Comprehensions for Side Effects
# WRONG: Using comprehension for side effects
[print(x) for x in range(5)] # Creates [None, None, None, None, None]
[db.save(record) for record in records] # Wasteful list creation
# RIGHT: Use a loop when the purpose is side effects
for x in range(5):
    print(x)
for record in records:
    db.save(record)
:::danger Side Effects in Comprehensions
Using a list comprehension solely for its side effects (print, save, mutate) creates a throwaway list of None values. This wastes memory and communicates the wrong intent. Comprehensions are for building collections. Loops are for executing sequences of actions.
:::
Part 9 - Memory Deep Dive: sys.getsizeof Comparison
import sys
N = 100_000
# Method 1: List comprehension
list_comp = [x * x for x in range(N)]
size_list = sys.getsizeof(list_comp)
# getsizeof counts only the list's pointer array, not the int objects it references (CPython caches only small ints, -5 to 256)
# Method 2: Generator (does NOT build anything)
gen_expr = (x * x for x in range(N))
size_gen = sys.getsizeof(gen_expr)
# Method 3: Traditional loop result
loop_result = []
for x in range(N):
    loop_result.append(x * x)
size_loop = sys.getsizeof(loop_result)
print(f"List comprehension: {size_list:,} bytes")
print(f"Generator: {size_gen:,} bytes")
print(f"Loop result: {size_loop:,} bytes")
Output:
List comprehension: 800,984 bytes
Generator: 200 bytes
Loop result: 800,984 bytes
Key observations:
- List comprehension and loop result use the same memory - they both produce a list
- Generator uses constant memory (200 bytes) regardless of N
- The list stores pointers (8 bytes each), so N pointers = 8N bytes + overhead
| Structure | Memory (N=100) | Notes |
|---|---|---|
| List comprehension result | ~856+ bytes | 56 B struct + 100 pointers × 8 B + integer objects on heap |
| Generator expression | ~200 bytes | Frame + state only - constant regardless of N |
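Because sys.getsizeof is shallow, the figures above cover the container only. The sketch below estimates a deeper size by adding the element objects, and checks that a generator's size does not depend on the length of its input. shallow_and_deep and make_gen are hypothetical helper names, and the deep figure is a rough estimate that assumes every element is a distinct object.

```python
import sys


def shallow_and_deep(values):
    """Return (list object size, list size plus the int objects it points to)."""
    shallow = sys.getsizeof(values)
    deep = shallow + sum(sys.getsizeof(v) for v in values)
    return shallow, deep


def make_gen(n):
    """Build a lazy squares generator over range(n)."""
    return (x * x for x in range(n))


big = [x * x for x in range(1000, 2000)]   # values outside the small-int cache
shallow, deep = shallow_and_deep(big)

small_gen = make_gen(10)
large_gen = make_gen(10_000_000)

print(shallow < deep)                                        # True
print(sys.getsizeof(small_gen) == sys.getsizeof(large_gen))  # True
print(sys.getsizeof(large_gen) < shallow)                    # True
```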
Part 10 - Real-World Data Transformation Patterns
Pattern 1: ETL Pipeline (Extract-Transform-Load)
# Raw API response data
raw_records = [
{"user_id": "U001", "name": " Alice ", "score": "85", "active": "true"},
{"user_id": "U002", "name": " Bob ", "score": "72", "active": "false"},
{"user_id": "U003", "name": " Carol ", "score": "91", "active": "true"},
{"user_id": "U004", "name": " Dave ", "score": "68", "active": "true"},
]
# Transform: clean, type-cast, and filter in one comprehension
clean_active_users = [
    {
        "user_id": r["user_id"],
        "name": r["name"].strip(),
        "score": int(r["score"]),
    }
    for r in raw_records
    if r["active"] == "true"
]
print(clean_active_users)
# [
# {'user_id': 'U001', 'name': 'Alice', 'score': 85},
# {'user_id': 'U003', 'name': 'Carol', 'score': 91},
# {'user_id': 'U004', 'name': 'Dave', 'score': 68}
# ]
Pattern 2: Building an Index (Dict Comprehension)
# Build a lookup index: id → record
users = [
{"id": 1, "name": "Alice"},
{"id": 2, "name": "Bob"},
{"id": 3, "name": "Carol"},
]
user_index = {u["id"]: u for u in users}
# O(1) lookup now instead of O(n) scan
print(user_index[2]) # {'id': 2, 'name': 'Bob'}
Pattern 3: Frequency Count (Dict Comprehension + Counter)
from collections import Counter
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
# Using dict comprehension with Counter
freq = {word: count for word, count in Counter(words).items() if count > 1}
print(freq) # {'apple': 3, 'banana': 2}
# Or just keep top N
top_2 = {word: count for word, count in Counter(words).most_common(2)}
print(top_2) # {'apple': 3, 'banana': 2}
Pattern 4: Flattening Nested API Responses
# API returns nested: departments → employees
departments = [
{"dept": "Engineering", "employees": ["Alice", "Bob", "Carol"]},
{"dept": "Sales", "employees": ["Dave", "Eve"]},
{"dept": "HR", "employees": ["Frank"]},
]
# Flatten: all employee names
all_employees = [emp for dept in departments for emp in dept["employees"]]
print(all_employees)
# ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank']
# With metadata: keep dept association
employee_dept = [
    (emp, dept["dept"])
    for dept in departments
    for emp in dept["employees"]
]
print(employee_dept)
# [('Alice', 'Engineering'), ('Bob', 'Engineering'), ...]
Interview Questions
Q1: Why are list comprehensions faster than equivalent for loops in CPython?
Answer: List comprehensions use the LIST_APPEND bytecode opcode, which is a direct C-level list append that avoids the overhead of attribute lookup (result.append), method binding, and the Python function call machinery (CALL_METHOD). A for loop using result.append() must perform these steps on every iteration. Benchmarks typically show comprehensions are 25-35% faster than equivalent loops. However, for trivial transforms the difference narrows, and map() with a built-in (non-lambda) function can sometimes outperform both.
Q2: What is the difference between [x for x in data if x > 0] and [x if x > 0 else 0 for x in data]?
Answer: The first is a filter - items where x > 0 is False are excluded from the output list. The result may be shorter than the input. The second is a conditional expression (ternary) in the output position - every item from data produces an output, but the value is either x (if x > 0) or 0 (otherwise). The result has the same length as the input. The trailing if filters; the ternary if-else in the expression position transforms.
Q3: When should you use a generator expression instead of a list comprehension?
Answer: Use a generator expression when: (1) you only need to iterate through the results once - e.g., sum(x*x for x in data), (2) the dataset is large and you want to avoid materializing all results in memory, (3) you are building a lazy pipeline and want data to flow through stages on demand. Use a list comprehension when: you need to iterate multiple times, access by index, call len(), or pass to a function that requires an actual list. Generators use O(1) memory; list comprehensions use O(n) memory.
Q4: How does variable scoping inside comprehensions work in Python 3?
Answer: In Python 3, comprehensions have their own scope - the loop variable does not leak into the surrounding scope. Through CPython 3.11 this is implemented by compiling the comprehension as a nested function that is immediately invoked, so the loop variable is local to that implicit function; CPython 3.12+ inlines the comprehension (PEP 709) while preserving the same isolation. The comprehension CAN still read (close over) variables from the enclosing scope - it just cannot write back to them through the loop variable. This fixed a well-known Python 2 wart where the loop variable leaked out: x = "outer"; [x for x in range(5)]; print(x) printed 4 in Python 2 and "outer" in Python 3.
Q5: What is the performance difference between map(str.lower, words) and [w.lower() for w in words]?
Answer: map(str.lower, words) is often faster because str.lower is a C-level method descriptor called directly by map's C implementation - no Python function call stack frame is created per element. The list comprehension must create a Python-level call to w.lower() each iteration. The difference matters at scale: for 1M elements, map(str.lower, words) can be 30-50% faster. However, if you need a lambda (e.g., map(lambda w: w.lower()[:5], words)), the lambda overhead often makes it slower than the equivalent comprehension. Use map with built-in methods; use comprehensions when you need Python-level expressions.
Q6: What happens if you use a list comprehension for side effects like [db.save(r) for r in records]?
Answer: This creates a list of return values from db.save(r) - if save() returns None (common for write operations), you get [None, None, None, ...]. This wastes memory allocating a list you immediately discard. It also communicates the wrong intent: comprehensions signal "I am building a collection", not "I am performing actions". The correct form is a plain for loop. Linters can catch this: pylint's expression-not-assigned check (W0106), for example, flags a comprehension whose result is discarded.
Practice Challenges
Beginner - Predict the Output
What does this code print?
x = 50
result = [x for x in range(5)]
print(x)
print(result)
Solution
Output:
50
[0, 1, 2, 3, 4]
In Python 3, the x inside the comprehension is scoped to the comprehension - it does not overwrite the outer x = 50. The outer x remains 50 after the comprehension executes. This is fundamentally different from Python 2 behavior where the loop variable would have leaked and overwritten the outer x.
Intermediate - Refactor for Correctness and Performance
The following code has a correctness issue and a performance issue. Identify and fix both.
def get_unique_active_user_domains(users):
    """users: list of dicts with 'email' and 'active' keys"""
    domains = []
    for user in users:
        if user["active"]:
            domain = user["email"].split("@")[1]
            if domain not in domains:  # <-- issue here
                domains.append(domain)
    return domains
users = [
]
Solution
Issues:
- Performance: if domain not in domains is O(n) on a list - as domains grows, this check takes longer. For 10,000 unique domains, this is 10,000 × 10,000 / 2 = 50M comparisons. Use a set for O(1) membership.
- Ordering: Using a set loses insertion order (though in Python 3.7+ dicts preserve order).
Fixed version using set comprehension:
def get_unique_active_user_domains(users):
    return {
        user["email"].split("@")[1]
        for user in users
        if user["active"]
    }
result = get_unique_active_user_domains(users)
print(result) # {'gmail.com', 'yahoo.com'}
The set comprehension automatically deduplicates and is O(n) total - one pass, O(1) per membership check (handled implicitly by set semantics).
If order matters (insertion-order-preserving unique list):
def get_unique_active_user_domains_ordered(users):
    seen = set()
    return [
        domain
        for user in users
        if user["active"]
        for domain in [user["email"].split("@")[1]]  # walrus trick without walrus
        if domain not in seen and not seen.add(domain)
    ]
Or more clearly with the walrus operator (Python 3.8+):
def get_unique_active_user_domains_ordered(users):
    seen = set()
    return [
        domain
        for user in users
        if user["active"]
        if (domain := user["email"].split("@")[1]) not in seen
        if not seen.add(domain)
    ]
Or simply the most readable loop:
def get_unique_active_user_domains_ordered(users):
    seen = set()
    result = []
    for user in users:
        if not user["active"]:
            continue
        domain = user["email"].split("@")[1]
        if domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result
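Since dicts preserve insertion order in Python 3.7+, another option worth considering is dict.fromkeys, which de-duplicates while keeping first-seen order and avoids the seen.add side effect inside a comprehension. The function name and the sample addresses below are hypothetical placeholders.

```python
def unique_active_domains(users):
    """Ordered, de-duplicated domains of active users via dict.fromkeys."""
    domains = (
        user["email"].split("@")[1]
        for user in users
        if user["active"]
    )
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(domains))


# Hypothetical sample data; the addresses are placeholders.
sample_users = [
    {"email": "a@example.com", "active": True},
    {"email": "b@example.org", "active": True},
    {"email": "c@example.com", "active": True},
    {"email": "d@example.net", "active": False},
]
print(unique_active_domains(sample_users))  # ['example.com', 'example.org']
```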
Advanced - Build a Multi-Stage Data Pipeline
You are processing log entries from a web server. Each log entry is a string in Apache Combined Log Format. You need to extract a frequency distribution of HTTP status codes, but only for requests that:
- Are not from the internal network (IP does not start with 10. or 192.168.)
- Took longer than 200ms (field at the end)
- Came from a real browser (User-Agent contains "Mozilla")
Design this as a generator-based pipeline and a dict comprehension for the final aggregation.
import re
from collections import Counter
# Sample log lines (simplified)
log_lines = [
'10.0.0.1 - - [01/Mar/2026] "GET /api" 200 1234 "Mozilla/5.0" 150ms',
'203.0.113.1 - - [01/Mar/2026] "GET /home" 200 5678 "Mozilla/5.0" 350ms',
'203.0.113.2 - - [01/Mar/2026] "POST /api" 404 890 "curl/7.68" 80ms',
'192.168.1.5 - - [01/Mar/2026] "GET /admin" 403 123 "Mozilla/5.0" 500ms',
'198.51.100.3 - - [01/Mar/2026] "GET /api" 500 456 "Mozilla/5.0" 620ms',
'203.0.113.3 - - [01/Mar/2026] "GET /api" 200 789 "Mozilla/5.0" 250ms',
'203.0.113.4 - - [01/Mar/2026] "GET /api" 429 321 "Mozilla/5.0" 210ms',
]
Solution
import re
from collections import Counter
log_lines = [
'10.0.0.1 - - [01/Mar/2026] "GET /api" 200 1234 "Mozilla/5.0" 150ms',
'203.0.113.1 - - [01/Mar/2026] "GET /home" 200 5678 "Mozilla/5.0" 350ms',
'203.0.113.2 - - [01/Mar/2026] "POST /api" 404 890 "curl/7.68" 80ms',
'192.168.1.5 - - [01/Mar/2026] "GET /admin" 403 123 "Mozilla/5.0" 500ms',
'198.51.100.3 - - [01/Mar/2026] "GET /api" 500 456 "Mozilla/5.0" 620ms',
'203.0.113.3 - - [01/Mar/2026] "GET /api" 200 789 "Mozilla/5.0" 250ms',
'203.0.113.4 - - [01/Mar/2026] "GET /api" 429 321 "Mozilla/5.0" 210ms',
]
# The protocol token in the request is optional: the simplified sample
# lines contain only the method and path ("GET /api").
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+).*?"(?:\w+) \S+(?: \S+)?" (?P<status>\d{3}) \d+ "(?P<ua>[^"]+)" (?P<ms>\d+)ms$'
)

def parse_log(line):
    """Returns a dict if line matches, else None."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return {
        "ip": m.group("ip"),
        "status": int(m.group("status")),
        "ua": m.group("ua"),
        "ms": int(m.group("ms")),
    }

def is_internal_ip(ip):
    return ip.startswith("10.") or ip.startswith("192.168.")
# Stage 1: Parse - generator, no memory allocation until consumed
parsed = (parse_log(line) for line in log_lines)
# Stage 2: Drop unparseable lines
valid = (record for record in parsed if record is not None)
# Stage 3: Filter by rules - external IP, slow response, browser UA
filtered = (
    record
    for record in valid
    if not is_internal_ip(record["ip"])
    if record["ms"] > 200
    if "Mozilla" in record["ua"]
)
# Stage 4: Consume - build status frequency dict
# All of stages 1-3 execute lazily here in one pass
status_counts = dict(Counter(record["status"] for record in filtered))
print(status_counts)
# {200: 2, 500: 1, 429: 1}
# Dict comprehension form for the final result (percentage breakdown)
total = sum(status_counts.values())
status_pct = {
    status: f"{count/total*100:.1f}%"
    for status, count in sorted(status_counts.items())
}
print(status_pct)
# {200: '50.0%', 429: '25.0%', 500: '25.0%'}
Design analysis:
- Stages 1-3 are generators - zero memory allocated until stage 4 pulls data through
- All filtering and transformation happens in a single pass - O(n) time, O(1) intermediate memory
- The Counter at stage 4 is the only point where a data structure is materialized
- The dict comprehension at the end is O(k) where k = number of unique status codes (tiny)
- is_internal_ip extracted as a function - makes the generator expression readable
Quick Reference
| Form | Syntax | Returns | Eagerness | Use Case |
|---|---|---|---|---|
| List comprehension | [expr for x in it] | list | Eager | Need list, multiple iteration |
| Filtered list comp | [expr for x in it if cond] | list | Eager | Filter + transform |
| Ternary in comp | [a if cond else b for x in it] | list | Eager | Transform all, vary output |
| Generator expression | (expr for x in it) | generator | Lazy | Single pass, large data |
| Dict comprehension | {k: v for k, v in it} | dict | Eager | Build lookup tables |
| Set comprehension | {expr for x in it} | set | Eager | Unique values, O(1) lookup |
| Nested flatten | [x for row in m for x in row] | list | Eager | Flatten 2D → 1D |
| Nested build | [[expr for j in cols] for i in rows] | list of lists | Eager | Build matrix |
| Operation | Comprehension | map() | loop |
|---|---|---|---|
| Simple transform | Fast | Faster (built-in) | Slowest |
| Lambda transform | Fast | Slower (lambda overhead) | Slowest |
| Filter + transform | Fast | Awkward | Clear |
| Side effects | Wrong tool | Wrong tool | Correct tool |
| Memory for N=1M | ~8 MB | ~8 MB (after list()) | ~8 MB |
| Memory lazy | Use generator | map() is lazy | Accumulate manually |
Key Takeaways
- List comprehensions use the LIST_APPEND opcode - avoiding attribute lookup and method call overhead - making them 25-35% faster than equivalent for loops
- map() with a built-in function (not a lambda) can outperform comprehensions because no Python function call is needed per element
- Nested comprehension for-clauses read left-to-right matching equivalent nested loops: [x for row in matrix for x in row] = outer loop first
- Generator expressions (expr for x in it) use O(1) memory regardless of input size - prefer them for sum, any, all, and large single-pass transformations
- In Python 3, comprehension loop variables are scoped to the comprehension and do not leak - a critical fix over Python 2
- Dict comprehensions {k: v for k, v in items} and set comprehensions {x for x in data} are the Pythonic way to build mappings and unique collections
- Never use a comprehension for side effects - use a for loop; comprehensions are for producing collections
- When logic exceeds approximately three conditions or two levels of nesting, switch to an explicit for loop for readability
