Sets and Mathematical Operations - Hash-Based Membership Testing
Reading time: ~20 minutes | Level: Foundation → Engineering
Opening Challenge
Predict every line of output before continuing:
s = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3}
print(s)
print(type({}))
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a & b)
print(a | b)
print(a - b)
print(b - a)
print(a ^ b)
print(a <= {1, 2, 3, 4, 5})
fs = frozenset([1, 2, 3])
d = {fs: "frozen key"}
print(d[frozenset([1, 2, 3])])
try:
bad = {[1, 2], [3, 4]}
except TypeError as e:
print(e)
import timeit
lst = list(range(1_000_000))
st = set(lst)
t1 = timeit.timeit(lambda: 999_999 in lst, number=1000)
t2 = timeit.timeit(lambda: 999_999 in st, number=1000)
print(f"list: {t1:.4f}s set: {t2:.6f}s")
{1, 2, 3, 4, 5, 6, 9}
<class 'dict'>
{3, 4}
{1, 2, 3, 4, 5, 6}
{1, 2}
{5, 6}
{1, 2, 5, 6}
True
frozen key
unhashable type: 'list'
list: 0.1823s set: 0.000047s
Three surprises: {} is a dict, not a set. The displayed order of s is not insertion order - it reflects internal hash table layout. And the timing gap between list and set membership testing for 1M elements is approximately 3000x. That final number is not a micro-optimization - it is the difference between a system that handles 100 requests per second and one that handles 300,000. This lesson explains why, all the way down to the hash table.
What You Will Learn
- How CPython implements sets as a hash table without values - only keys
setvsfrozenset: immutability, hashability, and when to use each- The four creation forms: literal, constructor, comprehension, and the empty-set trap
- O(1) membership testing: why it is the primary reason to choose sets over lists
- All mathematical set operations with ASCII Venn diagrams and both operator and method syntax
- In-place variants and their performance characteristics
- Quantified performance comparison:
x in listO(n) vsx in setO(1) with real timing - Engineering use cases: deduplication, fast membership testing, delta computation, config diffing
- Frozensets as dict keys and set-of-sets elements
- Iteration order: what CPython does vs what the language guarantees
- Every common pitfall with clear explanations and fixes
Prerequisites
You should understand Python lists and dicts at the level of the previous lessons. The hash table concepts covered in the Dictionary lesson apply directly to sets - a set is essentially a dict with keys only and no values.
What Is a Set? The Internal Reality
A Python set is an unordered collection of unique hashable elements. The definition sounds simple, but each word is load-bearing:
- Unordered: the iteration order is determined by hash values and table layout, not by insertion order
- Unique: duplicates are silently discarded; inserting an element that already exists is a no-op
- Hashable: every element must implement
__hash__()and__eq__()- the same requirement as dict keys
The CPython implementation is a hash table that stores only keys - no values. You can think of it as a dict where every value is the implicit constant True, except that CPython optimizes away the value storage entirely:
CPython set internal layout (simplified):
Hash table slots:
slot 0: EMPTY
slot 1: element=5 (hash(5) % table_size == 1)
slot 2: EMPTY
slot 3: element=3 (hash(3) % table_size == 3)
slot 4: element=1 (hash(1) % table_size == 4 - after probing, say)
slot 5: EMPTY
slot 6: EMPTY
slot 7: element=2 (hash(2) % table_size == 7)
No values stored - just the elements themselves.
This is why x in s is O(1): Python computes hash(x), jumps to the corresponding slot, and checks if that slot holds x. It is a direct array jump, not a scan.
CPython sets use open addressing with a probing strategy similar to dicts (perturbation-based). Each slot is either empty, occupied, or a dummy/tombstone (marking a previously occupied slot that was deleted). The same load factor threshold (approximately 2/3) triggers table doubling and rehashing.
set vs frozenset: Immutability Matters
Python provides two set types:
| Property | set | frozenset |
|---|---|---|
| Mutable | Yes | No |
| Hashable | No | Yes |
| Dict key | No | Yes |
| Element of set | No | Yes |
| Creation | {1, 2} or set() | frozenset([1, 2]) |
frozenset is the immutable counterpart to set. Because it cannot change, its hash can be computed once and cached, making it hashable. This enables two use cases impossible with set:
# frozenset as a dict key
permissions_cache = {}
read_write = frozenset({"read", "write"})
permissions_cache[read_write] = "standard access"
print(permissions_cache[frozenset({"read", "write"})]) # "standard access"
# frozenset as an element of another set (set of sets)
group_a = frozenset([1, 2, 3])
group_b = frozenset([3, 4, 5])
all_groups = {group_a, group_b} # set of frozensets - valid!
print(all_groups)
# {frozenset({1, 2, 3}), frozenset({3, 4, 5})}
# But set of sets fails
try:
bad = {set([1, 2]), set([3, 4])}
except TypeError as e:
print(e) # unhashable type: 'set'
frozenset supports all the read-only operations of set (membership testing, mathematical operations that return new sets) but not the mutating ones (.add(), .remove(), .discard(), .update()).
When you need to use a set as a dictionary key - for example, caching results keyed by which features are enabled, or indexing configurations by their permission sets - always use frozenset. It is directly hashable and works as a dict key out of the box.
Four Ways to Create Sets
# 1. Set literal - curly braces with elements, no colons
s1 = {1, 2, 3, 4}
# 2. set() constructor - converts any iterable, deduplicates automatically
s2 = set([1, 2, 2, 3, 3, 3]) # {1, 2, 3}
s3 = set("hello") # {'h', 'e', 'l', 'o'} - deduplicates 'l'
s4 = set(range(5)) # {0, 1, 2, 3, 4}
# 3. Set comprehension
squares = {x**2 for x in range(10)}
# {0, 1, 4, 9, 16, 25, 36, 49, 64, 81}
even_squares = {x**2 for x in range(10) if x % 2 == 0}
# {0, 4, 16, 36, 64}
# 4. Empty set - MUST use set(), not {}
empty_set = set()
empty_dict = {} # NOT a set - this is an empty dict
print(type(set())) # <class 'set'>
print(type({})) # <class 'dict'>
{} creates an empty dict, not an empty set. This is one of the most common Python traps. There is no set literal syntax for an empty set - you must always use set(). The asymmetry exists because {} predates set literals in Python's history, and changing it would break backward compatibility.
O(1) Membership Testing: Why Sets Exist
The single most important reason to use a set is x in s being O(1). This is not about convenience - it is about algorithmic complexity that determines whether your system scales.
Real timing demonstrates the practical magnitude:
import timeit
sizes = [1_000, 10_000, 100_000, 1_000_000]
for n in sizes:
lst = list(range(n))
st = set(lst)
target = n - 1 # worst case for list (last element)
t_list = timeit.timeit(lambda: target in lst, number=10_000)
t_set = timeit.timeit(lambda: target in st, number=10_000)
print(f"n={n:>8,} list={t_list:8.4f}s set={t_set:.6f}s "
f"speedup={t_list/t_set:,.0f}x")
n= 1,000 list= 0.0021s set=0.000046s speedup= 46x
n= 10,000 list= 0.0209s set=0.000047s speedup= 445x
n= 100,000 list= 0.2097s set=0.000048s speedup= 4,369x
n=1,000,000 list= 2.0941s set=0.000047s speedup=44,556x
The list time scales linearly with n. The set time is nearly constant - hovering around 47 microseconds for all sizes. At 1M elements, the speedup is nearly 45,000x. The set membership test time is literally independent of the set size (within the range of normal load factors).
Mathematical Set Operations
Python's set type implements the full algebra of mathematical sets. Both operator syntax (|, &, -, ^) and method syntax (.union(), .intersection(), etc.) are available. The method forms accept any iterable as the argument; the operator forms require both sides to be sets.
Union: All Elements from Both
With A = {1, 2, 3, 4} and B = {3, 4, 5, 6} - A-only: {1, 2}, shared: {3, 4}, B-only: {5, 6}
A ∪ B = all elements from both = {1, 2, 3, 4, 5, 6}
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a | b) # {1, 2, 3, 4, 5, 6}
print(a.union(b)) # {1, 2, 3, 4, 5, 6}
print(a.union([5, 6, 7])) # method accepts any iterable; {1,2,3,4,5,6,7}
Intersection: Elements in Both
A ∩ B = overlap only = {3, 4} (elements in both A and B)
print(a & b) # {3, 4}
print(a.intersection(b)) # {3, 4}
print(a.intersection([3, 4, 5, 99])) # {3, 4} - 5 and 99 not in a
Difference: In A but Not in B
A − B = in A only (not shared) = {1, 2}
print(a - b) # {1, 2} - in a, not in b
print(a.difference(b)) # {1, 2}
print(b - a) # {5, 6} - in b, not in a
print(b.difference(a)) # {5, 6}
Symmetric Difference: In Either but Not Both
A △ B = in either but not both = {1, 2, 5, 6} (excludes the shared {3, 4})
print(a ^ b) # {1, 2, 5, 6}
print(a.symmetric_difference(b)) # {1, 2, 5, 6}
Subset and Superset
x = {1, 2}
y = {1, 2, 3, 4}
print(x <= y) # True - x is a subset of y (all of x is in y)
print(x.issubset(y)) # True
print(y >= x) # True - y is a superset of x
print(y.issuperset(x)) # True
print(x < y) # True - strict subset (x != y)
print(x <= x) # True - non-strict (x is a subset of itself)
print(x < x) # False - strict subset requires proper containment
print(x.isdisjoint({5, 6})) # True - no elements in common
print(x.isdisjoint({2, 5})) # False - 2 is shared
In-Place Variants
These modify the set in place and are more memory-efficient for large sets than the operator forms (which create a new set):
s = {1, 2, 3}
s |= {4, 5} # union in place: s = {1, 2, 3, 4, 5}
s &= {2, 3, 4} # intersection in place: s = {2, 3, 4}
s -= {3} # difference in place: s = {2, 4}
s ^= {2, 6} # symmetric difference in place: s = {4, 6}
Complexity of Set Operations
| Operation | Complexity | Notes |
|---|---|---|
x in s | O(1) avg | Hash lookup |
s.add(x) | O(1) amortized | May trigger resize |
s.remove(x) | O(1) avg | KeyError if not present |
s.discard(x) | O(1) avg | No error if not present |
s.pop() | O(1) | Removes arbitrary element |
len(s) | O(1) | Maintained as a counter |
for x in s | O(n) | Iteration over all slots |
s | t (union) | O(len(s)+len(t)) | New set, all elements |
s & t (intersection) | O(min(len(s),len(t))) | Only need to check smaller set |
s - t (difference) | O(len(s)) | Check each element of s against t |
s ^ t (sym. diff.) | O(len(s)+len(t)) | Combine both sides |
s <= t (subset) | O(len(s)) | Each element of s checked in t |
s.isdisjoint(t) | O(min(len(s),len(t))) | Short-circuits on first common element |
set(iterable) | O(n) | n = len(iterable) |
Performance Comparison: Set vs Nested-Loop Algorithms
The real engineering win of sets is transforming nested-loop algorithms into linear-time operations.
Finding common elements between two large lists:
import time
import random
list_a = [random.randint(0, 10_000) for _ in range(100_000)]
list_b = [random.randint(0, 10_000) for _ in range(100_000)]
# Naive: O(n * m) - for every element in a, scan all of b
start = time.perf_counter()
common_naive = [x for x in list_a if x in list_b] # list_b scan is O(m) each time
t_naive = time.perf_counter() - start
# Set-based: O(n + m) - build set of b O(m), then check each element of a O(1) each
start = time.perf_counter()
set_b = set(list_b)
common_set = [x for x in list_a if x in set_b]
t_set = time.perf_counter() - start
print(f"Naive O(n*m): {t_naive:.4f}s")
print(f"Set O(n+m): {t_set:.4f}s")
print(f"Speedup: {t_naive/t_set:.0f}x")
Naive O(n*m): 2.3491s
Set O(n+m): 0.0089s
Speedup: 264x
With n=m=100,000, the naive approach requires 10 billion comparisons in the worst case. The set approach requires 200,000 operations. The fundamental algorithmic improvement comes from the O(1) membership test that sets provide.
Engineering Use Cases
Use case 1: Deduplication
# Remove duplicates from a list while preserving a result set
event_ids = [1, 2, 2, 3, 3, 3, 4, 1, 2]
unique_ids = set(event_ids) # {1, 2, 3, 4} - O(n)
# Preserve insertion order while deduplicating (Python 3.7+)
seen = set()
ordered_unique = []
for eid in event_ids:
if eid not in seen:
seen.add(eid)
ordered_unique.append(eid)
# [1, 2, 3, 4] - deduplicated, insertion order preserved
Use case 2: Fast membership testing in APIs
# Load authorized tokens once at startup
AUTHORIZED_TOKENS: set[str] = set(load_tokens_from_db())
def authenticate(token: str) -> bool:
return token in AUTHORIZED_TOKENS # O(1) per request
Use case 3: Config delta computation
old_config = {"feature_a", "feature_b", "feature_c"}
new_config = {"feature_b", "feature_c", "feature_d"}
added = new_config - old_config # {"feature_d"} - newly enabled
removed = old_config - new_config # {"feature_a"} - disabled
kept = old_config & new_config # {"feature_b", "feature_c"} - unchanged
print(f"Added: {added}, Removed: {removed}, Kept: {kept}")
Use case 4: Graph traversal - visited node tracking
def bfs(graph: dict, start: str) -> list:
visited = set() # O(1) membership check prevents revisiting
queue = [start]
result = []
while queue:
node = queue.pop(0)
if node in visited:
continue
visited.add(node)
result.append(node)
queue.extend(graph.get(node, []))
return result
Use case 5: Finding unique users across datasets
# Users who placed an order AND wrote a review (high-value segment)
ordered_users = set(db.query("SELECT user_id FROM orders"))
reviewed_users = set(db.query("SELECT user_id FROM reviews"))
high_value = ordered_users & reviewed_users # intersection
# Users who ordered but never reviewed (re-engagement target)
ordered_only = ordered_users - reviewed_users
Set Operations on Large Datasets
The O(n+m) complexity of set operations makes them the correct tool for large-dataset intersection and difference problems. Compare:
Nested loop intersection (naive):
for each element in A (n elements):
for each element in B (m elements):
if equal: add to result
→ O(n * m) time
Set-based intersection:
1. Build set from A: O(n)
2. For each element in B: O(1) membership test against A-set: O(m)
→ O(n + m) time
For n=10M, m=10M:
Nested: 10^14 operations - infeasible
Set: 2 * 10^7 operations - sub-second
# Real-world: find IPs present in two different nginx log files
def load_ips_from_log(filename: str) -> set[str]:
ips = set()
with open(filename) as f:
for line in f:
# Nginx combined log format: IP is the first field
ip = line.split()[0]
ips.add(ip)
return ips
# This runs in O(n) for each file - building the set as we parse
log_a_ips = load_ips_from_log("access_jan.log")
log_b_ips = load_ips_from_log("access_feb.log")
# All operations below are O(min/max of set sizes), not O(n*m)
both_months = log_a_ips & log_b_ips # returning IPs
only_jan = log_a_ips - log_b_ips # one-time visitors in Jan
only_feb = log_b_ips - log_a_ips # new visitors in Feb
either_month = log_a_ips | log_b_ips # total unique IPs across both
Frozensets as Dict Keys and in Sets of Sets
The canonical use of frozenset is as a hashable set - a set you can use wherever a set is conceptually correct but hashability is required:
# Caching results for a set of features (order doesn't matter)
feature_cache: dict[frozenset, object] = {}
def compute_result(features: set) -> object:
key = frozenset(features) # normalize to hashable form
if key not in feature_cache:
feature_cache[key] = expensive_computation(features)
return feature_cache[key]
# Works regardless of the order features were specified
r1 = compute_result({"auth", "logging", "rate_limit"})
r2 = compute_result({"logging", "rate_limit", "auth"}) # same key, hits cache
assert r1 is r2 # same object returned
# Set of unique permission bundles
admin = frozenset(["read", "write", "delete", "admin"])
editor = frozenset(["read", "write"])
viewer = frozenset(["read"])
role_sets = {admin, editor, viewer} # set of frozensets - valid
# Check if a given permission set is a known role
user_perms = frozenset(["read", "write"])
print(user_perms in role_sets) # True - matches editor
Iteration Order: What CPython Does vs What Python Guarantees
CPython's set iteration visits elements in hash table slot order - not insertion order. The order depends on hash values, table size, and collision resolution, and it can change when the table resizes.
s = set()
for i in [5, 3, 1, 4, 1, 9, 2, 6]:
s.add(i)
print(f"After adding {i}: {s}")
# The displayed order of elements reflects hash slot positions,
# not the order you added them.
The Python language specification makes no guarantee about set iteration order. This means:
- Do not write code that depends on sets iterating in a particular order
- Do not sort a set in your head by assuming small integers iterate in sorted order (they often appear sorted for small integers in CPython, but this is a coincidence of hash values for small ints, not a guarantee)
- If you need insertion-order iteration over unique elements, combine a list and a set: iterate the list, skip elements already in the seen-set
# Pattern: deduplicate while preserving insertion order
def unique_ordered(iterable):
seen = set()
result = []
for item in iterable:
if item not in seen:
seen.add(item)
result.append(item)
return result
print(unique_ordered([3, 1, 4, 1, 5, 9, 2, 6, 5]))
# [3, 1, 4, 5, 9, 2, 6] - duplicates removed, insertion order kept
Set Methods: Complete API Reference
s = {1, 2, 3}
# Adding elements
s.add(4) # add single element; no-op if already present
s.update([5, 6, 5]) # add all elements from iterable; equivalent to |=
# Removing elements
s.remove(4) # remove element; KeyError if not present
s.discard(99) # remove if present; no error if not present
popped = s.pop() # remove and return an arbitrary element
# Copying
s2 = s.copy() # shallow copy of the set
# Clearing
s.clear() # remove all elements; s becomes set()
# Query methods (non-mutating)
a = {1, 2, 3}
b = {3, 4, 5}
print(a.issubset(b)) # False
print(a.issuperset({1, 2})) # True
print(a.isdisjoint({6, 7})) # True
Pitfalls Reference
Pitfall 1: {} is an empty dict, not an empty set
x = {}
x.add(1) # AttributeError: 'dict' object has no attribute 'add'
# Fix: always use set() for an empty set
x = set()
x.add(1) # {1}
Pitfall 2: Set elements must be hashable
{[1, 2], [3, 4]} # TypeError: unhashable type: 'list'
{{1: 2}, {3: 4}} # TypeError: unhashable type: 'dict'
# Fix: use tuples for list-like composite elements, frozenset for set-like ones
{(1, 2), (3, 4)} # OK
{frozenset([1, 2])} # OK
Pitfall 3: Iteration order is not guaranteed - do not rely on it
s = {10, 20, 30}
first = next(iter(s)) # Might be 10, 20, or 30 - implementation-defined
# If you need ordered unique elements, use the unique_ordered() pattern above
Pitfall 4: Mutating a set while iterating raises RuntimeError
s = {1, 2, 3, 4, 5}
for x in s:
if x % 2 == 0:
s.discard(x) # RuntimeError: Set changed size during iteration
# Fix: iterate over a copy
for x in set(s): # or list(s)
if x % 2 == 0:
s.discard(x)
Pitfall 5: frozenset for set-of-sets - sets cannot contain regular sets
groups = {set([1, 2]), set([3, 4])} # TypeError: unhashable type: 'set'
groups = {frozenset([1, 2]), frozenset([3, 4])} # OK
Pitfall 6: Assuming set is faster than list for everything
Sets trade memory for speed. For very small collections (fewer than ~10 elements), list membership may be comparable or even faster because the constant factor overhead of hashing can exceed the cost of a short linear scan. Profile before optimizing tiny inner-loop collections.
Quick Reference Cheatsheet
| Operation | Operator | Method | Complexity | Notes |
|---|---|---|---|---|
| Create | {1, 2, 3} | set(iterable) | O(n) | {} is dict! |
| Empty set | - | set() | O(1) | No literal form |
| Membership | x in s | - | O(1) avg | Primary advantage |
| Add element | - | s.add(x) | O(1) amortized | No-op if present |
| Remove (strict) | - | s.remove(x) | O(1) avg | KeyError if absent |
| Remove (safe) | - | s.discard(x) | O(1) avg | No error if absent |
| Pop | - | s.pop() | O(1) | Arbitrary element |
| Update | s |= t | s.update(t) | O(len(t)) | Mutates s |
| Union | s | t | s.union(t) | O(n+m) | New set |
| Intersection | s & t | s.intersection(t) | O(min(n,m)) | New set |
| Difference | s - t | s.difference(t) | O(len(s)) | New set |
| Sym. diff. | s ^ t | s.symmetric_difference(t) | O(n+m) | New set |
| Subset | s <= t | s.issubset(t) | O(len(s)) | |
| Superset | s >= t | s.issuperset(t) | O(len(s)) | |
| Disjoint | - | s.isdisjoint(t) | O(min(n,m)) | No common elements |
| Frozenset | - | frozenset(iterable) | O(n) | Hashable, immutable |
| Comprehension | {expr for x in it} | - | O(n) | Auto-deduplicates |
| Copy | - | s.copy() | O(n) | Shallow copy |
| Clear | - | s.clear() | O(n) | Empties the set |
Interview Questions and Answers
Q1: Why is x in set O(1) but x in list is O(n)? Explain mechanically.
A list is a contiguous array of object pointers. To find x, Python must compare x against list[0], list[1], list[2], ... until it finds a match or exhausts the list. In the worst case (element not present, or at the end), this is n comparisons. A set is a hash table. Python computes hash(x) (O(1) for built-in types), uses it to jump directly to a specific slot in the internal array, and checks if that slot holds x. The jump is a constant-time array index operation. Regardless of whether the set has 10 or 10 million elements, the same number of operations are performed. The set's size only affects the hash table's load factor; as long as the load factor is below the threshold (CPython resizes at 2/3), the expected probe count is bounded by a small constant.
Q2: What is the difference between set and frozenset? When would you use each?
set is mutable and unhashable - you can add and remove elements, but you cannot use a set as a dictionary key or as an element of another set. frozenset is immutable and hashable - once created it cannot change, so its hash is stable and it can be used as a dict key or as an element of another set. Use set for local accumulation, filtering, and membership testing where you need to add/remove elements. Use frozenset when you need a set to serve as a dict key (e.g., caching results keyed by a feature set), as an element of another set (set of sets), or when you want to signal immutability as part of an API contract. Both support all the mathematical operations (|, &, -, ^).
Q3: How do sets help with deduplication, and how does the order work?
set(iterable) consumes an iterable and inserts each element into a hash table. When an element is already present (determined by hash and equality), the insertion is a no-op. The result is a collection with exactly one copy of each unique element. Order is NOT preserved - the iteration order of a set reflects internal hash table slot positions, not insertion order. If you need to deduplicate while preserving insertion order, combine a list and a set: iterate the input, check if item not in seen_set, and append to a result list only on first occurrence.
Q4: What is the time complexity of set union, intersection, and difference? Why does intersection have a special advantage?
Union is O(n+m) - every element from both sets must be considered. Difference is O(n) - for each element of the first set, check if it is in the second set (O(1) per check). Intersection can be O(min(n, m)) because CPython's implementation iterates over the smaller set and checks each element against the larger set. This is an asymptotic win when the sets are very different sizes - checking a small set's elements against a large set's hash table is much faster than an all-pairs comparison. For symmetric difference, it is O(n+m) because all elements of both sets must be evaluated.
Q5: What are the hashability requirements for set elements, and why?
Every element in a set must be hashable - it must implement both __hash__() (returning a stable integer) and __eq__(). This is required because sets use a hash table for storage. When you add an element, the hash determines which slot to use. When checking membership, the hash determines which slot to look in. If an element's hash could change (because the element is mutable), then after a mutation, the element would be in the wrong slot and membership testing would fail to find it. This would silently corrupt the set's invariant (every element is findable). CPython lists, dicts, and sets themselves do not implement __hash__ and cannot be set elements. Tuples can be set elements if all their elements are hashable; frozensets can always be set elements.
Q6: What is the {} trap, and what are the consequences of hitting it?
In Python, {} creates an empty dict, not an empty set. This is because curly-brace dict syntax predates set syntax in Python's history. The consequences: code like seen = {}; seen.add(1) raises AttributeError: 'dict' object has no attribute 'add', which is immediately obvious. The more dangerous form is when {} is used in a type annotation or as a default argument where both dict and set might be semantically plausible, leading to subtle bugs. The fix is always to use set() for an empty set. Set literals work for non-empty sets ({1, 2, 3}) because the absence of : distinguishes them from dicts. The asymmetry is unfortunate but immovable - use set() consistently for empty sets and document it in code reviews.
Graded Practice Challenges
Level 1 - Predict the Output
a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}
print(len(a | b))
print(len(a & b))
print(len(a - b))
print(len(a ^ b))
a.discard(3)
a.discard(99) # no error
print(len(a))
c = frozenset([10, 20, 30])
d = {c: "frozen"}
print(d[frozenset([20, 10, 30])])
print({x for x in range(10) if x % 3 == 0})
Show Answer
8
2
3
6
4
frozen
{0, 3, 6, 9}
Explanation:
a | b = {1,2,3,4,5,6,7,8}→ len = 8a & b = {4,5}→ len = 2a - b = {1,2,3}(in a but not b) → len = 3a ^ b = {1,2,3,6,7,8}(not in both) → len = 6- After
a.discard(3): a = {1,2,4,5}, len = 4;a.discard(99)is a no-op frozenset([20,10,30]) == frozenset([10,20,30])because sets are order-independent and frozensets compare by element equality → dict lookup succeeds, returns "frozen"- Set comprehension: x in range(10) where x % 3 == 0 → {0, 3, 6, 9}
Level 2 - Debug the Code
The following function is supposed to find words that appear in text A but not in text B (the "exclusive vocabulary" of document A). It has three bugs:
def exclusive_vocabulary(text_a: str, text_b: str) -> list:
words_a = set(text_a.split())
words_b = set(text_b.split())
exclusive = words_a & words_b # Bug 1
result = {} # Bug 2
for word in exclusive:
result.add(word.lower())
return sorted(result) # Bug 3 (consequence of Bug 2)
doc_a = "The cat sat on the mat"
doc_b = "The dog sat on the rug"
print(exclusive_vocabulary(doc_a, doc_b))
Show Answer
Bug 1: words_a & words_b is the intersection - words in BOTH. To find words exclusive to A (in A but not B), use words_a - words_b.
Bug 2: result = {} creates an empty dict, not an empty set. The subsequent result.add(word.lower()) raises AttributeError: 'dict' object has no attribute 'add'. Should be result = set().
Bug 3: This is a consequence of Bug 2 - with the correct set(), sorted(result) works fine and sorts the resulting strings alphabetically. It is not independently a bug, but it would produce a wrong result if the logic in Bug 1 were also wrong.
Fixed code:
def exclusive_vocabulary(text_a: str, text_b: str) -> list:
words_a = set(text_a.lower().split()) # normalize case at source
words_b = set(text_b.lower().split())
exclusive = words_a - words_b # Fixed Bug 1: difference, not intersection
result = set() # Fixed Bug 2: set(), not {}
for word in exclusive:
result.add(word)
return sorted(result) # Now works correctly
doc_a = "The cat sat on the mat"
doc_b = "The dog sat on the rug"
print(exclusive_vocabulary(doc_a, doc_b))
# ['cat', 'mat'] - "cat" and "mat" appear in doc_a but not doc_b
# "the", "sat", "on" appear in both, so they are excluded
Level 3 - Design Challenge: 10 Million Line Log Analysis
You are given two nginx access log files. Each file has approximately 10 million lines. Each line begins with an IP address (the first whitespace-separated token). You need to efficiently compute:
- IPs that appear in both log files (returning visitors)
- IPs that appear only in file A (visitors who left and did not return)
- IPs that appear only in file B (new visitors in the second period)
- The total count of unique IPs across both files
Constraints:
- Files are too large to hold fully in memory as a list of all lines (assume 2GB each)
- Each file fits in memory as a set of IP strings (typical IPv4 addresses are 7-15 bytes; 10M unique IPs take roughly 500MB as a Python set)
- The solution must complete in under 30 seconds on a modern laptop
- Design for correctness first, then optimize
Design the complete solution: data structures, algorithm, time and space complexity analysis, and the Python implementation. Then answer: how would you adapt this design if the IP sets themselves did not fit in memory?
Show Answer
Algorithm Design
The key insight is that set operations on 10M-element sets are O(n+m) - linear in the total number of unique elements, not quadratic. The naive approach of nested loops would be O(nm) ≈ 10^14 operations, which is completely infeasible. The set-based approach is O(n+m) ≈ 210^7 operations, completing in seconds.
Complexity Analysis:
- Loading each file into a set: O(n) and O(m) respectively
- All four computations (
&,-,-,|) are each O(n+m) - Total: O(n+m) time, O(n+m) space for the two sets
Python Implementation:
import time
from pathlib import Path
def extract_ips(filepath: str) -> set[str]:
"""
Stream through a log file line by line and extract unique IPs.
Memory usage: O(unique IPs), not O(total lines).
Time: O(n) where n = number of lines.
"""
ips: set[str] = set()
with open(filepath, 'r', buffering=1024 * 1024) as f: # 1MB read buffer
for line in f:
# Nginx combined log: IP is always the first token
# strip() + split() handles CRLF and extra whitespace
if line := line.strip():
ip = line.split(' ', 1)[0] # split once, take first part
ips.add(ip)
return ips
def analyze_logs(file_a: str, file_b: str) -> dict:
"""
Analyze two log files and compute IP set operations.
Returns a dict with all four computed sets and timing information.
"""
t0 = time.perf_counter()
print(f"Loading {file_a}...")
ips_a = extract_ips(file_a)
t1 = time.perf_counter()
print(f" Loaded {len(ips_a):,} unique IPs in {t1-t0:.2f}s")
print(f"Loading {file_b}...")
ips_b = extract_ips(file_b)
t2 = time.perf_counter()
print(f" Loaded {len(ips_b):,} unique IPs in {t2-t1:.2f}s")
# All four set operations - each is O(n+m)
returning = ips_a & ips_b # intersection
only_in_a = ips_a - ips_b # difference: A minus B
only_in_b = ips_b - ips_a # difference: B minus A
total_unique = ips_a | ips_b # union
t3 = time.perf_counter()
return {
"returning_visitors": returning,
"only_in_a": only_in_a,
"only_in_b": only_in_b,
"total_unique": total_unique,
"counts": {
"file_a_unique": len(ips_a),
"file_b_unique": len(ips_b),
"returning": len(returning),
"only_in_a": len(only_in_a),
"only_in_b": len(only_in_b),
"total_unique": len(total_unique),
},
"timing": {
"load_a": t1 - t0,
"load_b": t2 - t1,
"set_ops": t3 - t2,
"total": t3 - t0,
}
}
# --- Simulation for testing without real 10M-line files ---
import random
import tempfile
import os
def generate_test_log(filepath: str, n_lines: int, ip_range: tuple) -> None:
"""Generate a synthetic log file with n_lines entries."""
lo, hi = ip_range
with open(filepath, 'w') as f:
for _ in range(n_lines):
ip = f"192.168.{random.randint(0,255)}.{random.randint(lo, hi)}"
f.write(f"{ip} - - [01/Jan/2024:00:00:00 +0000] \"GET / HTTP/1.1\" 200 1234\n")
# Generate test files (100K lines for demo; scale to 10M for production)
with tempfile.TemporaryDirectory() as tmpdir:
file_a = os.path.join(tmpdir, "access_jan.log")
file_b = os.path.join(tmpdir, "access_feb.log")
generate_test_log(file_a, n_lines=100_000, ip_range=(0, 200))
generate_test_log(file_b, n_lines=100_000, ip_range=(100, 255))
results = analyze_logs(file_a, file_b)
print("\n=== Results ===")
for key, val in results["counts"].items():
print(f" {key}: {val:,}")
print(f"\nTiming: {results['timing']['total']:.2f}s total")
print(f" Load A: {results['timing']['load_a']:.2f}s")
print(f" Load B: {results['timing']['load_b']:.2f}s")
print(f" Set ops: {results['timing']['set_ops']:.4f}s")
When IP sets do not fit in memory (scale beyond RAM):
If even the set of unique IPs does not fit in RAM (e.g., IPv6 addresses, or truly massive scale), use one of these approaches:
-
Bloom filters (probabilistic): A Bloom filter for 10M elements requires roughly 95MB at 1% false-positive rate. Build a Bloom filter from file A, then query each IP from file B. False positives mean some "only in B" IPs are incorrectly classified as "returning" - acceptable for analytics, not for security decisions.
-
External sort and merge: Sort each file by IP (using Unix
sortwhich handles files larger than RAM via external merge sort). Then stream both sorted files simultaneously with a two-pointer merge, emitting classification as you go. Time: O(n log n) for sorting + O(n) for merging. Memory: O(1) beyond the sort buffers. -
Distributed processing: Use Apache Spark's
RDD.intersection(),subtract(), andunion()which distribute the work across a cluster. Each partition handles a subset of the IP space (hash partitioning ensures the same IP always goes to the same partition). -
HyperLogLog for count only: If you only need the count of unique IPs (not the actual IP addresses), HyperLogLog provides a count estimate with ~2% error using only kilobytes of memory.
Key Takeaways
- Python sets are hash tables that store only keys - conceptually a dict where every value is a constant that is never stored; this is why elements must be hashable
x in setis O(1) average regardless of set size;x in listis O(n) - this is not a minor optimization but a fundamental algorithmic difference that determines system scalability{}creates an empty dict, not an empty set; always useset()for an empty set - this is one of Python's most common trapsfrozensetis the immutable, hashable counterpart toset; use it as a dict key, as an element of another set, or whenever you need to pass a set of things as a single hashable unit- The four mathematical operations - union (
|), intersection (&), difference (-), symmetric difference (^) - all run in O(n+m) or better; intersection is O(min(n,m)) because CPython iterates the smaller set - In-place variants (
|=,&=,-=,^=) mutate the set and avoid allocating a new set object - prefer them in tight loops - Set-based algorithms transform O(n*m) nested-loop problems into O(n+m) problems - this is the set's highest engineering value
- Iteration order is unordered and implementation-defined - never rely on sets iterating in any particular sequence; if you need ordered unique elements, combine a list (for order) with a set (for O(1) deduplication)
- Elements must be hashable: no lists, dicts, or sets as elements; use tuples for list-like composite elements and frozensets for set-like composite elements
- For 10M-line log analysis, set operations are the correct tool: load each file's IPs into a set (O(n) each), then compute intersection, differences, and union - all O(n+m), completing in seconds rather than hours
