Skip to main content

Sets and Mathematical Operations - Hash-Based Membership Testing

Reading time: ~20 minutes | Level: Foundation → Engineering

Opening Challenge

Predict every line of output before continuing:

s = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3}
print(s)
print(type({}))

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a & b)
print(a | b)
print(a - b)
print(b - a)
print(a ^ b)
print(a <= {1, 2, 3, 4, 5})

fs = frozenset([1, 2, 3])
d = {fs: "frozen key"}
print(d[frozenset([1, 2, 3])])

try:
bad = {[1, 2], [3, 4]}
except TypeError as e:
print(e)

import timeit
lst = list(range(1_000_000))
st = set(lst)
t1 = timeit.timeit(lambda: 999_999 in lst, number=1000)
t2 = timeit.timeit(lambda: 999_999 in st, number=1000)
print(f"list: {t1:.4f}s set: {t2:.6f}s")
{1, 2, 3, 4, 5, 6, 9}
<class 'dict'>
{3, 4}
{1, 2, 3, 4, 5, 6}
{1, 2}
{5, 6}
{1, 2, 5, 6}
True
frozen key
unhashable type: 'list'
list: 0.1823s set: 0.000047s

Three surprises: {} is a dict, not a set. The displayed order of s is not insertion order - it reflects internal hash table layout. And the timing gap between list and set membership testing for 1M elements is approximately 3000x. That final number is not a micro-optimization - it is the difference between a system that handles 100 requests per second and one that handles 300,000. This lesson explains why, all the way down to the hash table.

What You Will Learn

  • How CPython implements sets as a hash table without values - only keys
  • set vs frozenset: immutability, hashability, and when to use each
  • The four creation forms: literal, constructor, comprehension, and the empty-set trap
  • O(1) membership testing: why it is the primary reason to choose sets over lists
  • All mathematical set operations with ASCII Venn diagrams and both operator and method syntax
  • In-place variants and their performance characteristics
  • Quantified performance comparison: x in list O(n) vs x in set O(1) with real timing
  • Engineering use cases: deduplication, fast membership testing, delta computation, config diffing
  • Frozensets as dict keys and set-of-sets elements
  • Iteration order: what CPython does vs what the language guarantees
  • Every common pitfall with clear explanations and fixes

Prerequisites

You should understand Python lists and dicts at the level of the previous lessons. The hash table concepts covered in the Dictionary lesson apply directly to sets - a set is essentially a dict with keys only and no values.

What Is a Set? The Internal Reality

A Python set is an unordered collection of unique hashable elements. The definition sounds simple, but each word is load-bearing:

  • Unordered: the iteration order is determined by hash values and table layout, not by insertion order
  • Unique: duplicates are silently discarded; inserting an element that already exists is a no-op
  • Hashable: every element must implement __hash__() and __eq__() - the same requirement as dict keys

The CPython implementation is a hash table that stores only keys - no values. You can think of it as a dict where every value is the implicit constant True, except that CPython optimizes away the value storage entirely:

CPython set internal layout (simplified):

Hash table slots:
slot 0: EMPTY
slot 1: element=5 (hash(5) % table_size == 1)
slot 2: EMPTY
slot 3: element=3 (hash(3) % table_size == 3)
slot 4: element=1 (hash(1) % table_size == 4 - after probing, say)
slot 5: EMPTY
slot 6: EMPTY
slot 7: element=2 (hash(2) % table_size == 7)

No values stored - just the elements themselves.

This is why x in s is O(1): Python computes hash(x), jumps to the corresponding slot, and checks if that slot holds x. It is a direct array jump, not a scan.

note

CPython sets use open addressing with a probing strategy similar to dicts (perturbation-based). Each slot is either empty, occupied, or a dummy/tombstone (marking a previously occupied slot that was deleted). The same load factor threshold (approximately 2/3) triggers table doubling and rehashing.

set vs frozenset: Immutability Matters

Python provides two set types:

Propertysetfrozenset
MutableYesNo
HashableNoYes
Dict keyNoYes
Element of setNoYes
Creation{1, 2} or set()frozenset([1, 2])

frozenset is the immutable counterpart to set. Because it cannot change, its hash can be computed once and cached, making it hashable. This enables two use cases impossible with set:

# frozenset as a dict key
permissions_cache = {}
read_write = frozenset({"read", "write"})
permissions_cache[read_write] = "standard access"
print(permissions_cache[frozenset({"read", "write"})]) # "standard access"

# frozenset as an element of another set (set of sets)
group_a = frozenset([1, 2, 3])
group_b = frozenset([3, 4, 5])
all_groups = {group_a, group_b} # set of frozensets - valid!
print(all_groups)
# {frozenset({1, 2, 3}), frozenset({3, 4, 5})}

# But set of sets fails
try:
bad = {set([1, 2]), set([3, 4])}
except TypeError as e:
print(e) # unhashable type: 'set'

frozenset supports all the read-only operations of set (membership testing, mathematical operations that return new sets) but not the mutating ones (.add(), .remove(), .discard(), .update()).

tip

When you need to use a set as a dictionary key - for example, caching results keyed by which features are enabled, or indexing configurations by their permission sets - always use frozenset. It is directly hashable and works as a dict key out of the box.

Four Ways to Create Sets

# 1. Set literal - curly braces with elements, no colons
s1 = {1, 2, 3, 4}

# 2. set() constructor - converts any iterable, deduplicates automatically
s2 = set([1, 2, 2, 3, 3, 3]) # {1, 2, 3}
s3 = set("hello") # {'h', 'e', 'l', 'o'} - deduplicates 'l'
s4 = set(range(5)) # {0, 1, 2, 3, 4}

# 3. Set comprehension
squares = {x**2 for x in range(10)}
# {0, 1, 4, 9, 16, 25, 36, 49, 64, 81}

even_squares = {x**2 for x in range(10) if x % 2 == 0}
# {0, 4, 16, 36, 64}

# 4. Empty set - MUST use set(), not {}
empty_set = set()
empty_dict = {} # NOT a set - this is an empty dict

print(type(set())) # <class 'set'>
print(type({})) # <class 'dict'>
danger

{} creates an empty dict, not an empty set. This is one of the most common Python traps. There is no set literal syntax for an empty set - you must always use set(). The asymmetry exists because {} predates set literals in Python's history, and changing it would break backward compatibility.

O(1) Membership Testing: Why Sets Exist

The single most important reason to use a set is x in s being O(1). This is not about convenience - it is about algorithmic complexity that determines whether your system scales.

Real timing demonstrates the practical magnitude:

import timeit

sizes = [1_000, 10_000, 100_000, 1_000_000]

for n in sizes:
lst = list(range(n))
st = set(lst)
target = n - 1 # worst case for list (last element)

t_list = timeit.timeit(lambda: target in lst, number=10_000)
t_set = timeit.timeit(lambda: target in st, number=10_000)

print(f"n={n:>8,} list={t_list:8.4f}s set={t_set:.6f}s "
f"speedup={t_list/t_set:,.0f}x")
n= 1,000 list= 0.0021s set=0.000046s speedup= 46x
n= 10,000 list= 0.0209s set=0.000047s speedup= 445x
n= 100,000 list= 0.2097s set=0.000048s speedup= 4,369x
n=1,000,000 list= 2.0941s set=0.000047s speedup=44,556x

The list time scales linearly with n. The set time is nearly constant - hovering around 47 microseconds for all sizes. At 1M elements, the speedup is nearly 45,000x. The set membership test time is literally independent of the set size (within the range of normal load factors).

Mathematical Set Operations

Python's set type implements the full algebra of mathematical sets. Both operator syntax (|, &, -, ^) and method syntax (.union(), .intersection(), etc.) are available. The method forms accept any iterable as the argument; the operator forms require both sides to be sets.

Union: All Elements from Both

With A = {1, 2, 3, 4} and B = {3, 4, 5, 6} - A-only: {1, 2}, shared: {3, 4}, B-only: {5, 6}

A ∪ B = all elements from both = {1, 2, 3, 4, 5, 6}

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(a | b) # {1, 2, 3, 4, 5, 6}
print(a.union(b)) # {1, 2, 3, 4, 5, 6}
print(a.union([5, 6, 7])) # method accepts any iterable; {1,2,3,4,5,6,7}

Intersection: Elements in Both

A ∩ B = overlap only = {3, 4} (elements in both A and B)

print(a & b) # {3, 4}
print(a.intersection(b)) # {3, 4}
print(a.intersection([3, 4, 5, 99])) # {3, 4} - 5 and 99 not in a

Difference: In A but Not in B

A − B = in A only (not shared) = {1, 2}

print(a - b) # {1, 2} - in a, not in b
print(a.difference(b)) # {1, 2}

print(b - a) # {5, 6} - in b, not in a
print(b.difference(a)) # {5, 6}

Symmetric Difference: In Either but Not Both

A △ B = in either but not both = {1, 2, 5, 6} (excludes the shared {3, 4})

print(a ^ b) # {1, 2, 5, 6}
print(a.symmetric_difference(b)) # {1, 2, 5, 6}

Subset and Superset

x = {1, 2}
y = {1, 2, 3, 4}

print(x <= y) # True - x is a subset of y (all of x is in y)
print(x.issubset(y)) # True

print(y >= x) # True - y is a superset of x
print(y.issuperset(x)) # True

print(x < y) # True - strict subset (x != y)
print(x <= x) # True - non-strict (x is a subset of itself)
print(x < x) # False - strict subset requires proper containment

print(x.isdisjoint({5, 6})) # True - no elements in common
print(x.isdisjoint({2, 5})) # False - 2 is shared

In-Place Variants

These modify the set in place and are more memory-efficient for large sets than the operator forms (which create a new set):

s = {1, 2, 3}
s |= {4, 5} # union in place: s = {1, 2, 3, 4, 5}
s &= {2, 3, 4} # intersection in place: s = {2, 3, 4}
s -= {3} # difference in place: s = {2, 4}
s ^= {2, 6} # symmetric difference in place: s = {4, 6}

Complexity of Set Operations

OperationComplexityNotes
x in sO(1) avgHash lookup
s.add(x)O(1) amortizedMay trigger resize
s.remove(x)O(1) avgKeyError if not present
s.discard(x)O(1) avgNo error if not present
s.pop()O(1)Removes arbitrary element
len(s)O(1)Maintained as a counter
for x in sO(n)Iteration over all slots
s | t (union)O(len(s)+len(t))New set, all elements
s & t (intersection)O(min(len(s),len(t)))Only need to check smaller set
s - t (difference)O(len(s))Check each element of s against t
s ^ t (sym. diff.)O(len(s)+len(t))Combine both sides
s <= t (subset)O(len(s))Each element of s checked in t
s.isdisjoint(t)O(min(len(s),len(t)))Short-circuits on first common element
set(iterable)O(n)n = len(iterable)

Performance Comparison: Set vs Nested-Loop Algorithms

The real engineering win of sets is transforming nested-loop algorithms into linear-time operations.

Finding common elements between two large lists:

import time
import random

list_a = [random.randint(0, 10_000) for _ in range(100_000)]
list_b = [random.randint(0, 10_000) for _ in range(100_000)]

# Naive: O(n * m) - for every element in a, scan all of b
start = time.perf_counter()
common_naive = [x for x in list_a if x in list_b] # list_b scan is O(m) each time
t_naive = time.perf_counter() - start

# Set-based: O(n + m) - build set of b O(m), then check each element of a O(1) each
start = time.perf_counter()
set_b = set(list_b)
common_set = [x for x in list_a if x in set_b]
t_set = time.perf_counter() - start

print(f"Naive O(n*m): {t_naive:.4f}s")
print(f"Set O(n+m): {t_set:.4f}s")
print(f"Speedup: {t_naive/t_set:.0f}x")
Naive O(n*m): 2.3491s
Set O(n+m): 0.0089s
Speedup: 264x

With n=m=100,000, the naive approach requires 10 billion comparisons in the worst case. The set approach requires 200,000 operations. The fundamental algorithmic improvement comes from the O(1) membership test that sets provide.

Engineering Use Cases

Use case 1: Deduplication

# Remove duplicates from a list while preserving a result set
event_ids = [1, 2, 2, 3, 3, 3, 4, 1, 2]
unique_ids = set(event_ids) # {1, 2, 3, 4} - O(n)

# Preserve insertion order while deduplicating (Python 3.7+)
seen = set()
ordered_unique = []
for eid in event_ids:
if eid not in seen:
seen.add(eid)
ordered_unique.append(eid)
# [1, 2, 3, 4] - deduplicated, insertion order preserved

Use case 2: Fast membership testing in APIs

# Load authorized tokens once at startup
AUTHORIZED_TOKENS: set[str] = set(load_tokens_from_db())

def authenticate(token: str) -> bool:
return token in AUTHORIZED_TOKENS # O(1) per request

Use case 3: Config delta computation

old_config = {"feature_a", "feature_b", "feature_c"}
new_config = {"feature_b", "feature_c", "feature_d"}

added = new_config - old_config # {"feature_d"} - newly enabled
removed = old_config - new_config # {"feature_a"} - disabled
kept = old_config & new_config # {"feature_b", "feature_c"} - unchanged

print(f"Added: {added}, Removed: {removed}, Kept: {kept}")

Use case 4: Graph traversal - visited node tracking

def bfs(graph: dict, start: str) -> list:
visited = set() # O(1) membership check prevents revisiting
queue = [start]
result = []

while queue:
node = queue.pop(0)
if node in visited:
continue
visited.add(node)
result.append(node)
queue.extend(graph.get(node, []))

return result

Use case 5: Finding unique users across datasets

# Users who placed an order AND wrote a review (high-value segment)
ordered_users = set(db.query("SELECT user_id FROM orders"))
reviewed_users = set(db.query("SELECT user_id FROM reviews"))

high_value = ordered_users & reviewed_users # intersection

# Users who ordered but never reviewed (re-engagement target)
ordered_only = ordered_users - reviewed_users

Set Operations on Large Datasets

The O(n+m) complexity of set operations makes them the correct tool for large-dataset intersection and difference problems. Compare:

Nested loop intersection (naive):
for each element in A (n elements):
for each element in B (m elements):
if equal: add to result
→ O(n * m) time

Set-based intersection:
1. Build set from A: O(n)
2. For each element in B: O(1) membership test against A-set: O(m)
→ O(n + m) time

For n=10M, m=10M:
Nested: 10^14 operations - infeasible
Set: 2 * 10^7 operations - sub-second
# Real-world: find IPs present in two different nginx log files
def load_ips_from_log(filename: str) -> set[str]:
ips = set()
with open(filename) as f:
for line in f:
# Nginx combined log format: IP is the first field
ip = line.split()[0]
ips.add(ip)
return ips

# This runs in O(n) for each file - building the set as we parse
log_a_ips = load_ips_from_log("access_jan.log")
log_b_ips = load_ips_from_log("access_feb.log")

# All operations below are O(min/max of set sizes), not O(n*m)
both_months = log_a_ips & log_b_ips # returning IPs
only_jan = log_a_ips - log_b_ips # one-time visitors in Jan
only_feb = log_b_ips - log_a_ips # new visitors in Feb
either_month = log_a_ips | log_b_ips # total unique IPs across both

Frozensets as Dict Keys and in Sets of Sets

The canonical use of frozenset is as a hashable set - a set you can use wherever a set is conceptually correct but hashability is required:

# Caching results for a set of features (order doesn't matter)
feature_cache: dict[frozenset, object] = {}

def compute_result(features: set) -> object:
key = frozenset(features) # normalize to hashable form
if key not in feature_cache:
feature_cache[key] = expensive_computation(features)
return feature_cache[key]

# Works regardless of the order features were specified
r1 = compute_result({"auth", "logging", "rate_limit"})
r2 = compute_result({"logging", "rate_limit", "auth"}) # same key, hits cache
assert r1 is r2 # same object returned

# Set of unique permission bundles
admin = frozenset(["read", "write", "delete", "admin"])
editor = frozenset(["read", "write"])
viewer = frozenset(["read"])

role_sets = {admin, editor, viewer} # set of frozensets - valid

# Check if a given permission set is a known role
user_perms = frozenset(["read", "write"])
print(user_perms in role_sets) # True - matches editor

Iteration Order: What CPython Does vs What Python Guarantees

CPython's set iteration visits elements in hash table slot order - not insertion order. The order depends on hash values, table size, and collision resolution, and it can change when the table resizes.

s = set()
for i in [5, 3, 1, 4, 1, 9, 2, 6]:
s.add(i)
print(f"After adding {i}: {s}")

# The displayed order of elements reflects hash slot positions,
# not the order you added them.

The Python language specification makes no guarantee about set iteration order. This means:

  1. Do not write code that depends on sets iterating in a particular order
  2. Do not sort a set in your head by assuming small integers iterate in sorted order (they often appear sorted for small integers in CPython, but this is a coincidence of hash values for small ints, not a guarantee)
  3. If you need insertion-order iteration over unique elements, combine a list and a set: iterate the list, skip elements already in the seen-set
# Pattern: deduplicate while preserving insertion order
def unique_ordered(iterable):
seen = set()
result = []
for item in iterable:
if item not in seen:
seen.add(item)
result.append(item)
return result

print(unique_ordered([3, 1, 4, 1, 5, 9, 2, 6, 5]))
# [3, 1, 4, 5, 9, 2, 6] - duplicates removed, insertion order kept

Set Methods: Complete API Reference

s = {1, 2, 3}

# Adding elements
s.add(4) # add single element; no-op if already present
s.update([5, 6, 5]) # add all elements from iterable; equivalent to |=

# Removing elements
s.remove(4) # remove element; KeyError if not present
s.discard(99) # remove if present; no error if not present
popped = s.pop() # remove and return an arbitrary element

# Copying
s2 = s.copy() # shallow copy of the set

# Clearing
s.clear() # remove all elements; s becomes set()

# Query methods (non-mutating)
a = {1, 2, 3}
b = {3, 4, 5}
print(a.issubset(b)) # False
print(a.issuperset({1, 2})) # True
print(a.isdisjoint({6, 7})) # True

Pitfalls Reference

Pitfall 1: {} is an empty dict, not an empty set

x = {}
x.add(1) # AttributeError: 'dict' object has no attribute 'add'

# Fix: always use set() for an empty set
x = set()
x.add(1) # {1}

Pitfall 2: Set elements must be hashable

{[1, 2], [3, 4]} # TypeError: unhashable type: 'list'
{{1: 2}, {3: 4}} # TypeError: unhashable type: 'dict'

# Fix: use tuples for list-like composite elements, frozenset for set-like ones
{(1, 2), (3, 4)} # OK
{frozenset([1, 2])} # OK

Pitfall 3: Iteration order is not guaranteed - do not rely on it

s = {10, 20, 30}
first = next(iter(s)) # Might be 10, 20, or 30 - implementation-defined

# If you need ordered unique elements, use the unique_ordered() pattern above

Pitfall 4: Mutating a set while iterating raises RuntimeError

s = {1, 2, 3, 4, 5}
for x in s:
if x % 2 == 0:
s.discard(x) # RuntimeError: Set changed size during iteration

# Fix: iterate over a copy
for x in set(s): # or list(s)
if x % 2 == 0:
s.discard(x)

Pitfall 5: frozenset for set-of-sets - sets cannot contain regular sets

groups = {set([1, 2]), set([3, 4])} # TypeError: unhashable type: 'set'
groups = {frozenset([1, 2]), frozenset([3, 4])} # OK

Pitfall 6: Assuming set is faster than list for everything

Sets trade memory for speed. For very small collections (fewer than ~10 elements), list membership may be comparable or even faster because the constant factor overhead of hashing can exceed the cost of a short linear scan. Profile before optimizing tiny inner-loop collections.

Quick Reference Cheatsheet

OperationOperatorMethodComplexityNotes
Create{1, 2, 3}set(iterable)O(n){} is dict!
Empty set-set()O(1)No literal form
Membershipx in s-O(1) avgPrimary advantage
Add element-s.add(x)O(1) amortizedNo-op if present
Remove (strict)-s.remove(x)O(1) avgKeyError if absent
Remove (safe)-s.discard(x)O(1) avgNo error if absent
Pop-s.pop()O(1)Arbitrary element
Updates |= ts.update(t)O(len(t))Mutates s
Unions | ts.union(t)O(n+m)New set
Intersections & ts.intersection(t)O(min(n,m))New set
Differences - ts.difference(t)O(len(s))New set
Sym. diff.s ^ ts.symmetric_difference(t)O(n+m)New set
Subsets <= ts.issubset(t)O(len(s))
Supersets >= ts.issuperset(t)O(len(s))
Disjoint-s.isdisjoint(t)O(min(n,m))No common elements
Frozenset-frozenset(iterable)O(n)Hashable, immutable
Comprehension{expr for x in it}-O(n)Auto-deduplicates
Copy-s.copy()O(n)Shallow copy
Clear-s.clear()O(n)Empties the set

Interview Questions and Answers

Q1: Why is x in set O(1) but x in list is O(n)? Explain mechanically.

A list is a contiguous array of object pointers. To find x, Python must compare x against list[0], list[1], list[2], ... until it finds a match or exhausts the list. In the worst case (element not present, or at the end), this is n comparisons. A set is a hash table. Python computes hash(x) (O(1) for built-in types), uses it to jump directly to a specific slot in the internal array, and checks if that slot holds x. The jump is a constant-time array index operation. Regardless of whether the set has 10 or 10 million elements, the same number of operations are performed. The set's size only affects the hash table's load factor; as long as the load factor is below the threshold (CPython resizes at 2/3), the expected probe count is bounded by a small constant.

Q2: What is the difference between set and frozenset? When would you use each?

set is mutable and unhashable - you can add and remove elements, but you cannot use a set as a dictionary key or as an element of another set. frozenset is immutable and hashable - once created it cannot change, so its hash is stable and it can be used as a dict key or as an element of another set. Use set for local accumulation, filtering, and membership testing where you need to add/remove elements. Use frozenset when you need a set to serve as a dict key (e.g., caching results keyed by a feature set), as an element of another set (set of sets), or when you want to signal immutability as part of an API contract. Both support all the mathematical operations (|, &, -, ^).

Q3: How do sets help with deduplication, and how does the order work?

set(iterable) consumes an iterable and inserts each element into a hash table. When an element is already present (determined by hash and equality), the insertion is a no-op. The result is a collection with exactly one copy of each unique element. Order is NOT preserved - the iteration order of a set reflects internal hash table slot positions, not insertion order. If you need to deduplicate while preserving insertion order, combine a list and a set: iterate the input, check if item not in seen_set, and append to a result list only on first occurrence.

Q4: What is the time complexity of set union, intersection, and difference? Why does intersection have a special advantage?

Union is O(n+m) - every element from both sets must be considered. Difference is O(n) - for each element of the first set, check if it is in the second set (O(1) per check). Intersection can be O(min(n, m)) because CPython's implementation iterates over the smaller set and checks each element against the larger set. This is an asymptotic win when the sets are very different sizes - checking a small set's elements against a large set's hash table is much faster than an all-pairs comparison. For symmetric difference, it is O(n+m) because all elements of both sets must be evaluated.

Q5: What are the hashability requirements for set elements, and why?

Every element in a set must be hashable - it must implement both __hash__() (returning a stable integer) and __eq__(). This is required because sets use a hash table for storage. When you add an element, the hash determines which slot to use. When checking membership, the hash determines which slot to look in. If an element's hash could change (because the element is mutable), then after a mutation, the element would be in the wrong slot and membership testing would fail to find it. This would silently corrupt the set's invariant (every element is findable). CPython lists, dicts, and sets themselves do not implement __hash__ and cannot be set elements. Tuples can be set elements if all their elements are hashable; frozensets can always be set elements.

Q6: What is the {} trap, and what are the consequences of hitting it?

In Python, {} creates an empty dict, not an empty set. This is because curly-brace dict syntax predates set syntax in Python's history. The consequences: code like seen = {}; seen.add(1) raises AttributeError: 'dict' object has no attribute 'add', which is immediately obvious. The more dangerous form is when {} is used in a type annotation or as a default argument where both dict and set might be semantically plausible, leading to subtle bugs. The fix is always to use set() for an empty set. Set literals work for non-empty sets ({1, 2, 3}) because the absence of : distinguishes them from dicts. The asymmetry is unfortunate but immovable - use set() consistently for empty sets and document it in code reviews.

Graded Practice Challenges

Level 1 - Predict the Output

a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}

print(len(a | b))
print(len(a & b))
print(len(a - b))
print(len(a ^ b))

a.discard(3)
a.discard(99) # no error
print(len(a))

c = frozenset([10, 20, 30])
d = {c: "frozen"}
print(d[frozenset([20, 10, 30])])

print({x for x in range(10) if x % 3 == 0})
Show Answer
8
2
3
6
4
frozen
{0, 3, 6, 9}

Explanation:

  • a | b = {1,2,3,4,5,6,7,8} → len = 8
  • a & b = {4,5} → len = 2
  • a - b = {1,2,3} (in a but not b) → len = 3
  • a ^ b = {1,2,3,6,7,8} (not in both) → len = 6
  • After a.discard(3): a = {1,2,4,5}, len = 4; a.discard(99) is a no-op
  • frozenset([20,10,30]) == frozenset([10,20,30]) because sets are order-independent and frozensets compare by element equality → dict lookup succeeds, returns "frozen"
  • Set comprehension: x in range(10) where x % 3 == 0 → {0, 3, 6, 9}

Level 2 - Debug the Code

The following function is supposed to find words that appear in text A but not in text B (the "exclusive vocabulary" of document A). It has three bugs:

def exclusive_vocabulary(text_a: str, text_b: str) -> list:
words_a = set(text_a.split())
words_b = set(text_b.split())

exclusive = words_a & words_b # Bug 1

result = {} # Bug 2
for word in exclusive:
result.add(word.lower())

return sorted(result) # Bug 3 (consequence of Bug 2)

doc_a = "The cat sat on the mat"
doc_b = "The dog sat on the rug"
print(exclusive_vocabulary(doc_a, doc_b))
Show Answer

Bug 1: words_a & words_b is the intersection - words in BOTH. To find words exclusive to A (in A but not B), use words_a - words_b.

Bug 2: result = {} creates an empty dict, not an empty set. The subsequent result.add(word.lower()) raises AttributeError: 'dict' object has no attribute 'add'. Should be result = set().

Bug 3: This is a consequence of Bug 2 - with the correct set(), sorted(result) works fine and sorts the resulting strings alphabetically. It is not independently a bug, but it would produce a wrong result if the logic in Bug 1 were also wrong.

Fixed code:

def exclusive_vocabulary(text_a: str, text_b: str) -> list:
words_a = set(text_a.lower().split()) # normalize case at source
words_b = set(text_b.lower().split())

exclusive = words_a - words_b # Fixed Bug 1: difference, not intersection

result = set() # Fixed Bug 2: set(), not {}
for word in exclusive:
result.add(word)

return sorted(result) # Now works correctly

doc_a = "The cat sat on the mat"
doc_b = "The dog sat on the rug"
print(exclusive_vocabulary(doc_a, doc_b))
# ['cat', 'mat'] - "cat" and "mat" appear in doc_a but not doc_b
# "the", "sat", "on" appear in both, so they are excluded

Level 3 - Design Challenge: 10 Million Line Log Analysis

You are given two nginx access log files. Each file has approximately 10 million lines. Each line begins with an IP address (the first whitespace-separated token). You need to efficiently compute:

  1. IPs that appear in both log files (returning visitors)
  2. IPs that appear only in file A (visitors who left and did not return)
  3. IPs that appear only in file B (new visitors in the second period)
  4. The total count of unique IPs across both files

Constraints:

  • Files are too large to hold fully in memory as a list of all lines (assume 2GB each)
  • Each file fits in memory as a set of IP strings (typical IPv4 addresses are 7-15 bytes; 10M unique IPs take roughly 500MB as a Python set)
  • The solution must complete in under 30 seconds on a modern laptop
  • Design for correctness first, then optimize

Design the complete solution: data structures, algorithm, time and space complexity analysis, and the Python implementation. Then answer: how would you adapt this design if the IP sets themselves did not fit in memory?

Show Answer

Algorithm Design

The key insight is that set operations on 10M-element sets are O(n+m) - linear in the total number of unique elements, not quadratic. The naive approach of nested loops would be O(nm) ≈ 10^14 operations, which is completely infeasible. The set-based approach is O(n+m) ≈ 210^7 operations, completing in seconds.

Complexity Analysis:

  • Loading each file into a set: O(n) and O(m) respectively
  • All four computations (&, -, -, |) are each O(n+m)
  • Total: O(n+m) time, O(n+m) space for the two sets

Python Implementation:

import time
from pathlib import Path


def extract_ips(filepath: str) -> set[str]:
"""
Stream through a log file line by line and extract unique IPs.
Memory usage: O(unique IPs), not O(total lines).
Time: O(n) where n = number of lines.
"""
ips: set[str] = set()
with open(filepath, 'r', buffering=1024 * 1024) as f: # 1MB read buffer
for line in f:
# Nginx combined log: IP is always the first token
# strip() + split() handles CRLF and extra whitespace
if line := line.strip():
ip = line.split(' ', 1)[0] # split once, take first part
ips.add(ip)
return ips


def analyze_logs(file_a: str, file_b: str) -> dict:
"""
Analyze two log files and compute IP set operations.
Returns a dict with all four computed sets and timing information.
"""
t0 = time.perf_counter()

print(f"Loading {file_a}...")
ips_a = extract_ips(file_a)
t1 = time.perf_counter()
print(f" Loaded {len(ips_a):,} unique IPs in {t1-t0:.2f}s")

print(f"Loading {file_b}...")
ips_b = extract_ips(file_b)
t2 = time.perf_counter()
print(f" Loaded {len(ips_b):,} unique IPs in {t2-t1:.2f}s")

# All four set operations - each is O(n+m)
returning = ips_a & ips_b # intersection
only_in_a = ips_a - ips_b # difference: A minus B
only_in_b = ips_b - ips_a # difference: B minus A
total_unique = ips_a | ips_b # union

t3 = time.perf_counter()

return {
"returning_visitors": returning,
"only_in_a": only_in_a,
"only_in_b": only_in_b,
"total_unique": total_unique,
"counts": {
"file_a_unique": len(ips_a),
"file_b_unique": len(ips_b),
"returning": len(returning),
"only_in_a": len(only_in_a),
"only_in_b": len(only_in_b),
"total_unique": len(total_unique),
},
"timing": {
"load_a": t1 - t0,
"load_b": t2 - t1,
"set_ops": t3 - t2,
"total": t3 - t0,
}
}


# --- Simulation for testing without real 10M-line files ---
import random
import tempfile
import os

def generate_test_log(filepath: str, n_lines: int, ip_range: tuple) -> None:
"""Generate a synthetic log file with n_lines entries."""
lo, hi = ip_range
with open(filepath, 'w') as f:
for _ in range(n_lines):
ip = f"192.168.{random.randint(0,255)}.{random.randint(lo, hi)}"
f.write(f"{ip} - - [01/Jan/2024:00:00:00 +0000] \"GET / HTTP/1.1\" 200 1234\n")

# Generate test files (100K lines for demo; scale to 10M for production)
with tempfile.TemporaryDirectory() as tmpdir:
file_a = os.path.join(tmpdir, "access_jan.log")
file_b = os.path.join(tmpdir, "access_feb.log")

generate_test_log(file_a, n_lines=100_000, ip_range=(0, 200))
generate_test_log(file_b, n_lines=100_000, ip_range=(100, 255))

results = analyze_logs(file_a, file_b)

print("\n=== Results ===")
for key, val in results["counts"].items():
print(f" {key}: {val:,}")
print(f"\nTiming: {results['timing']['total']:.2f}s total")
print(f" Load A: {results['timing']['load_a']:.2f}s")
print(f" Load B: {results['timing']['load_b']:.2f}s")
print(f" Set ops: {results['timing']['set_ops']:.4f}s")

When IP sets do not fit in memory (scale beyond RAM):

If even the set of unique IPs does not fit in RAM (e.g., IPv6 addresses, or truly massive scale), use one of these approaches:

  1. Bloom filters (probabilistic): A Bloom filter for 10M elements requires roughly 95MB at 1% false-positive rate. Build a Bloom filter from file A, then query each IP from file B. False positives mean some "only in B" IPs are incorrectly classified as "returning" - acceptable for analytics, not for security decisions.

  2. External sort and merge: Sort each file by IP (using Unix sort which handles files larger than RAM via external merge sort). Then stream both sorted files simultaneously with a two-pointer merge, emitting classification as you go. Time: O(n log n) for sorting + O(n) for merging. Memory: O(1) beyond the sort buffers.

  3. Distributed processing: Use Apache Spark's RDD.intersection(), subtract(), and union() which distribute the work across a cluster. Each partition handles a subset of the IP space (hash partitioning ensures the same IP always goes to the same partition).

  4. HyperLogLog for count only: If you only need the count of unique IPs (not the actual IP addresses), HyperLogLog provides a count estimate with ~2% error using only kilobytes of memory.

Key Takeaways

  • Python sets are hash tables that store only keys - conceptually a dict where every value is a constant that is never stored; this is why elements must be hashable
  • x in set is O(1) average regardless of set size; x in list is O(n) - this is not a minor optimization but a fundamental algorithmic difference that determines system scalability
  • {} creates an empty dict, not an empty set; always use set() for an empty set - this is one of Python's most common traps
  • frozenset is the immutable, hashable counterpart to set; use it as a dict key, as an element of another set, or whenever you need to pass a set of things as a single hashable unit
  • The four mathematical operations - union (|), intersection (&), difference (-), symmetric difference (^) - all run in O(n+m) or better; intersection is O(min(n,m)) because CPython iterates the smaller set
  • In-place variants (|=, &=, -=, ^=) mutate the set and avoid allocating a new set object - prefer them in tight loops
  • Set-based algorithms transform O(n*m) nested-loop problems into O(n+m) problems - this is the set's highest engineering value
  • Iteration order is unordered and implementation-defined - never rely on sets iterating in any particular sequence; if you need ordered unique elements, combine a list (for order) with a set (for O(1) deduplication)
  • Elements must be hashable: no lists, dicts, or sets as elements; use tuples for list-like composite elements and frozensets for set-like composite elements
  • For 10M-line log analysis, set operations are the correct tool: load each file's IPs into a set (O(n) each), then compute intersection, differences, and union - all O(n+m), completing in seconds rather than hours
© 2026 EngineersOfAI. All rights reserved.