Data Structure Selection Strategy - Engineering the Right Choice
Reading time: ~22 minutes | Level: Foundation → Engineering
Here is code that works correctly but will destroy production at scale.
def find_duplicates(records: list) -> list:
"""Find all records that appear more than once."""
duplicates = []
for i, record in enumerate(records):
if record in records[i+1:]: # O(n) membership check!
if record not in duplicates: # O(n) membership check!
duplicates.append(record)
return duplicates
# Test
data = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]
print(find_duplicates(data)) # ['alpha', 'beta'] ← correct output
The output is correct. The code is a disaster.
Three nested O(n) operations inside a loop creates O(n³) behavior. On a list of 10,000 records, that is 10³ × 10³ × 10³ = 1 trillion operations. On a list of 1,000,000 records it would take years.
The fix is not a better algorithm - it is choosing the right data structure:
def find_duplicates_fast(records: list) -> list:
"""Find all records that appear more than once. O(n) time, O(n) space."""
from collections import Counter
counts = Counter(records)
return [record for record, count in counts.items() if count > 1]
data = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]
print(find_duplicates_fast(data)) # ['alpha', 'beta']
Same output. O(n) time. The only change was knowing which data structure to reach for.
This lesson teaches you the framework for making that decision correctly - every time.
What You Will Learn
- The engineering decision framework: operations × frequency → complexity budget
- A complete ASCII decision flowchart for choosing between Python's data structures
- How to read a problem statement and immediately map it to the right structure
- Five real-world case studies with analysis of wrong vs correct choices
- Space-time tradeoffs: when buying memory to save time is the right engineering call
- How to state a data structure choice in an interview - structure, justify, analyze
- The most common anti-patterns - and exactly why they fail at scale
- How the lessons in this module connect: list internals → dict internals → sets → deque → Counter → final synthesis
Prerequisites
- All prior lessons in Module 4 (Data Structures - Pythonic):
- Lesson 01: Lists Internals
- Lesson 06: Dictionaries Deep Dive
- Lesson 08: Sets and Frozensets
- Lesson 10: Tuples and Memory
- Lesson 13: Deque and Collections
- Lesson 14: Counter and Defaultdict
- Basic Big-O notation: O(1), O(log n), O(n), O(n log n), O(n²)
- Python
heapqmodule (introduced in this lesson)
Part 1 - The Decision Framework
Operations × Frequency → Complexity Budget
Every data structure decision starts with a single question: what operations will I perform most frequently?
The cost of your program is dominated by:
Total cost ≈ Σ (operation_frequency × operation_cost)
If you perform an O(n) operation once, that is fine. If you perform it n times in a loop, you have O(n²). The data structure must match the dominant operation pattern.
The Four Questions to Ask First
Before writing a single line of code, answer these:
1. OPERATIONS: What will I do to the data most often?
Read? Write? Search? Add at front? Prioritize?
2. FREQUENCY: How many times? Once? Per-request? Per-element in a loop?
3. SIZE: How large can the data grow?
100 elements? 1 million? Unbounded stream?
4. MUTABILITY: Does the data change after creation?
Immutable → tuple/namedtuple
Mutable → list/dict/set/deque
Part 2 - The Decision Flowchart
Read this top-to-bottom. Stop at the first matching node.
Special-purpose structures:
| Use case | Structure |
|---|---|
| Priority / min / max tracking | heapq (min-heap) |
| Counting frequencies | Counter |
| Grouping (one-to-many) | defaultdict(list) |
| Sliding window (last k elements) | deque(maxlen=k) |
| Layered config / scopes | ChainMap |
| LRU cache / order-sensitive | OrderedDict |
Part 3 - Structure-to-Use-Case Reference
Complete Complexity Matrix
| Structure | Access | Insert | Delete | Search / Membership |
|---|---|---|---|---|
list | O(1) by index | O(1)* end / O(n) front | O(n) | O(n) - linear scan; use set for repeated checks |
tuple | O(1) by index | N/A (immutable) | N/A (immutable) | O(n) - linear scan |
dict | O(1)* by key | O(1)* amortized | O(1)* amortized | O(1)* key lookup via hash; O(n) value search |
set | N/A | O(1)* amortized | O(1)* amortized | O(1)* hash-based - the whole point of set |
deque | O(n) by index | O(1) both ends | O(1) both ends | O(n) - not designed for search |
heapq (list) | O(1) min only | O(log n) | O(log n) for min | O(n) - use dict alongside for lookups |
Counter | O(1)* by key | O(1)* | O(1)* | O(1)* - same as dict; missing key returns 0 |
defaultdict | O(1)* by key | O(1)* | O(1)* | O(1)* - same as dict; missing key triggers factory |
OrderedDict | O(1)* by key | O(1)* | O(1)* | O(1)* - hash + order tracking; move_to_end() is O(1) |
ChainMap | O(k)* k=layers | O(1)* to first | O(1)* first map | O(k)* - k = number of maps; walks each map in order |
* Average case. Amortized where noted.
Quick Selection Guide
# ORDERED SEQUENCE: use list or deque
ordered_data = [1, 2, 3, 4, 5] # Random access → list
front_heavy = deque([1, 2, 3]) # Front ops → deque
# KEY-VALUE LOOKUP: use dict
user_profile = {'id': 42, 'name': 'Alice', 'role': 'admin'}
# MEMBERSHIP TEST AT SCALE: use set
allowed_ips = {'192.168.1.1', '10.0.0.1', '172.16.0.0'}
if request_ip in allowed_ips: # O(1), not O(n)
pass
# IMMUTABLE RECORD: use tuple or NamedTuple
from typing import NamedTuple
class Config(NamedTuple):
host: str
port: int = 5432
# PRIORITY / MIN TRACKING: use heapq
import heapq
tasks = []
heapq.heappush(tasks, (1, "high priority"))
heapq.heappush(tasks, (5, "low priority"))
next_task = heapq.heappop(tasks) # Always returns min: (1, "high priority")
# FREQUENCY COUNTING: use Counter
from collections import Counter
freq = Counter(["a", "b", "a", "c", "a", "b"])
print(freq.most_common(2)) # [('a', 3), ('b', 2)]
# GROUPING (one-to-many): use defaultdict(list)
from collections import defaultdict
groups = defaultdict(list)
for k, v in [("x", 1), ("y", 2), ("x", 3)]:
groups[k].append(v)
# SLIDING WINDOW: use deque(maxlen=k)
window = deque(maxlen=3)
for val in [1, 2, 3, 4, 5]:
window.append(val)
print(list(window)) # Always last 3 elements
Part 4 - Five Real-World Case Studies
Case Study 1: Find All Duplicates in a List
Problem: Given a list of n elements, find all elements that appear more than once. Preserve the order of first occurrence in the output.
Analysis of Approaches:
import time
data = list(range(10_000)) + list(range(5_000)) # 5000 duplicates
import random; random.shuffle(data)
# Approach 1: Nested loops - O(n²)
def find_dups_nested(lst):
result = []
for i, item in enumerate(lst):
if item in lst[i+1:]: # O(n) slice + O(n) search = O(n)
if item not in result: # O(n) search
result.append(item)
return result
# n=15000 → n² ≈ 225 million comparisons - takes seconds
# Approach 2: Sort + compare neighbors - O(n log n)
def find_dups_sort(lst):
sorted_lst = sorted(lst) # O(n log n)
dups = set()
for i in range(1, len(sorted_lst)):
if sorted_lst[i] == sorted_lst[i-1]:
dups.add(sorted_lst[i])
# But: this loses original order information
return [x for x in lst if x in dups] # O(n) pass to restore order
# Approach 3: Counter - O(n)
from collections import Counter
def find_dups_counter(lst):
counts = Counter(lst) # O(n) - one pass
seen_dups = {k for k, v in counts.items() if v > 1} # O(k) where k=unique
return [x for x in lst if x in seen_dups][:len(seen_dups)] # first occurrences
# Better: preserve insertion order without a second pass
def find_dups_optimal(lst):
"""O(n) time, O(n) space. Preserves first-occurrence order."""
counts = Counter(lst) # O(n) - one pass for all counts
seen = set()
result = []
for item in lst: # O(n) - one pass to collect in order
if counts[item] > 1 and item not in seen:
result.append(item)
seen.add(item)
return result
print(find_dups_optimal(data)[:5]) # First 5 duplicates (in order)
Complexity comparison:
| Approach | Time | Space | Notes |
|---|---|---|---|
| Nested loops | O(n²) | O(n) | Fails at scale |
| Sort | O(n log n) | O(n) | Loses order without extra pass |
| Counter | O(n) | O(n) | Optimal - one pass for counts |
Key insight: The bottleneck was the item in list membership check inside a loop - O(n) per check × n items = O(n²). Converting to Counter (a hash-based structure) reduces each check to O(1).
Case Study 2: Top 5 Most Frequent Words
Problem: Given a document, find the 5 most frequently occurring words. Exclude common stop words.
from collections import Counter
import re
def top_5_words_wrong(text: str, stop_words: set) -> list:
"""
Wrong approach: sort the full word list every time.
O(n log n) when we only need top 5.
"""
words = [w for w in re.findall(r'\b\w+\b', text.lower()) if w not in stop_words]
word_list = list(set(words)) # Deduplicate
word_list.sort(key=lambda w: words.count(w), reverse=True) # O(n²)! count is O(n) each call
return word_list[:5]
def top_5_words_right(text: str, stop_words: set) -> list:
"""
Correct approach: Counter + most_common.
O(n) counting + O(k + 5 log k) for top-5 via heapq.
"""
words = re.findall(r'\b\w+\b', text.lower())
filtered = (w for w in words if w not in stop_words) # generator - no intermediate list
counts = Counter(filtered) # O(n) single pass
return counts.most_common(5) # O(k + 5 log k) - heapq.nlargest internally
# Demo
text = """
Python is a versatile programming language. Python supports object-oriented
programming, functional programming, and procedural programming. Python has
a large standard library that makes Python suitable for many programming tasks.
Engineers who learn Python find Python makes them more productive.
"""
stop_words = {'a', 'an', 'the', 'is', 'and', 'that', 'for', 'who', 'them',
'more', 'find', 'makes', 'has', 'many', 'large', 'also'}
print(top_5_words_right(text, stop_words))
# [('python', 6), ('programming', 5), ('language', 1), ('versatile', 1), ('supports', 1)]
Why most_common(5) beats sorted(counter.items()):
n = total words, k = unique words
sorted(counter.items())[:5]:
- Sort all k unique words: O(k log k)
- Slice: O(1)
Total: O(k log k)
counter.most_common(5):
- heapq.nlargest(5, counter.items()): O(k + 5 log k) = O(k)
Total: O(k) - linear in unique words
For k = 100,000 unique words and n = 5:
sort: 100,000 × 17 ≈ 1.7 million operations
heapq: 100,000 + 5×17 ≈ 100,085 operations
Speedup: ~17x
Case Study 3: Implement an LRU Cache
Problem: Implement a Least Recently Used cache with O(1) get and O(1) put.
An LRU cache evicts the least recently used item when the cache is full. It requires:
- O(1) lookup by key
- O(1) move-to-front on access (mark as most recent)
- O(1) evict the oldest item
from collections import OrderedDict
from typing import Optional
class LRUCache:
"""
O(1) get and put.
Internal structure: OrderedDict maintains insertion order.
move_to_end() makes the accessed key "most recent".
popitem(last=False) removes the "least recent" key.
Memory: O(capacity) - bounded by max size.
"""
def __init__(self, capacity: int):
if capacity <= 0:
raise ValueError(f"Capacity must be positive, got {capacity}")
self.capacity = capacity
self._cache = OrderedDict() # key → value, ordered by recency
def get(self, key) -> Optional[object]:
if key not in self._cache:
return None
self._cache.move_to_end(key) # O(1) - mark as most recently used
return self._cache[key]
def put(self, key, value) -> None:
if key in self._cache:
self._cache.move_to_end(key) # Update recency
self._cache[key] = value
if len(self._cache) > self.capacity:
self._cache.popitem(last=False) # O(1) - evict least recently used
def __repr__(self):
items = list(self._cache.items())
return f"LRUCache(capacity={self.capacity}, items={items})"
# Demo - trace the cache state
cache = LRUCache(capacity=3)
cache.put("user:1", {"name": "Alice", "score": 100})
cache.put("user:2", {"name": "Bob", "score": 85})
cache.put("user:3", {"name": "Carol", "score": 92})
print(cache)
# LRUCache(capacity=3, items=[('user:1',...), ('user:2',...), ('user:3',...)])
cache.get("user:1") # Access user:1 - moves to end (most recent)
cache.put("user:4", {"name": "Dave", "score": 78}) # user:2 evicted (least recent)
print(cache.get("user:2")) # None - evicted
print(cache.get("user:1")) # {'name': 'Alice', 'score': 100} - still cached
Why not dict + list?
Alternative: dict for O(1) lookup + list for order tracking
- get: dict lookup O(1) + list.remove O(n) + list.append O(1) = O(n) total
- put: same O(n)
Fails: list.remove is O(n) - must scan to find the element
OrderedDict solution:
- get: O(1) dict lookup + O(1) move_to_end = O(1) total
- put: O(1) insert + O(1) move_to_end + O(1) popitem = O(1) total
Wins: move_to_end is O(1) - backed by a doubly-linked list of entries
Case Study 4: Find Shortest Path (Dijkstra's Algorithm)
Problem: Find the shortest path between nodes in a weighted graph.
This is the canonical case study for combining multiple data structures - each chosen for its specific role.
import heapq
from collections import defaultdict
from typing import Optional
def dijkstra(graph: dict, start: str, end: str) -> tuple[Optional[float], list]:
"""
Dijkstra's shortest path algorithm.
Data structures chosen:
- defaultdict(list): adjacency list - O(1) neighbor lookup per node
- heapq: min-heap priority queue - O(log V) push/pop for smallest distance
- dict (distances): O(1) distance lookup by node name
- set (visited): O(1) membership check to skip processed nodes
Time: O((V + E) log V) where V = vertices, E = edges
Space: O(V + E) for graph + O(V) for distances/visited
"""
# Min-heap: (distance, node, path_so_far)
heap = [(0, start, [start])]
# Best known distance to each node
distances = {start: 0}
# Nodes fully processed (their shortest path is finalized)
visited = set()
while heap:
dist, node, path = heapq.heappop(heap) # O(log V)
if node in visited:
continue # Already processed with a shorter path
visited.add(node)
if node == end:
return dist, path
for neighbor, weight in graph[node]:
if neighbor not in visited:
new_dist = dist + weight
if new_dist < distances.get(neighbor, float('inf')):
distances[neighbor] = new_dist
heapq.heappush(heap, (new_dist, neighbor, path + [neighbor]))
return None, [] # No path exists
# Build graph using defaultdict
graph = defaultdict(list)
edges = [
("A", "B", 4), ("A", "C", 2),
("B", "D", 3), ("B", "C", 1),
("C", "D", 5), ("C", "E", 8),
("D", "E", 2), ("D", "F", 6),
("E", "F", 1),
]
for u, v, w in edges:
graph[u].append((v, w))
graph[v].append((u, w)) # Undirected
distance, path = dijkstra(graph, "A", "F")
print(f"Shortest distance A → F: {distance}") # 10
print(f"Path: {' → '.join(path)}") # A → C → B → D → E → F
Why each structure was chosen:
defaultdict(list) for graph:
→ Adding edges: graph[u].append((v, w)) - no KeyError for first edge
→ Neighbor lookup: graph[node] - O(1)
→ Alternative (dict) would need: if u not in graph: graph[u] = []
heapq for priority queue:
→ Always extract the node with MINIMUM current distance
→ heappop: O(log V) - far better than scanning all V nodes O(V)
→ Alternative (sorted list): re-sorting after each insert is O(V log V) per step
dict for distances:
→ O(1) lookup and update: distances[neighbor] = new_dist
→ Alternative (list): only works if nodes are integers 0..V
set for visited:
→ O(1) membership check: if neighbor not in visited
→ Alternative (list): O(V) per check → O(V²) total = catastrophic
Case Study 5: Merge User Preferences from Multiple Configuration Sources
Problem: A web application has three sources of configuration, in priority order: (1) per-request overrides, (2) user preferences from DB, (3) global defaults. Implement efficient lookup that respects priority and stays live (changes to user prefs are immediately visible).
from collections import ChainMap
class UserPreferences:
"""
Three-layer configuration with live updates.
Uses ChainMap - no copying, O(k) lookup where k = number of layers (3).
Priority: request_overrides > user_settings > global_defaults
"""
GLOBAL_DEFAULTS = {
'theme': 'light',
'language': 'en',
'notifications': True,
'items_per_page': 20,
'timezone': 'UTC',
'currency': 'USD',
}
def __init__(self, user_id: int, user_settings: dict):
self._user_settings = user_settings # Live dict from DB
self._request_overrides = {} # Cleared per request
self._config = ChainMap(
self._request_overrides,
self._user_settings,
self.GLOBAL_DEFAULTS
)
def get(self, key: str):
return self._config.get(key)
def set_request_override(self, key: str, value):
"""Override for this request only. Clears itself at request end."""
self._request_overrides[key] = value
def clear_request_overrides(self):
self._request_overrides.clear()
def update_user_setting(self, key: str, value):
"""Persistent change - updates live dict, visible via ChainMap immediately."""
self._user_settings[key] = value
def effective_config(self) -> dict:
"""Full resolved config - for debugging/logging."""
return dict(self._config)
# Demo
user_db_settings = {'theme': 'dark', 'language': 'fr', 'items_per_page': 50}
prefs = UserPreferences(user_id=42, user_settings=user_db_settings)
print(prefs.get('theme')) # 'dark' - from user settings
print(prefs.get('timezone')) # 'UTC' - from global defaults
print(prefs.get('currency')) # 'USD' - from global defaults
# Admin temporarily overrides for A/B test - request scope only
prefs.set_request_override('items_per_page', 100)
print(prefs.get('items_per_page')) # 100 - request override wins
prefs.clear_request_overrides()
print(prefs.get('items_per_page')) # 50 - back to user setting
# User updates their preference - immediately visible (no cache invalidation needed)
prefs.update_user_setting('notifications', False)
print(prefs.get('notifications')) # False - user setting now shadows default
Why ChainMap beats dict merging:
dict merging ({**defaults, **user, **overrides}):
- Creates a copy at merge time - O(n) where n = total keys
- Updating user_settings does NOT update the merged dict
- Must re-merge on every update - O(n) per update
- request_overrides must be re-merged for every request
ChainMap:
- No copy - just stores references to the three dicts: O(1) setup
- Updates to user_settings or request_overrides visible immediately
- Lookup: O(k) where k = 3 (number of layers) - effectively O(1)
- Adding/removing a layer: O(1) via new_child() or parents
Part 5 - Space vs Time Tradeoffs
The Fundamental Engineering Tradeoff
Memory and computation time exist in tension. Choosing a data structure is often choosing where you sit on this curve:
When to Buy Memory for Speed
# SCENARIO: Check if a document has any profanity word
# Called 10 million times per day
# Naive: linear scan each call
banned_words_list = ["spam", "scam", "fraud", ...] # 10,000 words
def is_clean_slow(text: str) -> bool:
words = text.lower().split()
for word in words:
if word in banned_words_list: # O(n) - scans 10,000 words each time!
return False
return True
# Cost: O(W × B) where W = words in text, B = banned list size
# Better: convert to set once, use O(1) lookup
banned_words_set = set(banned_words_list) # O(B) one-time cost, ~400KB memory
def is_clean_fast(text: str) -> bool:
words = text.lower().split()
return not any(word in banned_words_set for word in words) # O(W)
# Cost: O(W) - B no longer matters
# Extra memory used: set overhead ≈ 400KB for 10,000 words
# Time saved: 10M calls × (10,000 - 1) operations = 100 BILLION operations saved per day
# Decision: obvious - spend 400KB to save trillions of operations
Memoization: Trading Space for Repeated Computation
from functools import lru_cache
# Fibonacci without memoization: O(2ⁿ) exponential time
def fib_slow(n):
if n <= 1: return n
return fib_slow(n-1) + fib_slow(n-2)
# fib_slow(40) calls fib(1) about 100 million times
# With lru_cache (memoization dict internally)
@lru_cache(maxsize=None)
def fib_fast(n):
if n <= 1: return n
return fib_fast(n-1) + fib_fast(n-2)
# fib_fast(40) calls each sub-problem ONCE: O(n) time, O(n) space
# Manual memoization for full control
def fibonacci(n: int, memo: dict = None) -> int:
if memo is None:
memo = {}
if n in memo:
return memo[n]
if n <= 1:
return n
result = fibonacci(n-1, memo) + fibonacci(n-2, memo)
memo[n] = result # Store result - dict O(1) write
return result
When NOT to Trade Space for Time
# Bad tradeoff: caching everything when memory is constrained
# and data is queried only once
# This caches ALL results in memory - wasteful if:
# 1. Queries are unique (no repeated keys)
# 2. Data is too large to fit in RAM
# 3. Cache invalidation is complex
all_results_cache = {} # Grows unbounded
def get_report(date: str) -> dict:
if date not in all_results_cache:
all_results_cache[date] = compute_expensive_report(date)
return all_results_cache[date]
# Fix: use LRU cache with bounded size
from functools import lru_cache
@lru_cache(maxsize=128) # Keeps only last 128 results
def get_report_bounded(date: str) -> dict:
return compute_expensive_report(date)
:::tip The Tradeoff Rule Buy memory for speed when: (1) the operation is called frequently (n > 1000 times), (2) the memory cost is bounded (set/dict grows proportional to unique elements, not total calls), (3) the time savings dominate the memory cost. Don't buy memory for speed when: data is queried once, the dataset doesn't fit in RAM, or cache invalidation complexity exceeds the savings. :::
Part 6 - The Anti-Patterns
Anti-Pattern 1: list.index() When You Need O(1) Lookup
# WRONG: O(n) lookup, called n times → O(n²)
user_ids = [101, 202, 303, 404, 505]
incoming_ids = [303, 101, 999, 404]
for uid in incoming_ids:
try:
pos = user_ids.index(uid) # O(n) scan!
print(f"Found {uid} at position {pos}")
except ValueError:
print(f"{uid} not found")
# RIGHT: dict for O(1) position lookup
uid_to_position = {uid: i for i, uid in enumerate(user_ids)} # O(n) once
for uid in incoming_ids:
if uid in uid_to_position: # O(1)
print(f"Found {uid} at position {uid_to_position[uid]}")
else:
print(f"{uid} not found")
Anti-Pattern 2: list for Membership When set Works
# WRONG: O(n) membership check in a loop → O(n²)
processed = []
for item in large_dataset:
if item not in processed: # O(n) scan!
process(item)
processed.append(item)
# RIGHT: O(1) membership check with set
processed = set()
for item in large_dataset:
if item not in processed: # O(1)
process(item)
processed.add(item)
# Note: set doesn't preserve order.
# If you need insertion order AND O(1) membership:
processed_set = set()
processed_ordered = []
for item in large_dataset:
if item not in processed_set:
process(item)
processed_set.add(item)
processed_ordered.append(item)
Anti-Pattern 3: list When Deque Is Needed for Front Operations
# WRONG: O(n²) total - pop(0) is O(n) each time
def process_queue_wrong(items: list) -> list:
queue = items.copy()
results = []
while queue:
item = queue.pop(0) # O(n) - shifts all elements left!
results.append(item * 2)
return results
# RIGHT: O(n) total - popleft() is O(1)
from collections import deque
def process_queue_right(items: list) -> list:
queue = deque(items)
results = []
while queue:
item = queue.popleft() # O(1)
results.append(item * 2)
return results
# Benchmark
import time
data = list(range(100_000))
start = time.time()
process_queue_wrong(data)
print(f"list.pop(0): {time.time()-start:.3f}s") # ~2.5s
start = time.time()
process_queue_right(data)
print(f"deque.popleft(): {time.time()-start:.4f}s") # ~0.008s
Anti-Pattern 4: Tuple When List Is Needed
# WRONG: tuples are immutable - cannot append in a loop
def build_report_wrong(events):
report = ()
for event in events:
report = report + (event,) # Creates a new tuple each time - O(n²) total!
return report
# RIGHT: use list for accumulation, convert to tuple at the end if needed
def build_report_right(events):
report = []
for event in events:
report.append(event) # O(1) amortized
return tuple(report) # O(n) conversion at the end
Anti-Pattern 5: Sorting When Only Top-K Is Needed
from collections import Counter
import heapq
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9, 2]
# WRONG: O(n log n) full sort just to get top 3
top3_wrong = sorted(set(data), reverse=True)[:3] # Sort all k unique
# RIGHT: O(n + k log n) heapq.nlargest - much faster for large n, small k
top3_right = heapq.nlargest(3, set(data))
# For frequency: Counter.most_common(k) uses heapq internally
freq = Counter(data)
top3_frequent = freq.most_common(3) # O(n) count + O(k + 3 log k) = O(n)
print(top3_frequent) # [(5, 3), (9, 2), (1, 2)]
Part 7 - Interview Framework: State, Justify, Analyze
In technical interviews, how you articulate your data structure choice matters as much as the choice itself. Use this three-part framework:
The SJA Framework
S - STATE the data structure you will use
J - JUSTIFY why (which operation it optimizes)
A - ANALYZE the complexity (time and space)
Example Interview Exchange
Q: "Given a list of n integers, find all pairs that sum to a target T."
Poor answer: "I'd use a set, then iterate."
Strong answer using SJA:
State: I'll use a
setas a lookup table alongside the iteration.Justify: For each element
x, I need to check ifT - xexists in the data. Using alist, that check is O(n) per element - O(n²) total. Asetstores elements in a hash table, giving O(1) average-case membership testing.Analyze: Time: O(n) to build the set, O(n) for the loop with O(1) checks per element → O(n) total. Space: O(n) for the set. This is optimal - we must look at each element at least once.
def find_pairs(nums: list, target: int) -> list:
"""O(n) time, O(n) space. Set for O(1) complement lookup."""
seen = set()
pairs = []
for x in nums:
complement = target - x
if complement in seen: # O(1) - hash lookup
pairs.append((complement, x))
seen.add(x)
return pairs
print(find_pairs([2, 7, 11, 15, 4, 3], target=9))
# [(2, 7), (4, 5)]... depends on input - but always O(n)
Anti-Patterns in Interviews
WRONG: "I'll use a list and check if each element is in it."
→ Signals you don't know set membership is O(1)
WRONG: "I'll sort it first." (when you don't need the order)
→ Unnecessary O(n log n) when O(n) is possible
WRONG: "I'll use a dict." (for every problem)
→ No justification - sounds like guessing
WRONG: Choosing the structure, then justifying backward
→ Should be: analyze operations first, then choose
Part 8 - Module Synthesis: How the Lessons Connect
This lesson is the capstone of Module 4. Each previous lesson taught one structure deeply. Now you see how they fit together:
Module 4 - Data Structures Pythonic: The Full Picture
Lesson 01 - Lists Internals
Taught: Dynamic array, pointer indirection, O(1) append amortized
Use when: Ordered sequence, random access, general-purpose accumulation
Lesson 02 - List Comprehensions and Generators
Taught: Lazy evaluation, memory efficiency for large sequences
Use when: Building filtered/transformed collections without memory spikes
Lesson 06 - Dictionaries Deep Dive
Taught: Hash tables, O(1) lookup, collision handling, insertion-order
Use when: Key → value mapping, caching, counting (see Counter)
Lesson 08 - Sets and Frozensets
Taught: Hash-based membership, mathematical set operations
Use when: Deduplication, membership testing, set algebra
Lesson 10 - Tuples and Memory
Taught: Immutability, hashability, lower memory vs list
Use when: Fixed records, dict keys, return values, namedtuple base
Lesson 11 - Strings and Unicode
Taught: Immutable string internals, encoding
Applies: str is hashable - valid Counter/dict key, set element
Lesson 13 - Deque and Collections
Taught: O(1) both-ends ops, maxlen rotating buffer, ChainMap, namedtuple
Use when: FIFO/BFS, front insertions, sliding window, config layers
Lesson 14 - Counter and Defaultdict
Taught: Frequency counting, default factory, KeyError elimination
Use when: Counting, grouping, top-k, multiset operations
Lesson 15 (THIS) - Selection Strategy
Teaches: The decision framework that unifies all of the above
The Unified Decision Table
# Cheat-sheet: problem statement keywords → data structure
KEYWORD_TO_STRUCTURE = {
"frequency / count occurrences": "Counter",
"most common / top k": "Counter.most_common(k)",
"group by / one-to-many": "defaultdict(list)",
"membership / contains / seen before": "set",
"deduplication": "set",
"ordered sequence / random access": "list",
"front insertion / BFS queue / FIFO": "deque",
"sliding window / last N items": "deque(maxlen=k)",
"immutable record / tuple with names": "typing.NamedTuple",
"priority / always get minimum": "heapq",
"shortest path": "heapq + dict + defaultdict",
"LRU cache / move to most recent": "OrderedDict",
"layered config / scoped lookup": "ChainMap",
"key-value with defaults": "defaultdict",
"fixed config / read-only mapping": "dict (plain)",
"hashable record / dict key": "tuple or frozenset",
}
Interview Questions
Q1: Someone shows you code using list.index() inside a loop over n elements. What is the time complexity, and how do you fix it?
Answer: list.index(x) performs a linear scan - O(n). Called n times inside a loop makes the total complexity O(n²). The fix depends on what you need: if you need existence checking, convert the list to a set (O(n) to build, O(1) per check → O(n) total). If you need the position by value, build a dict mapping {value: index} from the list - O(n) to build, O(1) lookup → O(n) total. The key insight: whenever you call a linear-scan operation inside a loop, you have an implicit O(n²) that needs to be refactored using a hash-based structure.
Q2: You need to find the k most frequent elements in a list of n elements. What data structure and algorithm would you use? What is the complexity?
Answer: Use Counter to count frequencies - O(n) time, O(k) space where k is the number of unique elements. Then use Counter.most_common(k) which internally calls heapq.nlargest(k, counter.items()) - O(unique_k + k log unique_k) time. Total: O(n + unique_k). For k much smaller than unique elements, this is significantly faster than full sorting O(unique_k log unique_k). Alternative using only heapq: heapq.nlargest(k, counts.items(), key=lambda x: x[1]) directly - same complexity.
Q3: When is it correct to use a list for something you'd normally use a set for?
Answer: When any of these apply: (1) You need to preserve insertion order and cannot lose duplicates - set doesn't preserve order and eliminates duplicates. (2) The data is unhashable (contains lists, dicts, other mutable objects) - sets require hashable elements. (3) The dataset is very small (< 20 elements) - the constant overhead of hash table vs linear scan is negligible at small scales, and a list may be simpler. (4) You need multiple occurrences of the same element (multiset) - use Counter instead of set. (5) You need positional access data[i] - sets don't support indexing.
Q4: How would you choose between heapq and sorted() for a problem that requires repeatedly getting the minimum element as new items are added?
Answer: Use heapq. sorted() re-sorts the entire list - O(n log n) per sort. Each time you add a new item and want the new minimum, you'd re-sort - O(n log n) per insertion. With heapq: heappush is O(log n) per insertion, heappop (get minimum) is O(log n). For n insertions + n extractions, heapq costs O(n log n) total vs sorted() costs O(n² log n). Use sorted() only when you need the full sorted order once and the data is static. Use heapq when you repeatedly insert and extract minimum - i.e., an online or streaming setting.
Q5: Design a data structure to answer "how many distinct values appeared in the last k events" efficiently.
Answer: Use deque(maxlen=k) for the sliding window and a Counter for frequency tracking. When an element enters the window: counter[x] += 1. When an element leaves (evicted from deque): counter[x] -= 1; if counter[x] == 0: del counter[x]. The number of distinct values is len(counter). Each operation is O(1) amortized. Space: O(k) for the window.
from collections import deque, Counter
class DistinctInWindow:
def __init__(self, k: int):
self.window = deque(maxlen=k)
self.counts = Counter()
def add(self, x) -> int:
if len(self.window) == self.window.maxlen:
evicted = self.window[0] # About to be evicted
self.counts[evicted] -= 1
if self.counts[evicted] == 0:
del self.counts[evicted]
self.window.append(x)
self.counts[x] += 1
return len(self.counts) # Distinct count
tracker = DistinctInWindow(k=3)
for x in [1, 2, 1, 3, 4, 1]:
print(f"Add {x}: {tracker.add(x)} distinct in last 3")
# Add 1: 1 (window: [1])
# Add 2: 2 (window: [1,2])
# Add 1: 2 (window: [1,2,1])
# Add 3: 3 (window: [2,1,3])
# Add 4: 3 (window: [1,3,4])
# Add 1: 3 (window: [3,4,1])
Q6: In an interview, a candidate says "I'll use a dict for this." What follow-up questions should reveal whether they truly understand the choice?
Answer: Strong follow-ups: (1) "What is the expected size of the dict, and how many unique keys?" - tests awareness that hash tables have O(n) worst-case when many collisions occur. (2) "Do you need the keys in insertion order? Python 3.7+ guarantees it, but does the algorithm rely on it?" - tests awareness of dict ordering. (3) "What happens if the same key is inserted twice - what should the dict store?" - tests whether they need defaultdict, Counter, or a custom merge strategy. (4) "Is the dict shared across threads?" - tests awareness that dict operations are not atomic (GIL nuances). (5) "What is your eviction or cleanup strategy?" - unbounded dicts are a memory leak; tests whether they need an LRU cache or TTL mechanism.
Practice Challenges
Level 1 - Choose the Structure
For each scenario, name the correct Python data structure and justify in one sentence:
- You receive a stream of IP addresses. After every 1000 IPs, report which IP appeared most often.
- You are building a web crawler. Track which URLs you have already visited.
- Store a user's
(first_name, last_name, age)for an immutable audit log entry. - Implement a print job queue where the highest-priority job always prints next.
- Track the last 30 HTTP response times for a latency dashboard.
Solution
-
Counter- frequency counting in O(n),most_common(1)in O(k) using heapq. -
set- O(1) membership test (url in visited), O(1) insert. A list would be O(n) per check → O(n²) total for n URLs. -
typing.NamedTuple- immutable (audit integrity), hashable, named attributes for clarity, memory-efficient (no__dict__). -
heapq-heappush(queue, (priority, job))O(log n),heappop(queue)O(log n), always returns minimum-priority (highest priority) job. A sorted list would be O(n) per insert. -
deque(maxlen=30)- O(1) append; automatically evicts the oldest when full;sum(window) / len(window)for average in O(30) = O(1). No manual index management needed.
Level 2 - Refactor the O(n²) Code
The following function ships to production and works for small inputs. Find the O(n²) bottleneck and rewrite it to O(n).
def process_events(events: list) -> dict:
"""
Given a list of (user_id, event_type) tuples,
return for each user_id: count of each event_type they triggered.
"""
result = {}
for user_id, event_type in events:
if user_id not in result:
result[user_id] = {}
if event_type not in result[user_id]:
result[user_id][event_type] = 0
result[user_id][event_type] += 1
return result
# This is actually O(n) - but can we make it cleaner AND add a method
# to get the top event type per user?
Solution
The original is already O(n) in time complexity, but the code is unnecessarily verbose and the nested dict initialization pattern is a defaultdict use case. Refactoring for clarity and adding the top-event feature:
from collections import defaultdict, Counter
def process_events_clean(events: list) -> defaultdict:
"""
O(n) time, O(u × e) space where u = unique users, e = unique event types.
Uses defaultdict(Counter) - inner Counter handles missing event_types.
"""
result = defaultdict(Counter)
for user_id, event_type in events:
result[user_id][event_type] += 1 # defaultdict handles outer missing key
# Counter handles inner missing event_type
return result
def top_event_per_user(events: list) -> dict:
"""For each user, return their most frequent event type."""
aggregated = process_events_clean(events)
return {
user_id: counter.most_common(1)[0] # (event_type, count) tuple
for user_id, counter in aggregated.items()
}
sample = [
(1, "login"), (2, "click"), (1, "purchase"), (1, "login"),
(2, "login"), (3, "click"), (2, "click"), (1, "login"),
]
print(dict(process_events_clean(sample)))
# {1: Counter({'login': 3, 'purchase': 1}),
# 2: Counter({'click': 2, 'login': 1}),
# 3: Counter({'click': 1})}
print(top_event_per_user(sample))
# {1: ('login', 3), 2: ('click', 2), 3: ('click', 1)}
The defaultdict(Counter) pattern is idiomatic for this class of problem: outer defaultdict eliminates the outer key existence check, inner Counter eliminates the inner key existence check AND gives most_common() for free.
Level 3 - Design a Multi-Source Recommendation Engine
Design a system that:
- Maintains a catalogue of items with metadata (
id,name,score,category) - Supports O(1) lookup by item ID
- Returns the top 10 highest-score items in a given category efficiently
- Supports fast "is this item in the user's seen list?" checks
- Processes a stream of user events to update item scores in real time
Describe your data structure choices for each requirement. Then implement a minimal working version.
Solution
Data structure choices:
Requirement 1 (O(1) lookup by ID):
→ dict: {item_id: item_dict} - hash-based O(1)
Requirement 2 (top 10 per category):
→ defaultdict(list) of (score, item_id) pairs + heapq per category
→ Or: defaultdict(Counter) if scores are integers
→ For real-time updates: maintain a max-heap per category
Requirement 3 (user seen list, fast membership):
→ set per user: {item_id, ...} - O(1) membership check
→ NOT list - O(n) per check would be O(n²) for n items checked
Requirement 4 (update scores from event stream):
→ dict[item_id]["score"] += delta - O(1) update
→ Rebuild heap lazily (invalidate on update, rebuild when queried)
→ Or use sortedcontainers.SortedList for O(log n) insert+delete
Minimal implementation:
from collections import defaultdict
from typing import NamedTuple, Optional
import heapq
class Item(NamedTuple):
item_id: int
name: str
category: str
score: float = 0.0
class RecommendationEngine:
"""
Catalogue with O(1) lookup, O(log n) score updates, O(k + n log k) top-k queries.
"""
def __init__(self):
self._items: dict[int, dict] = {} # item_id → mutable item dict
self._by_category: defaultdict[str, list] = defaultdict(list) # category → [(score, id)]
self._user_seen: defaultdict[int, set] = defaultdict(set) # user_id → {item_ids}
self._dirty_categories: set[str] = set() # Categories needing heap rebuild
def add_item(self, item_id: int, name: str, category: str, score: float = 0.0):
self._items[item_id] = {'id': item_id, 'name': name,
'category': category, 'score': score}
heapq.heappush(self._by_category[category], (-score, item_id))
def update_score(self, item_id: int, delta: float):
"""O(1) update. Marks category as dirty for lazy heap rebuild."""
if item_id not in self._items:
raise KeyError(f"Unknown item: {item_id}")
self._items[item_id]['score'] += delta
category = self._items[item_id]['category']
self._dirty_categories.add(category)
def _rebuild_heap(self, category: str):
"""O(k) where k = items in category. Called lazily."""
items_in_cat = [
(-self._items[iid]['score'], iid)
for iid in set(iid for _, iid in self._by_category[category])
if self._items[iid]['category'] == category
]
heapq.heapify(items_in_cat)
self._by_category[category] = items_in_cat
self._dirty_categories.discard(category)
def top_k(self, category: str, k: int = 10, user_id: Optional[int] = None) -> list:
"""
Returns top k items in category, optionally filtering seen items.
O(k log n) where n = items in category.
"""
if category in self._dirty_categories:
self._rebuild_heap(category)
heap = list(self._by_category[category])
seen = self._user_seen.get(user_id, set()) if user_id else set()
results = []
while heap and len(results) < k:
neg_score, item_id = heapq.heappop(heap)
if item_id not in seen and item_id in self._items:
results.append(self._items[item_id])
return results
def mark_seen(self, user_id: int, item_id: int):
"""O(1) - set add."""
self._user_seen[user_id].add(item_id)
def is_seen(self, user_id: int, item_id: int) -> bool:
"""O(1) - set membership."""
return item_id in self._user_seen[user_id]
# Demo
engine = RecommendationEngine()
engine.add_item(1, "Python for Engineers", "python", score=4.8)
engine.add_item(2, "Data Structures Deep Dive", "python", score=4.6)
engine.add_item(3, "ML Fundamentals", "ml", score=4.9)
engine.add_item(4, "Advanced Python", "python", score=4.7)
engine.mark_seen(user_id=42, item_id=1)
engine.update_score(item_id=2, delta=0.2) # Score now 4.8
top = engine.top_k("python", k=2, user_id=42)
print([item['name'] for item in top])
# ['Data Structures Deep Dive', 'Advanced Python']
# (item 1 excluded - user 42 already saw it)
Summary of choices:
dictfor O(1) item lookup by IDdefaultdict(list)+heapqper category for O(log n) inserts and O(k log n) top-kdefaultdict(set)for per-user seen lists with O(1)addand O(1)incheckssetfor dirty category tracking - O(1) insert, O(1) discard- Lazy heap rebuild - avoids O(n log n) rebuild on every score update
Quick Reference
| Problem Pattern | Data Structure | Key Operation | Complexity |
|---|---|---|---|
| Ordered sequence | list | append, a[i] | O(1) access, O(1)* append |
| Front insertions / BFS | deque | appendleft, popleft | O(1) both ends |
| Sliding window last-k | deque(maxlen=k) | append (auto-evicts) | O(1) per element |
| Key → value lookup | dict | d[k], k in d | O(1) average |
| Membership testing | set | x in s | O(1) average |
| Deduplication | set(iterable) | constructor | O(n) |
| Frequency counting | Counter | Counter(iterable) | O(n) |
| Top-k by frequency | Counter.most_common(k) | heapq internally | O(n + k log n) |
| Grouping by key | defaultdict(list) | dd[k].append(v) | O(1) per insert |
| Priority / min | heapq | heappush, heappop | O(log n) |
| Immutable record | NamedTuple | attribute access | O(1) |
| Layered config | ChainMap | chain[key] | O(layers) |
| LRU cache | OrderedDict | move_to_end | O(1) |
| Memoization | dict / lru_cache | cache lookup | O(1) after first call |
| Shortest path | heapq + dict + defaultdict | Dijkstra | O((V+E) log V) |
Key Takeaways
- Data structure selection is an architectural decision driven by operation frequency - identify which operations dominate, then choose the structure where those operations are cheapest
- The decision flowchart: ordered sequence → list or deque; key-value → dict; membership → set; immutable record → NamedTuple; priority → heapq; frequency → Counter; grouping → defaultdict(list); sliding window → deque(maxlen); config layering → ChainMap; LRU → OrderedDict
- Space-time tradeoff is always explicit - hash-based structures (set, dict, Counter) trade O(n) memory for O(1) lookup; always calculate whether the time savings justify the memory cost
- The most common production anti-patterns are
list.index()in a loop (O(n²)),x in listinside loops (O(n²)),list.pop(0)in a queue pattern (O(n²)), and full sort when only top-k is needed - all fixed by a single structure change - In interviews, always state the structure, justify with the dominant operation, then analyze time and space - this demonstrates engineering reasoning, not memorization
- The structures in this module are not alternatives to each other - they are composable: Dijkstra uses heapq + dict + defaultdict simultaneously; recommendation engines combine dict + defaultdict + heapq; LRU cache combines dict + doubly-linked list (via OrderedDict)
Counter.most_common(k)uses heapq.nlargest internally - O(n + k log n) - significantly faster than sorting whenk << total unique elements; understanding internals lets you reason about when stdlib functions are already optimal
