Data Structure Selection Strategy - Engineering the Right Choice

Reading time: ~22 minutes | Level: Foundation → Engineering

Here is code that works correctly but will destroy production at scale.

def find_duplicates(records: list) -> list:
    """Find all records that appear more than once."""
    duplicates = []
    for i, record in enumerate(records):
        if record in records[i+1:]:     # O(n) membership check!
            if record not in duplicates:  # O(n) membership check!
                duplicates.append(record)
    return duplicates

# Test
data = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]
print(find_duplicates(data))   # ['alpha', 'beta']  ← correct output

The output is correct. The code is a disaster.

Three nested O(n) operations inside a loop creates O(n³) behavior. On a list of 10,000 records, that is 10³ × 10³ × 10³ = 1 trillion operations. On a list of 1,000,000 records it would take years.

The fix is not a better algorithm - it is choosing the right data structure:

def find_duplicates_fast(records: list) -> list:
    """Find all records that appear more than once. O(n) time, O(n) space."""
    from collections import Counter
    counts = Counter(records)
    return [record for record, count in counts.items() if count > 1]

data = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]
print(find_duplicates_fast(data))  # ['alpha', 'beta']

Same output. O(n) time. The only change was knowing which data structure to reach for.

This lesson teaches you the framework for making that decision correctly - every time.

What You Will Learn

The engineering decision framework: operations × frequency → complexity budget
A complete ASCII decision flowchart for choosing between Python's data structures
How to read a problem statement and immediately map it to the right structure
Five real-world case studies with analysis of wrong vs correct choices
Space-time tradeoffs: when buying memory to save time is the right engineering call
How to state a data structure choice in an interview - structure, justify, analyze
The most common anti-patterns - and exactly why they fail at scale
How the lessons in this module connect: list internals → dict internals → sets → deque → Counter → final synthesis

Prerequisites

All prior lessons in Module 4 (Data Structures - Pythonic):
- Lesson 01: Lists Internals
- Lesson 06: Dictionaries Deep Dive
- Lesson 08: Sets and Frozensets
- Lesson 10: Tuples and Memory
- Lesson 13: Deque and Collections
- Lesson 14: Counter and Defaultdict
Basic Big-O notation: O(1), O(log n), O(n), O(n log n), O(n²)
Python heapq module (introduced in this lesson)

Part 1 - The Decision Framework

Operations × Frequency → Complexity Budget

Every data structure decision starts with a single question: what operations will I perform most frequently?

The cost of your program is dominated by:

Total cost ≈ Σ (operation_frequency × operation_cost)

If you perform an O(n) operation once, that is fine. If you perform it n times in a loop, you have O(n²). The data structure must match the dominant operation pattern.

The Four Questions to Ask First

Before writing a single line of code, answer these:

1. OPERATIONS:   What will I do to the data most often?
                 Read? Write? Search? Add at front? Prioritize?

2. FREQUENCY:    How many times? Once? Per-request? Per-element in a loop?

3. SIZE:         How large can the data grow?
                 100 elements? 1 million? Unbounded stream?

4. MUTABILITY:   Does the data change after creation?
                 Immutable → tuple/namedtuple
                 Mutable → list/dict/set/deque

Part 2 - The Decision Flowchart

Read this top-to-bottom. Stop at the first matching node.

Special-purpose structures:

Use case	Structure
Priority / min / max tracking	`heapq` (min-heap)
Counting frequencies	`Counter`
Grouping (one-to-many)	`defaultdict(list)`
Sliding window (last k elements)	`deque(maxlen=k)`
Layered config / scopes	`ChainMap`
LRU cache / order-sensitive	`OrderedDict`

Part 3 - Structure-to-Use-Case Reference

Complete Complexity Matrix

Structure	Access	Insert	Delete	Search / Membership
`list`	O(1) by index	O(1)* end / O(n) front	O(n)	O(n) - linear scan; use set for repeated checks
`tuple`	O(1) by index	N/A (immutable)	N/A (immutable)	O(n) - linear scan
`dict`	O(1)* by key	O(1)* amortized	O(1)* amortized	O(1)* key lookup via hash; O(n) value search
`set`	N/A	O(1)* amortized	O(1)* amortized	O(1)* hash-based - the whole point of set
`deque`	O(n) by index	O(1) both ends	O(1) both ends	O(n) - not designed for search
`heapq` (list)	O(1) min only	O(log n)	O(log n) for min	O(n) - use dict alongside for lookups
`Counter`	O(1)* by key	O(1)*	O(1)*	O(1)* - same as dict; missing key returns 0
`defaultdict`	O(1)* by key	O(1)*	O(1)*	O(1)* - same as dict; missing key triggers factory
`OrderedDict`	O(1)* by key	O(1)*	O(1)*	O(1)* - hash + order tracking; `move_to_end()` is O(1)
`ChainMap`	O(k)* k=layers	O(1)* to first	O(1)* first map	O(k)* - k = number of maps; walks each map in order

* Average case. Amortized where noted.

Quick Selection Guide

# ORDERED SEQUENCE: use list or deque
ordered_data = [1, 2, 3, 4, 5]         # Random access → list
front_heavy = deque([1, 2, 3])          # Front ops → deque

# KEY-VALUE LOOKUP: use dict
user_profile = {'id': 42, 'name': 'Alice', 'role': 'admin'}

# MEMBERSHIP TEST AT SCALE: use set
allowed_ips = {'192.168.1.1', '10.0.0.1', '172.16.0.0'}
if request_ip in allowed_ips:           # O(1), not O(n)
    pass

# IMMUTABLE RECORD: use tuple or NamedTuple
from typing import NamedTuple
class Config(NamedTuple):
    host: str
    port: int = 5432

# PRIORITY / MIN TRACKING: use heapq
import heapq
tasks = []
heapq.heappush(tasks, (1, "high priority"))
heapq.heappush(tasks, (5, "low priority"))
next_task = heapq.heappop(tasks)        # Always returns min: (1, "high priority")

# FREQUENCY COUNTING: use Counter
from collections import Counter
freq = Counter(["a", "b", "a", "c", "a", "b"])
print(freq.most_common(2))              # [('a', 3), ('b', 2)]

# GROUPING (one-to-many): use defaultdict(list)
from collections import defaultdict
groups = defaultdict(list)
for k, v in [("x", 1), ("y", 2), ("x", 3)]:
    groups[k].append(v)

# SLIDING WINDOW: use deque(maxlen=k)
window = deque(maxlen=3)
for val in [1, 2, 3, 4, 5]:
    window.append(val)
    print(list(window))                 # Always last 3 elements

Part 4 - Five Real-World Case Studies

Case Study 1: Find All Duplicates in a List

Problem: Given a list of n elements, find all elements that appear more than once. Preserve the order of first occurrence in the output.

Analysis of Approaches:

import time

data = list(range(10_000)) + list(range(5_000))  # 5000 duplicates
import random; random.shuffle(data)

# Approach 1: Nested loops - O(n²)
def find_dups_nested(lst):
    result = []
    for i, item in enumerate(lst):
        if item in lst[i+1:]:           # O(n) slice + O(n) search = O(n)
            if item not in result:      # O(n) search
                result.append(item)
    return result
# n=15000 → n² ≈ 225 million comparisons - takes seconds

# Approach 2: Sort + compare neighbors - O(n log n)
def find_dups_sort(lst):
    sorted_lst = sorted(lst)            # O(n log n)
    dups = set()
    for i in range(1, len(sorted_lst)):
        if sorted_lst[i] == sorted_lst[i-1]:
            dups.add(sorted_lst[i])
    # But: this loses original order information
    return [x for x in lst if x in dups]  # O(n) pass to restore order

# Approach 3: Counter - O(n)
from collections import Counter
def find_dups_counter(lst):
    counts = Counter(lst)               # O(n) - one pass
    seen_dups = {k for k, v in counts.items() if v > 1}  # O(k) where k=unique
    return [x for x in lst if x in seen_dups][:len(seen_dups)]  # first occurrences

# Better: preserve insertion order without a second pass
def find_dups_optimal(lst):
    """O(n) time, O(n) space. Preserves first-occurrence order."""
    counts = Counter(lst)               # O(n) - one pass for all counts
    seen = set()
    result = []
    for item in lst:                    # O(n) - one pass to collect in order
        if counts[item] > 1 and item not in seen:
            result.append(item)
            seen.add(item)
    return result

print(find_dups_optimal(data)[:5])   # First 5 duplicates (in order)

Complexity comparison:

Approach	Time	Space	Notes
Nested loops	O(n²)	O(n)	Fails at scale
Sort	O(n log n)	O(n)	Loses order without extra pass
Counter	O(n)	O(n)	Optimal - one pass for counts

Key insight: The bottleneck was the item in list membership check inside a loop - O(n) per check × n items = O(n²). Converting to Counter (a hash-based structure) reduces each check to O(1).

Case Study 2: Top 5 Most Frequent Words

Problem: Given a document, find the 5 most frequently occurring words. Exclude common stop words.

from collections import Counter
import re

def top_5_words_wrong(text: str, stop_words: set) -> list:
    """
    Wrong approach: sort the full word list every time.
    O(n log n) when we only need top 5.
    """
    words = [w for w in re.findall(r'\b\w+\b', text.lower()) if w not in stop_words]
    word_list = list(set(words))         # Deduplicate
    word_list.sort(key=lambda w: words.count(w), reverse=True)  # O(n²)! count is O(n) each call
    return word_list[:5]

def top_5_words_right(text: str, stop_words: set) -> list:
    """
    Correct approach: Counter + most_common.
    O(n) counting + O(k + 5 log k) for top-5 via heapq.
    """
    words = re.findall(r'\b\w+\b', text.lower())
    filtered = (w for w in words if w not in stop_words)  # generator - no intermediate list
    counts = Counter(filtered)           # O(n) single pass
    return counts.most_common(5)         # O(k + 5 log k) - heapq.nlargest internally

# Demo
text = """
Python is a versatile programming language. Python supports object-oriented
programming, functional programming, and procedural programming. Python has
a large standard library that makes Python suitable for many programming tasks.
Engineers who learn Python find Python makes them more productive.
"""

stop_words = {'a', 'an', 'the', 'is', 'and', 'that', 'for', 'who', 'them',
              'more', 'find', 'makes', 'has', 'many', 'large', 'also'}

print(top_5_words_right(text, stop_words))
# [('python', 6), ('programming', 5), ('language', 1), ('versatile', 1), ('supports', 1)]

Why most_common(5) beats sorted(counter.items()):

n = total words, k = unique words

sorted(counter.items())[:5]:
  - Sort all k unique words: O(k log k)
  - Slice: O(1)
  Total: O(k log k)

counter.most_common(5):
  - heapq.nlargest(5, counter.items()): O(k + 5 log k) = O(k)
  Total: O(k) - linear in unique words

For k = 100,000 unique words and n = 5:
  sort:      100,000 × 17 ≈ 1.7 million operations
  heapq:     100,000 + 5×17 ≈ 100,085 operations
  Speedup: ~17x

Case Study 3: Implement an LRU Cache

Problem: Implement a Least Recently Used cache with O(1) get and O(1) put.

An LRU cache evicts the least recently used item when the cache is full. It requires:

O(1) lookup by key
O(1) move-to-front on access (mark as most recent)
O(1) evict the oldest item

from collections import OrderedDict
from typing import Optional

class LRUCache:
    """
    O(1) get and put.

    Internal structure: OrderedDict maintains insertion order.
    move_to_end() makes the accessed key "most recent".
    popitem(last=False) removes the "least recent" key.

    Memory: O(capacity) - bounded by max size.
    """
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError(f"Capacity must be positive, got {capacity}")
        self.capacity = capacity
        self._cache = OrderedDict()     # key → value, ordered by recency

    def get(self, key) -> Optional[object]:
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)   # O(1) - mark as most recently used
        return self._cache[key]

    def put(self, key, value) -> None:
        if key in self._cache:
            self._cache.move_to_end(key)  # Update recency
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # O(1) - evict least recently used

    def __repr__(self):
        items = list(self._cache.items())
        return f"LRUCache(capacity={self.capacity}, items={items})"

# Demo - trace the cache state
cache = LRUCache(capacity=3)
cache.put("user:1", {"name": "Alice", "score": 100})
cache.put("user:2", {"name": "Bob",   "score": 85})
cache.put("user:3", {"name": "Carol", "score": 92})
print(cache)
# LRUCache(capacity=3, items=[('user:1',...), ('user:2',...), ('user:3',...)])

cache.get("user:1")   # Access user:1 - moves to end (most recent)
cache.put("user:4", {"name": "Dave", "score": 78})  # user:2 evicted (least recent)
print(cache.get("user:2"))   # None - evicted
print(cache.get("user:1"))   # {'name': 'Alice', 'score': 100} - still cached

Why not dict + list?

Alternative: dict for O(1) lookup + list for order tracking
  - get: dict lookup O(1) + list.remove O(n) + list.append O(1) = O(n) total
  - put: same O(n)
  Fails: list.remove is O(n) - must scan to find the element

OrderedDict solution:
  - get: O(1) dict lookup + O(1) move_to_end = O(1) total
  - put: O(1) insert + O(1) move_to_end + O(1) popitem = O(1) total
  Wins: move_to_end is O(1) - backed by a doubly-linked list of entries

Case Study 4: Find Shortest Path (Dijkstra's Algorithm)

Problem: Find the shortest path between nodes in a weighted graph.

This is the canonical case study for combining multiple data structures - each chosen for its specific role.

import heapq
from collections import defaultdict
from typing import Optional

def dijkstra(graph: dict, start: str, end: str) -> tuple[Optional[float], list]:
    """
    Dijkstra's shortest path algorithm.

    Data structures chosen:
    - defaultdict(list): adjacency list - O(1) neighbor lookup per node
    - heapq: min-heap priority queue - O(log V) push/pop for smallest distance
    - dict (distances): O(1) distance lookup by node name
    - set (visited): O(1) membership check to skip processed nodes

    Time:  O((V + E) log V) where V = vertices, E = edges
    Space: O(V + E) for graph + O(V) for distances/visited
    """
    # Min-heap: (distance, node, path_so_far)
    heap = [(0, start, [start])]
    # Best known distance to each node
    distances = {start: 0}
    # Nodes fully processed (their shortest path is finalized)
    visited = set()

    while heap:
        dist, node, path = heapq.heappop(heap)  # O(log V)

        if node in visited:
            continue         # Already processed with a shorter path
        visited.add(node)

        if node == end:
            return dist, path

        for neighbor, weight in graph[node]:
            if neighbor not in visited:
                new_dist = dist + weight
                if new_dist < distances.get(neighbor, float('inf')):
                    distances[neighbor] = new_dist
                    heapq.heappush(heap, (new_dist, neighbor, path + [neighbor]))

    return None, []  # No path exists

# Build graph using defaultdict
graph = defaultdict(list)
edges = [
    ("A", "B", 4), ("A", "C", 2),
    ("B", "D", 3), ("B", "C", 1),
    ("C", "D", 5), ("C", "E", 8),
    ("D", "E", 2), ("D", "F", 6),
    ("E", "F", 1),
]
for u, v, w in edges:
    graph[u].append((v, w))
    graph[v].append((u, w))  # Undirected

distance, path = dijkstra(graph, "A", "F")
print(f"Shortest distance A → F: {distance}")  # 10
print(f"Path: {' → '.join(path)}")             # A → C → B → D → E → F

Why each structure was chosen:

defaultdict(list) for graph:
  → Adding edges: graph[u].append((v, w)) - no KeyError for first edge
  → Neighbor lookup: graph[node] - O(1)
  → Alternative (dict) would need: if u not in graph: graph[u] = []

heapq for priority queue:
  → Always extract the node with MINIMUM current distance
  → heappop: O(log V) - far better than scanning all V nodes O(V)
  → Alternative (sorted list): re-sorting after each insert is O(V log V) per step

dict for distances:
  → O(1) lookup and update: distances[neighbor] = new_dist
  → Alternative (list): only works if nodes are integers 0..V

set for visited:
  → O(1) membership check: if neighbor not in visited
  → Alternative (list): O(V) per check → O(V²) total = catastrophic

Case Study 5: Merge User Preferences from Multiple Configuration Sources

Problem: A web application has three sources of configuration, in priority order: (1) per-request overrides, (2) user preferences from DB, (3) global defaults. Implement efficient lookup that respects priority and stays live (changes to user prefs are immediately visible).

from collections import ChainMap

class UserPreferences:
    """
    Three-layer configuration with live updates.
    Uses ChainMap - no copying, O(k) lookup where k = number of layers (3).

    Priority: request_overrides > user_settings > global_defaults
    """
    GLOBAL_DEFAULTS = {
        'theme': 'light',
        'language': 'en',
        'notifications': True,
        'items_per_page': 20,
        'timezone': 'UTC',
        'currency': 'USD',
    }

    def __init__(self, user_id: int, user_settings: dict):
        self._user_settings = user_settings   # Live dict from DB
        self._request_overrides = {}           # Cleared per request
        self._config = ChainMap(
            self._request_overrides,
            self._user_settings,
            self.GLOBAL_DEFAULTS
        )

    def get(self, key: str):
        return self._config.get(key)

    def set_request_override(self, key: str, value):
        """Override for this request only. Clears itself at request end."""
        self._request_overrides[key] = value

    def clear_request_overrides(self):
        self._request_overrides.clear()

    def update_user_setting(self, key: str, value):
        """Persistent change - updates live dict, visible via ChainMap immediately."""
        self._user_settings[key] = value

    def effective_config(self) -> dict:
        """Full resolved config - for debugging/logging."""
        return dict(self._config)

# Demo
user_db_settings = {'theme': 'dark', 'language': 'fr', 'items_per_page': 50}
prefs = UserPreferences(user_id=42, user_settings=user_db_settings)

print(prefs.get('theme'))             # 'dark' - from user settings
print(prefs.get('timezone'))          # 'UTC' - from global defaults
print(prefs.get('currency'))          # 'USD' - from global defaults

# Admin temporarily overrides for A/B test - request scope only
prefs.set_request_override('items_per_page', 100)
print(prefs.get('items_per_page'))    # 100 - request override wins
prefs.clear_request_overrides()
print(prefs.get('items_per_page'))    # 50 - back to user setting

# User updates their preference - immediately visible (no cache invalidation needed)
prefs.update_user_setting('notifications', False)
print(prefs.get('notifications'))     # False - user setting now shadows default

Why ChainMap beats dict merging:

dict merging ({**defaults, **user, **overrides}):
  - Creates a copy at merge time - O(n) where n = total keys
  - Updating user_settings does NOT update the merged dict
  - Must re-merge on every update - O(n) per update
  - request_overrides must be re-merged for every request

ChainMap:
  - No copy - just stores references to the three dicts: O(1) setup
  - Updates to user_settings or request_overrides visible immediately
  - Lookup: O(k) where k = 3 (number of layers) - effectively O(1)
  - Adding/removing a layer: O(1) via new_child() or parents

Part 5 - Space vs Time Tradeoffs

The Fundamental Engineering Tradeoff

Memory and computation time exist in tension. Choosing a data structure is often choosing where you sit on this curve:

When to Buy Memory for Speed

# SCENARIO: Check if a document has any profanity word
# Called 10 million times per day

# Naive: linear scan each call
banned_words_list = ["spam", "scam", "fraud", ...]   # 10,000 words

def is_clean_slow(text: str) -> bool:
    words = text.lower().split()
    for word in words:
        if word in banned_words_list:   # O(n) - scans 10,000 words each time!
            return False
    return True
# Cost: O(W × B) where W = words in text, B = banned list size

# Better: convert to set once, use O(1) lookup
banned_words_set = set(banned_words_list)   # O(B) one-time cost, ~400KB memory

def is_clean_fast(text: str) -> bool:
    words = text.lower().split()
    return not any(word in banned_words_set for word in words)  # O(W)
# Cost: O(W) - B no longer matters

# Extra memory used: set overhead ≈ 400KB for 10,000 words
# Time saved: 10M calls × (10,000 - 1) operations = 100 BILLION operations saved per day
# Decision: obvious - spend 400KB to save trillions of operations

Memoization: Trading Space for Repeated Computation

from functools import lru_cache

# Fibonacci without memoization: O(2ⁿ) exponential time
def fib_slow(n):
    if n <= 1: return n
    return fib_slow(n-1) + fib_slow(n-2)
# fib_slow(40) calls fib(1) about 100 million times

# With lru_cache (memoization dict internally)
@lru_cache(maxsize=None)
def fib_fast(n):
    if n <= 1: return n
    return fib_fast(n-1) + fib_fast(n-2)
# fib_fast(40) calls each sub-problem ONCE: O(n) time, O(n) space

# Manual memoization for full control
def fibonacci(n: int, memo: dict = None) -> int:
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    result = fibonacci(n-1, memo) + fibonacci(n-2, memo)
    memo[n] = result   # Store result - dict O(1) write
    return result

When NOT to Trade Space for Time

# Bad tradeoff: caching everything when memory is constrained
# and data is queried only once

# This caches ALL results in memory - wasteful if:
# 1. Queries are unique (no repeated keys)
# 2. Data is too large to fit in RAM
# 3. Cache invalidation is complex

all_results_cache = {}  # Grows unbounded

def get_report(date: str) -> dict:
    if date not in all_results_cache:
        all_results_cache[date] = compute_expensive_report(date)
    return all_results_cache[date]

# Fix: use LRU cache with bounded size
from functools import lru_cache

@lru_cache(maxsize=128)   # Keeps only last 128 results
def get_report_bounded(date: str) -> dict:
    return compute_expensive_report(date)

:::tip The Tradeoff Rule Buy memory for speed when: (1) the operation is called frequently (n > 1000 times), (2) the memory cost is bounded (set/dict grows proportional to unique elements, not total calls), (3) the time savings dominate the memory cost. Don't buy memory for speed when: data is queried once, the dataset doesn't fit in RAM, or cache invalidation complexity exceeds the savings. :::

Part 6 - The Anti-Patterns

Anti-Pattern 1: list.index() When You Need O(1) Lookup

# WRONG: O(n) lookup, called n times → O(n²)
user_ids = [101, 202, 303, 404, 505]
incoming_ids = [303, 101, 999, 404]

for uid in incoming_ids:
    try:
        pos = user_ids.index(uid)   # O(n) scan!
        print(f"Found {uid} at position {pos}")
    except ValueError:
        print(f"{uid} not found")

# RIGHT: dict for O(1) position lookup
uid_to_position = {uid: i for i, uid in enumerate(user_ids)}  # O(n) once

for uid in incoming_ids:
    if uid in uid_to_position:      # O(1)
        print(f"Found {uid} at position {uid_to_position[uid]}")
    else:
        print(f"{uid} not found")

Anti-Pattern 2: list for Membership When set Works

# WRONG: O(n) membership check in a loop → O(n²)
processed = []
for item in large_dataset:
    if item not in processed:       # O(n) scan!
        process(item)
        processed.append(item)

# RIGHT: O(1) membership check with set
processed = set()
for item in large_dataset:
    if item not in processed:       # O(1)
        process(item)
        processed.add(item)

# Note: set doesn't preserve order.
# If you need insertion order AND O(1) membership:
processed_set = set()
processed_ordered = []
for item in large_dataset:
    if item not in processed_set:
        process(item)
        processed_set.add(item)
        processed_ordered.append(item)

Anti-Pattern 3: list When Deque Is Needed for Front Operations

# WRONG: O(n²) total - pop(0) is O(n) each time
def process_queue_wrong(items: list) -> list:
    queue = items.copy()
    results = []
    while queue:
        item = queue.pop(0)    # O(n) - shifts all elements left!
        results.append(item * 2)
    return results

# RIGHT: O(n) total - popleft() is O(1)
from collections import deque
def process_queue_right(items: list) -> list:
    queue = deque(items)
    results = []
    while queue:
        item = queue.popleft()  # O(1)
        results.append(item * 2)
    return results

# Benchmark
import time
data = list(range(100_000))

start = time.time()
process_queue_wrong(data)
print(f"list.pop(0): {time.time()-start:.3f}s")    # ~2.5s

start = time.time()
process_queue_right(data)
print(f"deque.popleft(): {time.time()-start:.4f}s")  # ~0.008s

Anti-Pattern 4: Tuple When List Is Needed

# WRONG: tuples are immutable - cannot append in a loop
def build_report_wrong(events):
    report = ()
    for event in events:
        report = report + (event,)   # Creates a new tuple each time - O(n²) total!
    return report

# RIGHT: use list for accumulation, convert to tuple at the end if needed
def build_report_right(events):
    report = []
    for event in events:
        report.append(event)   # O(1) amortized
    return tuple(report)       # O(n) conversion at the end

Anti-Pattern 5: Sorting When Only Top-K Is Needed

from collections import Counter
import heapq

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9, 2]

# WRONG: O(n log n) full sort just to get top 3
top3_wrong = sorted(set(data), reverse=True)[:3]   # Sort all k unique

# RIGHT: O(n + k log n) heapq.nlargest - much faster for large n, small k
top3_right = heapq.nlargest(3, set(data))

# For frequency: Counter.most_common(k) uses heapq internally
freq = Counter(data)
top3_frequent = freq.most_common(3)   # O(n) count + O(k + 3 log k) = O(n)
print(top3_frequent)   # [(5, 3), (9, 2), (1, 2)]

Part 7 - Interview Framework: State, Justify, Analyze

In technical interviews, how you articulate your data structure choice matters as much as the choice itself. Use this three-part framework:

The SJA Framework

S - STATE the data structure you will use
J - JUSTIFY why (which operation it optimizes)
A - ANALYZE the complexity (time and space)

Example Interview Exchange

Q: "Given a list of n integers, find all pairs that sum to a target T."

Poor answer: "I'd use a set, then iterate."

Strong answer using SJA:

State: I'll use a set as a lookup table alongside the iteration.

Justify: For each element x, I need to check if T - x exists in the data. Using a list, that check is O(n) per element - O(n²) total. A set stores elements in a hash table, giving O(1) average-case membership testing.

Analyze: Time: O(n) to build the set, O(n) for the loop with O(1) checks per element → O(n) total. Space: O(n) for the set. This is optimal - we must look at each element at least once.

def find_pairs(nums: list, target: int) -> list:
    """O(n) time, O(n) space. Set for O(1) complement lookup."""
    seen = set()
    pairs = []
    for x in nums:
        complement = target - x
        if complement in seen:         # O(1) - hash lookup
            pairs.append((complement, x))
        seen.add(x)
    return pairs

print(find_pairs([2, 7, 11, 15, 4, 3], target=9))
# [(2, 7), (4, 5)]... depends on input - but always O(n)

Anti-Patterns in Interviews

WRONG: "I'll use a list and check if each element is in it."
  → Signals you don't know set membership is O(1)

WRONG: "I'll sort it first." (when you don't need the order)
  → Unnecessary O(n log n) when O(n) is possible

WRONG: "I'll use a dict." (for every problem)
  → No justification - sounds like guessing

WRONG: Choosing the structure, then justifying backward
  → Should be: analyze operations first, then choose

Part 8 - Module Synthesis: How the Lessons Connect

This lesson is the capstone of Module 4. Each previous lesson taught one structure deeply. Now you see how they fit together:

Module 4 - Data Structures Pythonic: The Full Picture

Lesson 01 - Lists Internals
  Taught: Dynamic array, pointer indirection, O(1) append amortized
  Use when: Ordered sequence, random access, general-purpose accumulation

Lesson 02 - List Comprehensions and Generators
  Taught: Lazy evaluation, memory efficiency for large sequences
  Use when: Building filtered/transformed collections without memory spikes

Lesson 06 - Dictionaries Deep Dive
  Taught: Hash tables, O(1) lookup, collision handling, insertion-order
  Use when: Key → value mapping, caching, counting (see Counter)

Lesson 08 - Sets and Frozensets
  Taught: Hash-based membership, mathematical set operations
  Use when: Deduplication, membership testing, set algebra

Lesson 10 - Tuples and Memory
  Taught: Immutability, hashability, lower memory vs list
  Use when: Fixed records, dict keys, return values, namedtuple base

Lesson 11 - Strings and Unicode
  Taught: Immutable string internals, encoding
  Applies: str is hashable - valid Counter/dict key, set element

Lesson 13 - Deque and Collections
  Taught: O(1) both-ends ops, maxlen rotating buffer, ChainMap, namedtuple
  Use when: FIFO/BFS, front insertions, sliding window, config layers

Lesson 14 - Counter and Defaultdict
  Taught: Frequency counting, default factory, KeyError elimination
  Use when: Counting, grouping, top-k, multiset operations

Lesson 15 (THIS) - Selection Strategy
  Teaches: The decision framework that unifies all of the above

The Unified Decision Table

# Cheat-sheet: problem statement keywords → data structure

KEYWORD_TO_STRUCTURE = {
    "frequency / count occurrences":        "Counter",
    "most common / top k":                  "Counter.most_common(k)",
    "group by / one-to-many":               "defaultdict(list)",
    "membership / contains / seen before":  "set",
    "deduplication":                        "set",
    "ordered sequence / random access":     "list",
    "front insertion / BFS queue / FIFO":   "deque",
    "sliding window / last N items":        "deque(maxlen=k)",
    "immutable record / tuple with names":  "typing.NamedTuple",
    "priority / always get minimum":        "heapq",
    "shortest path":                        "heapq + dict + defaultdict",
    "LRU cache / move to most recent":      "OrderedDict",
    "layered config / scoped lookup":       "ChainMap",
    "key-value with defaults":              "defaultdict",
    "fixed config / read-only mapping":     "dict (plain)",
    "hashable record / dict key":           "tuple or frozenset",
}

Interview Questions

Q1: Someone shows you code using `list.index()` inside a loop over n elements. What is the time complexity, and how do you fix it?

Answer: list.index(x) performs a linear scan - O(n). Called n times inside a loop makes the total complexity O(n²). The fix depends on what you need: if you need existence checking, convert the list to a set (O(n) to build, O(1) per check → O(n) total). If you need the position by value, build a dict mapping {value: index} from the list - O(n) to build, O(1) lookup → O(n) total. The key insight: whenever you call a linear-scan operation inside a loop, you have an implicit O(n²) that needs to be refactored using a hash-based structure.

Q2: You need to find the k most frequent elements in a list of n elements. What data structure and algorithm would you use? What is the complexity?

Answer: Use Counter to count frequencies - O(n) time, O(k) space where k is the number of unique elements. Then use Counter.most_common(k) which internally calls heapq.nlargest(k, counter.items()) - O(unique_k + k log unique_k) time. Total: O(n + unique_k). For k much smaller than unique elements, this is significantly faster than full sorting O(unique_k log unique_k). Alternative using only heapq: heapq.nlargest(k, counts.items(), key=lambda x: x[1]) directly - same complexity.

Q3: When is it correct to use a list for something you'd normally use a set for?

Answer: When any of these apply: (1) You need to preserve insertion order and cannot lose duplicates - set doesn't preserve order and eliminates duplicates. (2) The data is unhashable (contains lists, dicts, other mutable objects) - sets require hashable elements. (3) The dataset is very small (< 20 elements) - the constant overhead of hash table vs linear scan is negligible at small scales, and a list may be simpler. (4) You need multiple occurrences of the same element (multiset) - use Counter instead of set. (5) You need positional access data[i] - sets don't support indexing.

Q4: How would you choose between heapq and sorted() for a problem that requires repeatedly getting the minimum element as new items are added?

Answer: Use heapq. sorted() re-sorts the entire list - O(n log n) per sort. Each time you add a new item and want the new minimum, you'd re-sort - O(n log n) per insertion. With heapq: heappush is O(log n) per insertion, heappop (get minimum) is O(log n). For n insertions + n extractions, heapq costs O(n log n) total vs sorted() costs O(n² log n). Use sorted() only when you need the full sorted order once and the data is static. Use heapq when you repeatedly insert and extract minimum - i.e., an online or streaming setting.

Q5: Design a data structure to answer "how many distinct values appeared in the last k events" efficiently.

Answer: Use deque(maxlen=k) for the sliding window and a Counter for frequency tracking. When an element enters the window: counter[x] += 1. When an element leaves (evicted from deque): counter[x] -= 1; if counter[x] == 0: del counter[x]. The number of distinct values is len(counter). Each operation is O(1) amortized. Space: O(k) for the window.

from collections import deque, Counter

class DistinctInWindow:
    def __init__(self, k: int):
        self.window = deque(maxlen=k)
        self.counts = Counter()

    def add(self, x) -> int:
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]   # About to be evicted
            self.counts[evicted] -= 1
            if self.counts[evicted] == 0:
                del self.counts[evicted]
        self.window.append(x)
        self.counts[x] += 1
        return len(self.counts)        # Distinct count

tracker = DistinctInWindow(k=3)
for x in [1, 2, 1, 3, 4, 1]:
    print(f"Add {x}: {tracker.add(x)} distinct in last 3")
# Add 1: 1  (window: [1])
# Add 2: 2  (window: [1,2])
# Add 1: 2  (window: [1,2,1])
# Add 3: 3  (window: [2,1,3])
# Add 4: 3  (window: [1,3,4])
# Add 1: 3  (window: [3,4,1])

Q6: In an interview, a candidate says "I'll use a dict for this." What follow-up questions should reveal whether they truly understand the choice?

Answer: Strong follow-ups: (1) "What is the expected size of the dict, and how many unique keys?" - tests awareness that hash tables have O(n) worst-case when many collisions occur. (2) "Do you need the keys in insertion order? Python 3.7+ guarantees it, but does the algorithm rely on it?" - tests awareness of dict ordering. (3) "What happens if the same key is inserted twice - what should the dict store?" - tests whether they need defaultdict, Counter, or a custom merge strategy. (4) "Is the dict shared across threads?" - tests awareness that dict operations are not atomic (GIL nuances). (5) "What is your eviction or cleanup strategy?" - unbounded dicts are a memory leak; tests whether they need an LRU cache or TTL mechanism.

Practice Challenges

Level 1 - Choose the Structure

For each scenario, name the correct Python data structure and justify in one sentence:

You receive a stream of IP addresses. After every 1000 IPs, report which IP appeared most often.
You are building a web crawler. Track which URLs you have already visited.
Store a user's (first_name, last_name, age) for an immutable audit log entry.
Implement a print job queue where the highest-priority job always prints next.
Track the last 30 HTTP response times for a latency dashboard.

Solution

Counter - frequency counting in O(n), most_common(1) in O(k) using heapq.
set - O(1) membership test (url in visited), O(1) insert. A list would be O(n) per check → O(n²) total for n URLs.
typing.NamedTuple - immutable (audit integrity), hashable, named attributes for clarity, memory-efficient (no __dict__).
heapq - heappush(queue, (priority, job)) O(log n), heappop(queue) O(log n), always returns minimum-priority (highest priority) job. A sorted list would be O(n) per insert.
deque(maxlen=30) - O(1) append; automatically evicts the oldest when full; sum(window) / len(window) for average in O(30) = O(1). No manual index management needed.

Level 2 - Refactor the O(n²) Code

The following function ships to production and works for small inputs. Find the O(n²) bottleneck and rewrite it to O(n).

def process_events(events: list) -> dict:
    """
    Given a list of (user_id, event_type) tuples,
    return for each user_id: count of each event_type they triggered.
    """
    result = {}
    for user_id, event_type in events:
        if user_id not in result:
            result[user_id] = {}
        if event_type not in result[user_id]:
            result[user_id][event_type] = 0
        result[user_id][event_type] += 1
    return result

# This is actually O(n) - but can we make it cleaner AND add a method
# to get the top event type per user?

Solution

The original is already O(n) in time complexity, but the code is unnecessarily verbose and the nested dict initialization pattern is a defaultdict use case. Refactoring for clarity and adding the top-event feature:

from collections import defaultdict, Counter

def process_events_clean(events: list) -> defaultdict:
    """
    O(n) time, O(u × e) space where u = unique users, e = unique event types.
    Uses defaultdict(Counter) - inner Counter handles missing event_types.
    """
    result = defaultdict(Counter)
    for user_id, event_type in events:
        result[user_id][event_type] += 1   # defaultdict handles outer missing key
                                            # Counter handles inner missing event_type
    return result

def top_event_per_user(events: list) -> dict:
    """For each user, return their most frequent event type."""
    aggregated = process_events_clean(events)
    return {
        user_id: counter.most_common(1)[0]   # (event_type, count) tuple
        for user_id, counter in aggregated.items()
    }

sample = [
    (1, "login"), (2, "click"), (1, "purchase"), (1, "login"),
    (2, "login"), (3, "click"), (2, "click"), (1, "login"),
]

print(dict(process_events_clean(sample)))
# {1: Counter({'login': 3, 'purchase': 1}),
#  2: Counter({'click': 2, 'login': 1}),
#  3: Counter({'click': 1})}

print(top_event_per_user(sample))
# {1: ('login', 3), 2: ('click', 2), 3: ('click', 1)}

The defaultdict(Counter) pattern is idiomatic for this class of problem: outer defaultdict eliminates the outer key existence check, inner Counter eliminates the inner key existence check AND gives most_common() for free.

Level 3 - Design a Multi-Source Recommendation Engine

Design a system that:

Maintains a catalogue of items with metadata (id, name, score, category)
Supports O(1) lookup by item ID
Returns the top 10 highest-score items in a given category efficiently
Supports fast "is this item in the user's seen list?" checks
Processes a stream of user events to update item scores in real time

Describe your data structure choices for each requirement. Then implement a minimal working version.

Solution

Data structure choices:

Requirement 1 (O(1) lookup by ID):
  → dict: {item_id: item_dict}  - hash-based O(1)

Requirement 2 (top 10 per category):
  → defaultdict(list) of (score, item_id) pairs + heapq per category
  → Or: defaultdict(Counter) if scores are integers
  → For real-time updates: maintain a max-heap per category

Requirement 3 (user seen list, fast membership):
  → set per user: {item_id, ...} - O(1) membership check
  → NOT list - O(n) per check would be O(n²) for n items checked

Requirement 4 (update scores from event stream):
  → dict[item_id]["score"] += delta - O(1) update
  → Rebuild heap lazily (invalidate on update, rebuild when queried)
  → Or use sortedcontainers.SortedList for O(log n) insert+delete

Minimal implementation:

from collections import defaultdict
from typing import NamedTuple, Optional
import heapq

class Item(NamedTuple):
    item_id: int
    name: str
    category: str
    score: float = 0.0

class RecommendationEngine:
    """
    Catalogue with O(1) lookup, O(log n) score updates, O(k + n log k) top-k queries.
    """
    def __init__(self):
        self._items: dict[int, dict] = {}           # item_id → mutable item dict
        self._by_category: defaultdict[str, list] = defaultdict(list)  # category → [(score, id)]
        self._user_seen: defaultdict[int, set] = defaultdict(set)       # user_id → {item_ids}
        self._dirty_categories: set[str] = set()    # Categories needing heap rebuild

    def add_item(self, item_id: int, name: str, category: str, score: float = 0.0):
        self._items[item_id] = {'id': item_id, 'name': name,
                                 'category': category, 'score': score}
        heapq.heappush(self._by_category[category], (-score, item_id))

    def update_score(self, item_id: int, delta: float):
        """O(1) update. Marks category as dirty for lazy heap rebuild."""
        if item_id not in self._items:
            raise KeyError(f"Unknown item: {item_id}")
        self._items[item_id]['score'] += delta
        category = self._items[item_id]['category']
        self._dirty_categories.add(category)

    def _rebuild_heap(self, category: str):
        """O(k) where k = items in category. Called lazily."""
        items_in_cat = [
            (-self._items[iid]['score'], iid)
            for iid in set(iid for _, iid in self._by_category[category])
            if self._items[iid]['category'] == category
        ]
        heapq.heapify(items_in_cat)
        self._by_category[category] = items_in_cat
        self._dirty_categories.discard(category)

    def top_k(self, category: str, k: int = 10, user_id: Optional[int] = None) -> list:
        """
        Returns top k items in category, optionally filtering seen items.
        O(k log n) where n = items in category.
        """
        if category in self._dirty_categories:
            self._rebuild_heap(category)

        heap = list(self._by_category[category])
        seen = self._user_seen.get(user_id, set()) if user_id else set()
        results = []

        while heap and len(results) < k:
            neg_score, item_id = heapq.heappop(heap)
            if item_id not in seen and item_id in self._items:
                results.append(self._items[item_id])

        return results

    def mark_seen(self, user_id: int, item_id: int):
        """O(1) - set add."""
        self._user_seen[user_id].add(item_id)

    def is_seen(self, user_id: int, item_id: int) -> bool:
        """O(1) - set membership."""
        return item_id in self._user_seen[user_id]

# Demo
engine = RecommendationEngine()
engine.add_item(1, "Python for Engineers", "python", score=4.8)
engine.add_item(2, "Data Structures Deep Dive", "python", score=4.6)
engine.add_item(3, "ML Fundamentals", "ml", score=4.9)
engine.add_item(4, "Advanced Python", "python", score=4.7)

engine.mark_seen(user_id=42, item_id=1)
engine.update_score(item_id=2, delta=0.2)   # Score now 4.8

top = engine.top_k("python", k=2, user_id=42)
print([item['name'] for item in top])
# ['Data Structures Deep Dive', 'Advanced Python']
# (item 1 excluded - user 42 already saw it)

Summary of choices:

dict for O(1) item lookup by ID
defaultdict(list) + heapq per category for O(log n) inserts and O(k log n) top-k
defaultdict(set) for per-user seen lists with O(1) add and O(1) in checks
set for dirty category tracking - O(1) insert, O(1) discard
Lazy heap rebuild - avoids O(n log n) rebuild on every score update

Quick Reference

Problem Pattern	Data Structure	Key Operation	Complexity
Ordered sequence	`list`	`append`, `a[i]`	O(1) access, O(1)* append
Front insertions / BFS	`deque`	`appendleft`, `popleft`	O(1) both ends
Sliding window last-k	`deque(maxlen=k)`	`append` (auto-evicts)	O(1) per element
Key → value lookup	`dict`	`d[k]`, `k in d`	O(1) average
Membership testing	`set`	`x in s`	O(1) average
Deduplication	`set(iterable)`	constructor	O(n)
Frequency counting	`Counter`	`Counter(iterable)`	O(n)
Top-k by frequency	`Counter.most_common(k)`	heapq internally	O(n + k log n)
Grouping by key	`defaultdict(list)`	`dd[k].append(v)`	O(1) per insert
Priority / min	`heapq`	`heappush`, `heappop`	O(log n)
Immutable record	`NamedTuple`	attribute access	O(1)
Layered config	`ChainMap`	`chain[key]`	O(layers)
LRU cache	`OrderedDict`	`move_to_end`	O(1)
Memoization	`dict` / `lru_cache`	cache lookup	O(1) after first call
Shortest path	`heapq + dict + defaultdict`	Dijkstra	O((V+E) log V)

Key Takeaways

Data structure selection is an architectural decision driven by operation frequency - identify which operations dominate, then choose the structure where those operations are cheapest
The decision flowchart: ordered sequence → list or deque; key-value → dict; membership → set; immutable record → NamedTuple; priority → heapq; frequency → Counter; grouping → defaultdict(list); sliding window → deque(maxlen); config layering → ChainMap; LRU → OrderedDict
Space-time tradeoff is always explicit - hash-based structures (set, dict, Counter) trade O(n) memory for O(1) lookup; always calculate whether the time savings justify the memory cost
The most common production anti-patterns are list.index() in a loop (O(n²)), x in list inside loops (O(n²)), list.pop(0) in a queue pattern (O(n²)), and full sort when only top-k is needed - all fixed by a single structure change
In interviews, always state the structure, justify with the dominant operation, then analyze time and space - this demonstrates engineering reasoning, not memorization
The structures in this module are not alternatives to each other - they are composable: Dijkstra uses heapq + dict + defaultdict simultaneously; recommendation engines combine dict + defaultdict + heapq; LRU cache combines dict + doubly-linked list (via OrderedDict)
Counter.most_common(k) uses heapq.nlargest internally - O(n + k log n) - significantly faster than sorting when k << total unique elements; understanding internals lets you reason about when stdlib functions are already optimal

What You Will Learn​

Prerequisites​

Part 1 - The Decision Framework​

Operations × Frequency → Complexity Budget​

The Four Questions to Ask First​

Part 2 - The Decision Flowchart​

Part 3 - Structure-to-Use-Case Reference​

Complete Complexity Matrix​

Quick Selection Guide​

Part 4 - Five Real-World Case Studies​

Case Study 1: Find All Duplicates in a List​

Case Study 2: Top 5 Most Frequent Words​

Case Study 3: Implement an LRU Cache​

Case Study 4: Find Shortest Path (Dijkstra's Algorithm)​

Case Study 5: Merge User Preferences from Multiple Configuration Sources​

Part 5 - Space vs Time Tradeoffs​

The Fundamental Engineering Tradeoff​

When to Buy Memory for Speed​

Memoization: Trading Space for Repeated Computation​

When NOT to Trade Space for Time​

Part 6 - The Anti-Patterns​

Anti-Pattern 1: list.index() When You Need O(1) Lookup​

Anti-Pattern 2: list for Membership When set Works​

Anti-Pattern 3: list When Deque Is Needed for Front Operations​

Anti-Pattern 4: Tuple When List Is Needed​

Anti-Pattern 5: Sorting When Only Top-K Is Needed​

Part 7 - Interview Framework: State, Justify, Analyze​

The SJA Framework​

Example Interview Exchange​

Anti-Patterns in Interviews​

Part 8 - Module Synthesis: How the Lessons Connect​

The Unified Decision Table​

Interview Questions​

Q1: Someone shows you code using list.index() inside a loop over n elements. What is the time complexity, and how do you fix it?​

Q2: You need to find the k most frequent elements in a list of n elements. What data structure and algorithm would you use? What is the complexity?​

Q3: When is it correct to use a list for something you'd normally use a set for?​

Q4: How would you choose between heapq and sorted() for a problem that requires repeatedly getting the minimum element as new items are added?​

Q5: Design a data structure to answer "how many distinct values appeared in the last k events" efficiently.​

Q6: In an interview, a candidate says "I'll use a dict for this." What follow-up questions should reveal whether they truly understand the choice?​

Practice Challenges​

Level 1 - Choose the Structure​

Level 2 - Refactor the O(n²) Code​

Level 3 - Design a Multi-Source Recommendation Engine​

Quick Reference​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - The Decision Framework

Operations × Frequency → Complexity Budget

The Four Questions to Ask First

Part 2 - The Decision Flowchart

Part 3 - Structure-to-Use-Case Reference

Complete Complexity Matrix

Quick Selection Guide

Part 4 - Five Real-World Case Studies

Case Study 1: Find All Duplicates in a List

Case Study 2: Top 5 Most Frequent Words

Case Study 3: Implement an LRU Cache

Case Study 4: Find Shortest Path (Dijkstra's Algorithm)

Case Study 5: Merge User Preferences from Multiple Configuration Sources

Part 5 - Space vs Time Tradeoffs

The Fundamental Engineering Tradeoff

When to Buy Memory for Speed

Memoization: Trading Space for Repeated Computation

When NOT to Trade Space for Time

Part 6 - The Anti-Patterns

Anti-Pattern 1: list.index() When You Need O(1) Lookup

Anti-Pattern 2: list for Membership When set Works

Anti-Pattern 3: list When Deque Is Needed for Front Operations

Anti-Pattern 4: Tuple When List Is Needed

Anti-Pattern 5: Sorting When Only Top-K Is Needed

Part 7 - Interview Framework: State, Justify, Analyze

The SJA Framework

Example Interview Exchange

Anti-Patterns in Interviews

Part 8 - Module Synthesis: How the Lessons Connect

The Unified Decision Table

Interview Questions

Q1: Someone shows you code using `list.index()` inside a loop over n elements. What is the time complexity, and how do you fix it?

Q2: You need to find the k most frequent elements in a list of n elements. What data structure and algorithm would you use? What is the complexity?

Q3: When is it correct to use a list for something you'd normally use a set for?

Q4: How would you choose between heapq and sorted() for a problem that requires repeatedly getting the minimum element as new items are added?

Q5: Design a data structure to answer "how many distinct values appeared in the last k events" efficiently.

Q6: In an interview, a candidate says "I'll use a dict for this." What follow-up questions should reveal whether they truly understand the choice?

Practice Challenges

Level 1 - Choose the Structure

Level 2 - Refactor the O(n²) Code

Level 3 - Design a Multi-Source Recommendation Engine

Quick Reference

Key Takeaways