Python Time Complexity of Data Structures: Practice Problems & Exercises
Practice: Time Complexity of Data Structures
← Back to lessonEasy
Fill in the correct Big-O complexity for each operation. The function should return a list of strings describing the complexity of each operation.
def identify_complexities():
"""Print the Big-O complexity of each operation."""
operations = {
"lst[500]": "O(1)",
"lst.append(x)": "O(1) amortized",
"lst.insert(0, x)": "O(n)",
"x in lst": "O(n)",
"x in d": "O(1)",
"x in s": "O(1)",
"lst.sort()": "O(n log n)",
"len(lst)": "O(1)",
}
for op, complexity in operations.items():
print(f"{op}: {complexity}")
identify_complexities()Solution
def identify_complexities():
operations = {
"lst[500]": "O(1)", # Pointer arithmetic: base + i*8
"lst.append(x)": "O(1) amortized", # Writes to next slot, occasional resize
"lst.insert(0, x)": "O(n)", # Must shift ALL elements right
"x in lst": "O(n)", # Linear scan — no hash table
"x in d": "O(1)", # Dict hash lookup — average case
"x in s": "O(1)", # Set hash lookup — average case
"lst.sort()": "O(n log n)", # Timsort — comparison-based sort
"len(lst)": "O(1)", # Stored in ob_size field of struct
}
for op, complexity in operations.items():
print(f"{op}: {complexity}")
Key insights:
- Index access
lst[i]is always O(1) regardless ofi— Python lists store contiguous pointers, so any index is a single pointer arithmetic calculation. - Membership
inis the biggest trap: O(n) for list/tuple, O(1) average for dict/set. This single difference can cause 10,000x performance gaps at scale. len()is O(1) for all built-in collections — the length is stored in the struct, not computed by counting.list.append()is amortized O(1) — most calls are O(1), but occasional resizes cost O(n). The average over n appends is still O(1).
Expected Output
lst[500]: O(1)
lst.append(x): O(1) amortized
lst.insert(0, x): O(n)
x in lst: O(n)
x in d: O(1)
x in s: O(1)
lst.sort(): O(n log n)
len(lst): O(1)Hints
Hint 1: List index access uses pointer arithmetic — the position does not matter, it is always constant time.
Hint 2: Membership testing (x in collection) is O(n) for lists and tuples (linear scan) but O(1) for dicts and sets (hash lookup).
Predict which approach is faster and by roughly how much for checking if a value exists in a collection of 100,000 elements. Explain the Big-O reasoning.
data_list = list(range(100_000)) data_set = set(range(100_000)) target = 99_999 # worst case for list: last element # Which is faster? found_list = target in data_list # Approach A found_set = target in data_set # Approach B print(f"List lookup: slow (O(n) — scans up to 100,000 elements)") print(f"Set lookup: fast (O(1) — single hash computation)") print(f"Set is dramatically faster for membership testing")
Solution
List lookup: slow (O(n) — scans up to 100,000 elements)
Set lookup: fast (O(1) — single hash computation)
Set is dramatically faster for membership testing
Key insights:
target in data_listis O(n) — Python scans from index 0 to index 99,999 checking each element. With the target at the last position, it performs 100,000 comparisons.target in data_setis O(1) average — Python computeshash(99_999), maps it to a bucket index, and checks one (or very few) slots.- The actual speedup at n=100,000 is roughly 5,000-10,000x. At n=1,000,000, it becomes 50,000-100,000x.
- This is why choosing between list and set for membership testing is one of the most impactful performance decisions in Python.
Expected Output
List lookup: slow (O(n) — scans up to 100,000 elements)
Set lookup: fast (O(1) — single hash computation)
Set is dramatically faster for membership testingHints
Hint 1: The list must scan element by element until it finds the target or reaches the end. The set computes a hash and jumps directly to the bucket.
Hint 2: For the worst case (target is the last element or not present), list scans all n elements while set still does one hash lookup.
The following code inserts elements at the front of a list, which is O(n) per insertion and O(n squared) overall. Rewrite it using collections.deque to achieve O(1) per insertion and O(n) overall.
# SLOW VERSION — O(n²) total
def build_sequence_slow(n):
result = []
for i in range(n):
result.insert(0, i) # O(n) shift every time!
return result
print(build_sequence_slow(10))Solution
from collections import deque
def build_sequence_fast(n):
result = deque()
for i in range(n):
result.appendleft(i) # O(1) — no shifting
return list(result)
print(build_sequence_fast(10))
print("Correct!")
Key insights:
list.insert(0, x)is O(n) because it must shift every existing element one position to the right in contiguous memory.deque.appendleft(x)is O(1) because a deque uses a doubly-linked list of fixed-size blocks — it simply writes to the left end of the leftmost block.- For n=100,000 insertions: list approach takes ~1.8 seconds, deque approach takes ~0.007 seconds — a 260x speedup.
- Always use
dequewhen you need frequent insertions or removals at the front.
from collections import deque
def build_sequence_fast(n):
"""Build a sequence by prepending numbers 0..n-1.
Must use O(1) per insertion instead of O(n).
Return the result as a regular list.
"""
# TODO: Replace list.insert(0, x) with a deque-based approach
passExpected Output
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Correct!Hints
Hint 1: deque.appendleft() is O(1) — it writes to the left end without shifting any elements.
Hint 2: You can convert a deque back to a list with list(my_deque) after building it.
The naive approach builds a CSV string using += in a loop, which is O(n squared) due to string immutability. Rewrite using join() to achieve O(n).
# SLOW VERSION — O(n²)
def build_csv_slow(headers, rows):
result = ",".join(headers)
for row in rows:
result += "\n" # New string allocated!
result += ",".join(row) # Another new string!
return result
headers = ["name", "age", "city"]
rows = [["alice", "30", "nyc"], ["bob", "25", "sf"], ["charlie", "35", "la"]]
print(build_csv_slow(headers, rows))Solution
def build_csv_fast(headers, rows):
lines = [",".join(headers)]
for row in rows:
lines.append(",".join(row))
return "\n".join(lines)
headers = ["name", "age", "city"]
rows = [["alice", "30", "nyc"], ["bob", "25", "sf"], ["charlie", "35", "la"]]
print(build_csv_fast(headers, rows))
Key insights:
- Every
result += partcreates a brand new string, copying all previous content. After k iterations, the copy cost is proportional to k, and summing 1+2+...+n gives O(n squared). "\n".join(lines)computes total length first, allocates one buffer, then copies each piece exactly once — O(n) total.- For 100,000 rows:
+=takes ~1.2 seconds,join()takes ~0.003 seconds — roughly 400x faster. - This pattern applies everywhere: log messages, HTML generation, report building. Always collect parts in a list and join once.
def build_csv_fast(headers, rows):
"""Build a CSV string efficiently.
headers: list of column names
rows: list of lists (each inner list is one row)
Return the complete CSV as a single string.
"""
# TODO: Use join() instead of += in a loop
passExpected Output
name,age,city
alice,30,nyc
bob,25,sf
charlie,35,laHints
Hint 1: Use ",".join(row) to join columns within a row, and "\n".join(all_lines) to join all rows.
Hint 2: Build a list of line strings first, then join them all at once — this is O(n) total instead of O(n squared).
Medium
The naive approach uses a nested loop to find duplicates, running in O(n squared). Rewrite it to run in O(n) using a set for fast membership checking.
# SLOW VERSION — O(n²)
def find_duplicates_slow(items):
duplicates = []
for i, item in enumerate(items):
if item in items[i+1:] and item not in duplicates:
duplicates.append(item)
return duplicates
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(find_duplicates_slow(data))Solution
def find_duplicates_fast(items):
seen = set()
duplicates = []
duplicates_set = set() # For O(1) "already in duplicates" check
for item in items:
if item in seen and item not in duplicates_set:
duplicates.append(item)
duplicates_set.add(item)
seen.add(item)
return duplicates
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(find_duplicates_fast(data))
Why the original is O(n squared):
item in items[i+1:]creates a slice (O(n)) and then scans it (O(n)) — this alone is O(n) per iteration.item not in duplicatesscans the duplicates list (O(k) where k grows).- Total: O(n) iterations x O(n) membership check = O(n squared).
Why the fix is O(n):
item in seenis O(1) average (set hash lookup).item not in duplicates_setis O(1) average (set hash lookup).seen.add(item)is O(1) average.- Total: O(n) iterations x O(1) per iteration = O(n).
def find_duplicates_fast(items):
"""Return a list of items that appear more than once.
Preserve the order of first duplicate occurrence.
Must run in O(n) time.
"""
# TODO: Use a set for O(1) membership checks
passExpected Output
[3, 5, 1]Hints
Hint 1: Track two sets: one for all items seen so far (for O(1) membership), and one for items already added to duplicates (to avoid adding the same duplicate twice).
Hint 2: For each item: if it is in seen AND not in duplicates output, add it to the result. Always add it to seen.
Given a list of words, return the k most frequent words sorted by frequency (descending). Break ties alphabetically. Choose the optimal data structures to achieve O(n log k) time.
# Test case words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"] k = 3 # Expected: ['the', 'is', 'sunny'] — frequencies: the=3, is=3, sunny=2 # Tie between 'the' and 'is' broken alphabetically: is < the, # but both have freq 3 so both appear before sunny (freq 2)
Solution
import heapq
from collections import Counter
def top_k_frequent(words, k):
counts = Counter(words) # O(n) — dict-based counting
# Sort by (-frequency, word) so higher freq comes first,
# and ties are broken alphabetically
return heapq.nsmallest(
k,
counts.keys(),
key=lambda w: (-counts[w], w)
) # O(n log k)
words = ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
print(top_k_frequent(words, 3))
Complexity analysis:
- Counting:
Counter(words)iterates all n words, each dict update is O(1) average. Total: O(n). - Top-k extraction:
heapq.nsmallest(k, ...)maintains a heap of size k, processing all unique words. Cost: O(m log k) where m is unique words (m is at most n, usually much less). - Total: O(n + m log k) which is better than O(n log n) from full sorting when k is much smaller than n.
Why dict/Counter is the right choice:
- Need to map each word to its count — dict provides O(1) get/set.
- A list of tuples would require O(n) scanning to find existing entries.
- A set cannot store counts — only membership.
def top_k_frequent(words, k):
"""Return the k most frequent words, sorted by frequency (descending).
Break ties alphabetically.
Must run in O(n log k) time or better.
"""
# TODO: Use a dict for counting, then extract top k
passExpected Output
['the', 'is', 'sunny']Hints
Hint 1: Use a dict (or collections.Counter) to count word frequencies in O(n). Then extract the top k.
Hint 2: heapq.nlargest(k, items, key=...) runs in O(n log k) which is better than sorting all items when k is small.
The naive approach checks all k elements in each window position, giving O(n*k). Implement an O(n) solution using a deque to maintain a monotonic decreasing queue of candidates.
# SLOW VERSION — O(n*k)
def max_sliding_window_slow(nums, k):
return [max(nums[i:i+k]) for i in range(len(nums) - k + 1)]
nums = [1, 3, -1, -3, 5, 3, 6, 7]
k = 3
print(max_sliding_window_slow(nums, k))Solution
from collections import deque
def max_sliding_window(nums, k):
dq = deque() # Stores indices, values at indices are decreasing
result = []
for i, num in enumerate(nums):
# Remove indices outside the window
while dq and dq[0] <= i - k:
dq.popleft() # O(1)
# Remove indices whose values are smaller than current
# (they can never be max while current element exists in window)
while dq and nums[dq[-1]] <= num:
dq.pop() # O(1)
dq.append(i) # O(1)
# Window is fully formed once we've seen k elements
if i >= k - 1:
result.append(nums[dq[0]]) # Front of deque is always the max
return result
nums = [1, 3, -1, -3, 5, 3, 6, 7]
k = 3
print(max_sliding_window(nums, k))
Why this is O(n):
- Each index is added to the deque exactly once and removed at most once. Total operations across all iterations: O(n) additions + O(n) removals = O(2n) = O(n).
deque.popleft(),deque.pop(), anddeque.append()are all O(1).- The naive approach calls
max()on k elements per window: O(k) per position times O(n) positions = O(n*k).
Why deque and not list:
- We need O(1) removal from both ends.
list.pop(0)would be O(n) — shifting all remaining elements.deque.popleft()is O(1). - This is a textbook case where deque's O(1) front operations are essential for the algorithm's overall complexity.
from collections import deque
def max_sliding_window(nums, k):
"""Return the maximum value in each sliding window of size k.
Must run in O(n) time, not O(n*k).
"""
# TODO: Use a deque to track indices of potential maximums
passExpected Output
[3, 3, 5, 5, 6, 7]Hints
Hint 1: Use a deque to store indices (not values) of elements that could be the maximum of some future window. Keep the deque in decreasing order of values.
Hint 2: Before adding a new index, pop all indices from the right whose values are smaller than the current element — they can never be a window maximum.
Hint 3: Before reading the max, pop indices from the left that are outside the current window (index less than or equal to i - k).
Predict when list resizes occur during 20 consecutive appends. Track the byte size changes and explain why the amortized cost is O(1) despite individual resize operations being O(n).
import sys
lst = []
prev_size = sys.getsizeof(lst)
resize_count = 0
for i in range(20):
lst.append(i)
curr = sys.getsizeof(lst)
if curr != prev_size:
resize_count += 1
# Each pointer is 8 bytes on 64-bit; base overhead is 56 bytes
approx_capacity = (curr - 56) // 8
print(f"Resize at length {len(lst)}: {prev_size} -> {curr} bytes (capacity ~{approx_capacity})")
prev_size = curr
print(f"Total resizes in 20 appends: {resize_count}")
print(f"Amortized cost per append: O(1)")Solution
Resize at length 1: 56 -> 88 bytes (capacity ~4)
Resize at length 5: 88 -> 120 bytes (capacity ~8)
Resize at length 9: 120 -> 184 bytes (capacity ~16)
Resize at length 17: 184 -> 248 bytes (capacity ~25)
Total resizes in 20 appends: 4
Amortized cost per append: O(1)
Amortized analysis explanation:
- Resizes happen at lengths 1, 5, 9, 17 — exponentially spaced intervals.
- At each resize, Python copies all existing elements to a new, larger array.
- Copy costs: 0 + 4 + 8 + 16 = 28 total elements copied across all 4 resizes.
- Non-resize appends: 16 appends at O(1) each.
- Total work for 20 appends: 28 (resize copies) + 20 (individual writes) = 48 operations.
- Average per append: 48/20 = 2.4 = O(1).
The geometric series argument:
- If capacity doubles each time, resize costs form a geometric series: 1 + 2 + 4 + 8 + ... + n/2 = n - 1.
- Total work for n appends = n (individual writes) + (n - 1) (total resize copies) = O(n).
- Amortized per append = O(n)/n = O(1).
Expected Output
Resize at length 1: 56 -> 88 bytes (capacity ~4)
Resize at length 5: 88 -> 120 bytes (capacity ~8)
Resize at length 9: 120 -> 184 bytes (capacity ~16)
Resize at length 17: 184 -> 248 bytes (capacity ~25)
Total resizes in 20 appends: 4
Amortized cost per append: O(1)Hints
Hint 1: Python list over-allocates by roughly 12.5% plus a small constant. The capacity jumps are approximately: 0, 4, 8, 16, 25, 35, 46, ...
Hint 2: Count the total elements copied across all resizes. The geometric growth means total copy cost is O(n) for n appends, giving O(1) amortized per append.
Hard
Implement an LRU (Least Recently Used) cache where both get and put operations run in O(1) time. This requires combining a hash map for O(1) lookup with an ordered structure for O(1) eviction.
# Test case
cache = None # Replace with LRUCache(2)
# cache.put(1, 1)
# cache.put(2, 2)
# print(f"get(1) = {cache.get(1)}") # 1
# print(f"get(2) = {cache.get(2)}") # 2
# cache.put(3, 3) # evicts key 1 (LRU)
# print(f"get(1) after adding 3 = {cache.get(1)}") # -1 (evicted)
# print(f"get(2) after adding 3 = {cache.get(2)}") # 2
# print(f"get(3) = {cache.get(3)}") # 3Solution
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = OrderedDict() # Maintains insertion/access order
def get(self, key):
if key not in self.cache: # O(1) hash lookup
return -1
self.cache.move_to_end(key) # O(1) mark as most recently used
return self.cache[key] # O(1)
def put(self, key, value):
if key in self.cache: # O(1) hash lookup
self.cache.move_to_end(key) # O(1) update order
self.cache[key] = value # O(1) insert/update
if len(self.cache) > self.capacity:
self.cache.popitem(last=False) # O(1) evict LRU (front)
cache = LRUCache(2)
cache.put(1, 1)
cache.put(2, 2)
print(f"get(1) = {cache.get(1)}")
print(f"get(2) = {cache.get(2)}")
cache.put(3, 3) # evicts key 1
print(f"get(1) after adding 3 = {cache.get(1)}")
print(f"get(2) after adding 3 = {cache.get(2)}")
print(f"get(3) = {cache.get(3)}")
Complexity analysis for every operation:
| Operation | Method | Complexity |
|---|---|---|
| Key lookup | key in self.cache | O(1) avg — hash table |
| Move to end | move_to_end(key) | O(1) — internal doubly-linked list pointer update |
| Insert/update | self.cache[key] = value | O(1) avg — hash table |
| Evict LRU | popitem(last=False) | O(1) — remove from front of internal linked list |
| Length check | len(self.cache) | O(1) — stored in struct |
Why OrderedDict works:
OrderedDictcombines a hash table (for O(1) key lookup) with a doubly-linked list (for O(1) order manipulation).- A plain
dictpreserves insertion order in Python 3.7+ but has nomove_to_end()— you cannot efficiently re-order entries. - A list-based approach would require O(n) to find and move elements.
from collections import OrderedDict
class LRUCache:
"""Least Recently Used cache with O(1) get and put.
When the cache exceeds capacity, evict the least recently used item.
"""
def __init__(self, capacity):
# TODO: Choose data structures for O(1) operations
pass
def get(self, key):
"""Return value if key exists (and mark as recently used), else -1."""
# TODO
pass
def put(self, key, value):
"""Insert or update key-value. Evict LRU item if over capacity."""
# TODO
passExpected Output
get(1) = 1
get(2) = 2
get(1) after adding 3 = 1
get(2) after adding 3 = -1
get(3) = 3Hints
Hint 1: You need O(1) key lookup (dict) combined with O(1) order tracking (doubly-linked list or OrderedDict). A plain dict does not track access order.
Hint 2: OrderedDict.move_to_end(key) is O(1) and marks a key as most recently used. popitem(last=False) evicts the least recently used in O(1).
Hint 3: On get: if key exists, move_to_end then return value. On put: if key exists, update and move_to_end. If new and at capacity, popitem(last=False) first.
Write a benchmarking function that empirically proves the time complexity of an operation by measuring execution time at multiple input sizes and showing the growth factor. Use it to verify that list membership is O(n) and set membership is O(1).
import time
# Demonstrate: timing list vs set membership at 3 sizes
sizes = [1_000, 10_000, 100_000]
print("--- List membership (expect O(n) — linear growth) ---")
prev_time = None
print(f"{'n':>8} {'time_ms':>12} {'growth':>8}")
for n in sizes:
data = list(range(n))
target = n - 1
start = time.perf_counter()
for _ in range(1000):
_ = target in data
elapsed = (time.perf_counter() - start) / 1000 * 1000
growth = f"~{elapsed/prev_time:.0f}x" if prev_time else "--"
print(f"{n:>8,} {elapsed:>10.2f}ms {growth:>8}")
prev_time = elapsedSolution
import time
def benchmark_operation(setup_fn, test_fn, sizes, label=""):
print(f"--- {label} ---")
print(f"{'n':>8} {'time_ms':>12} {'growth':>8}")
prev_time = None
for n in sizes:
data = setup_fn(n)
start = time.perf_counter()
for _ in range(1000):
test_fn(data)
elapsed = (time.perf_counter() - start) / 1000 * 1000
growth = f"~{elapsed/prev_time:.0f}x" if prev_time else "--"
print(f"{n:>8,} {elapsed:>10.2f}ms {growth:>8}")
prev_time = elapsed
sizes = [1_000, 10_000, 100_000]
benchmark_operation(
setup_fn=lambda n: list(range(n)),
test_fn=lambda data: (len(data) - 1) in data,
sizes=sizes,
label="List membership (expect O(n) — linear growth)"
)
print()
benchmark_operation(
setup_fn=lambda n: set(range(n)),
test_fn=lambda data: (max(data)) in data,
sizes=sizes,
label="Set membership (expect O(1) — constant)"
)
How to interpret the growth factor:
| Observed Growth (10x input) | Implied Complexity |
|---|---|
| ~1x | O(1) — constant |
| ~3.3x | O(log n) — logarithmic |
| ~10x | O(n) — linear |
| ~33x | O(n log n) — linearithmic |
| ~100x | O(n squared) — quadratic |
Key insights:
- A single timing measurement is unreliable due to OS scheduling, cache effects, and garbage collection. Averaging over 1000+ runs smooths out noise.
- The growth factor is the key signal: if doubling input consistently doubles time, the operation is O(n). If time stays flat, it is O(1).
time.perf_counter()is more precise thantime.time()— it uses the highest-resolution clock available on your OS.- For production profiling, use the
timeitmodule which handles warmup, repetition, and GC disabling automatically.
import time
def benchmark_operation(setup_fn, test_fn, sizes):
"""Benchmark test_fn at different input sizes.
setup_fn(n) -> returns the data structure to test on
test_fn(data) -> performs the operation to benchmark
sizes -> list of input sizes to test
Print a table showing size, time, and growth factor.
"""
# TODO: For each size, run setup_fn, then time test_fn
# Print results showing how time scales with input size
passExpected Output
--- List membership (expect O(n) — linear growth) ---
n time_ms growth
1,000 0.03ms --
10,000 0.24ms ~10x
100,000 2.40ms ~10x
--- Set membership (expect O(1) — constant) ---
n time_ms growth
1,000 0.00ms --
10,000 0.00ms ~1x
100,000 0.00ms ~1xHints
Hint 1: To get reliable timing, run the operation many times (e.g., 1000 repetitions) and take the average. Single runs are too noisy.
Hint 2: Calculate the growth factor as time(n) / time(n_prev). For O(n) operations, 10x input should give ~10x time. For O(1), the ratio should stay near 1x.
You have a stream of log events. Design a pipeline that deduplicates events within a time window, counts actions, and tracks unique users — all in O(n) time. Choose the right data structure for each sub-task and justify your choices.
# Sample event stream
events = [
("user1", "click", 1.0),
("user1", "click", 1.5), # duplicate — same user+action within 2s
("user2", "click", 2.0),
("user1", "purchase", 3.5),
("user3", "view", 4.0),
("user1", "click", 4.5), # NOT duplicate — 3.5s since last click by user1
]
# Expected after processing with window_size=2.0:
# Deduplicated events: 4 (second click by user1 is a dup)
# Action counts: click=2, purchase=1, view=1 (user1 click at 4.5 counted since outside window)
# Unique users: 3Solution
from collections import Counter
def process_log_stream(events, window_size):
# dict: (user_id, action) -> last_timestamp — O(1) lookup/update
last_seen = {}
# Counter (dict subclass): action -> count — O(1) increment
action_counts = Counter()
# set: unique user IDs — O(1) add/membership
unique_users = set()
processed = 0
for user_id, action, timestamp in events: # O(n) iterations
key = (user_id, action)
# Dedup check: O(1) dict lookup
if key in last_seen and (timestamp - last_seen[key]) < window_size:
continue # Skip duplicate
# Update last seen: O(1) dict write
last_seen[key] = timestamp
# Count action: O(1) Counter increment
action_counts[action] += 1
# Track unique users: O(1) set add
unique_users.add(user_id)
processed += 1
return action_counts, len(unique_users), processed
events = [
("user1", "click", 1.0),
("user1", "click", 1.5),
("user2", "click", 2.0),
("user1", "purchase", 3.5),
("user3", "view", 4.0),
("user1", "click", 4.5),
]
counts, unique, processed = process_log_stream(events, window_size=2.0)
action_str = ", ".join(f"{a}={c}" for a, c in counts.items())
print(f"Action counts: {action_str}")
print(f"Unique users: {unique}")
print(f"Processed (after dedup): {processed} out of {len(events)} events")
Data structure choices and their complexity justification:
| Sub-task | Data Structure | Why | Per-operation Cost |
|---|---|---|---|
| Dedup tracking | dict (key: tuple, value: timestamp) | Need O(1) lookup by composite key + O(1) update | O(1) avg |
| Action counting | Counter (dict subclass) | Need O(1) increment by action name | O(1) avg |
| Unique user tracking | set | Need O(1) add + implicit dedup (set ignores duplicates) | O(1) avg |
Total complexity: O(n) — one pass through events, all operations inside the loop are O(1) average.
Space complexity: O(k) where k = number of unique (user, action) pairs + unique actions + unique users. Since k is at most n but typically much less, this is efficient.
Why NOT these alternatives:
- List for dedup:
(user_id, action) in seen_listwould be O(k) per check, making the whole pipeline O(n*k). - Sorting for dedup: Sorting by (user, action, time) would be O(n log n) — worse than the O(n) dict approach.
- Nested dicts:
last_seen[user_id][action]adds complexity without improving Big-O. A flat dict with tuple keys is simpler and equally fast.
from collections import deque, Counter
def process_log_stream(events, window_size):
"""Process a stream of log events efficiently.
For each event (user_id, action, timestamp):
1. Deduplicate: skip if same (user_id, action) seen within last window_size seconds
2. Count unique users per action
3. Return (action_counts, total_unique_users, total_processed)
Must run in O(n) time where n = len(events).
"""
# TODO: Choose optimal data structures for each task
passExpected Output
Action counts: click=2, purchase=1, view=1
Unique users: 3
Processed (after dedup): 4 out of 6 eventsHints
Hint 1: Use a dict mapping (user_id, action) to the latest timestamp for O(1) dedup checks. An event is a duplicate if the same key was seen within window_size seconds.
Hint 2: Use a Counter (dict subclass) for action counts and a set for tracking unique users — both provide O(1) operations.
Hint 3: Every operation inside the loop must be O(1) average. If any operation is O(n), the whole pipeline becomes O(n squared).
