Python Data Structure Selection Strategy: Practice Problems & Exercises
Practice: Data Structure Selection Strategy
← Back to lessonEasy
You receive a list of 5,000 banned usernames. Given a list of 10 usernames to check, determine which ones are banned. The naive approach uses in on the list directly. Refactor to use a set for O(1) lookups.
# Setup — do not modify
banned_list = [f"bad_user_{i}" for i in range(5000)]
usernames_to_check = ["bad_user_42", "good_user", "bad_user_4999",
"alice", "bad_user_0", "bob",
"bad_user_2500", "carol", "bad_user_100", "dave"]
# Step 1: Convert banned_list to a set for O(1) lookups
banned_set = set(banned_list)
# Step 2: Check membership using the set
found_list = "bad_user_42" in banned_list # O(n) — slow
found_set = "bad_user_42" in banned_set # O(1) — fast
print(f"Found via list: {found_list}")
print(f"Found via set: {found_set}")
print(f"Banned count: {len(banned_set)}")Solution
banned_list = [f"bad_user_{i}" for i in range(5000)]
usernames_to_check = ["bad_user_42", "good_user", "bad_user_4999",
"alice", "bad_user_0", "bob",
"bad_user_2500", "carol", "bad_user_100", "dave"]
banned_set = set(banned_list)
found_list = "bad_user_42" in banned_list
found_set = "bad_user_42" in banned_set
print(f"Found via list: {found_list}")
print(f"Found via set: {found_set}")
print(f"Banned count: {len(banned_set)}")
Why this matters:
"bad_user_42" in banned_listscans up to 5,000 elements — O(n)."bad_user_42" in banned_sethashes the string and checks a bucket — O(1) average.- If you check 10 usernames against a list of 5,000: 10 x 5,000 = 50,000 comparisons.
- With a set: one-time O(5,000) conversion + 10 x O(1) lookups = ~5,010 operations.
- Rule of thumb: If you check membership more than once against the same collection, convert to a set first.
Expected Output
Found via list: True\nFound via set: True\nBanned count: 5000Hints
Hint 1: A list membership check (`x in my_list`) scans every element — O(n) per check.
Hint 2: Converting the list to a set gives O(1) average membership checks. The one-time conversion is O(n), but each subsequent lookup is O(1).
Given a log stream with repeated event types, produce a deduplicated list that preserves the order of first appearance. Do NOT use a loop — find a one-liner using dict.fromkeys().
log_events = ["error", "warning", "info", "error", "debug",
"warning", "info", "critical", "error", "debug"]
# One-liner: use dict.fromkeys to dedup while preserving order
deduped = list(dict.fromkeys(log_events))
print(f"Deduped: {deduped}")Solution
log_events = ["error", "warning", "info", "error", "debug",
"warning", "info", "critical", "error", "debug"]
deduped = list(dict.fromkeys(log_events))
print(f"Deduped: {deduped}")
Why dict.fromkeys beats alternatives:
set(log_events):
- Removes duplicates but LOSES order (hash-based, not insertion-ordered)
- Output could be: {'debug', 'critical', 'error', 'info', 'warning'}
Manual loop with seen-set:
seen = set()
result = []
for e in log_events:
if e not in seen:
seen.add(e)
result.append(e)
- Works, preserves order, but 5 lines for what should be trivial
dict.fromkeys(log_events):
- Keys are unique (duplicates silently dropped)
- dict preserves insertion order (Python 3.7+ guaranteed)
- One line, O(n) time, O(k) space where k = unique items
When to use each:
- Need unique items, order doesn't matter:
set() - Need unique items, order matters:
dict.fromkeys()orlist(dict.fromkeys()) - Need unique items with extra logic per item: manual loop with
seenset
Expected Output
Deduped: ['error', 'warning', 'info', 'debug', 'critical']Hints
Hint 1: `set()` removes duplicates but does NOT preserve order.
Hint 2: `dict.fromkeys(sequence)` preserves insertion order (Python 3.7+) and silently drops duplicate keys. The values are all `None`, but `list(dict.fromkeys(seq))` gives you the unique items in original order.
Given a list of (name, department) tuples, group employees by department. Use defaultdict(list) so you never need to check if a key exists before appending.
from collections import defaultdict
employees = [
("Alice", "engineering"),
("Bob", "marketing"),
("Carol", "engineering"),
("Dave", "design"),
("Eve", "design"),
]
# Group by department using defaultdict
groups = defaultdict(list)
for name, dept in employees:
groups[dept].append(name)
# Print each department
for dept in ["engineering", "marketing", "design"]:
print(f"{dept}: {groups[dept]}")Solution
from collections import defaultdict
employees = [
("Alice", "engineering"),
("Bob", "marketing"),
("Carol", "engineering"),
("Dave", "design"),
("Eve", "design"),
]
groups = defaultdict(list)
for name, dept in employees:
groups[dept].append(name)
for dept in ["engineering", "marketing", "design"]:
print(f"{dept}: {groups[dept]}")
defaultdict vs regular dict:
Regular dict approach (verbose, error-prone):
groups = {}
for name, dept in employees:
if dept not in groups: # Must check every time
groups[dept] = []
groups[dept].append(name)
defaultdict approach (clean):
groups = defaultdict(list)
for name, dept in employees:
groups[dept].append(name) # Auto-creates [] for new keys
Key insight: defaultdict(list) is the canonical pattern for one-to-many grouping in Python. The factory function (list) is called whenever a missing key is accessed, creating an empty list automatically.
Other common factories:
defaultdict(int)— counting (default 0)defaultdict(set)— unique groupingdefaultdict(lambda: "unknown")— custom default value
Expected Output
engineering: ['Alice', 'Carol']\nmarketing: ['Bob']\ndesign: ['Dave', 'Eve']Hints
Hint 1: `defaultdict(list)` auto-creates an empty list for any new key, so `groups[key].append(value)` never raises a `KeyError`.
Hint 2: A regular `dict` would require: `if key not in d: d[key] = []` before every append — error-prone and verbose.
Implement a sliding window of size 3 over a stream of numbers. Use deque(maxlen=3) to automatically maintain the window. Compute the moving average at each step.
from collections import deque
stream = [1, 2, 3, 4, 5]
window = deque(maxlen=3)
averages = []
for val in stream:
window.append(val)
print(f"After {val}: {list(window)}")
averages.append(sum(window) / len(window))
print(f"Moving averages: {averages}")Solution
from collections import deque
stream = [1, 2, 3, 4, 5]
window = deque(maxlen=3)
averages = []
for val in stream:
window.append(val)
print(f"After {val}: {list(window)}")
averages.append(sum(window) / len(window))
print(f"Moving averages: {averages}")
Why deque(maxlen=k) beats a list for sliding windows:
List approach:
window = []
window.append(val)
if len(window) > k:
window.pop(0) # O(n) — shifts all elements left!
deque approach:
window = deque(maxlen=k)
window.append(val) # O(1) — auto-evicts oldest if full
list.pop(0)is O(n) because every remaining element must shift one position left.deque.append()withmaxlenis O(1) because the deque is backed by a doubly-linked list (or ring buffer in CPython) — the oldest element is simply unlinked.- For a stream of 1 million values with window size 1,000: list approach does ~1 billion shifts; deque does ~1 million constant-time appends.
Expected Output
After 1: [1]\nAfter 2: [1, 2]\nAfter 3: [1, 2, 3]\nAfter 4: [2, 3, 4]\nAfter 5: [3, 4, 5]\nMoving averages: [1.0, 1.5, 2.0, 3.0, 4.0]Hints
Hint 1: `deque(maxlen=k)` automatically evicts the oldest element when a new one is appended beyond capacity — no manual bookkeeping needed.
Hint 2: For the moving average, compute `sum(window) / len(window)` after each append. Early windows (before the deque is full) will have fewer than `k` elements.
Medium
The following function finds duplicate items in a list but runs in O(n squared) time. Refactor it to O(n) using Counter and a set. Preserve the order of first occurrence.
from collections import Counter
def find_duplicates(lst):
"""Find all items appearing more than once. Preserve first-occurrence order. O(n) time."""
counts = Counter(lst)
seen = set()
result = []
for item in lst:
if counts[item] > 1 and item not in seen:
result.append(item)
seen.add(item)
return result
# Test
names = ["alice", "bob", "charlie", "alice", "dave",
"charlie", "eve", "bob", "alice"]
print(f"Duplicates: {find_duplicates(names)}")
print(f"Time complexity: O(n)")Solution
from collections import Counter
def find_duplicates(lst):
"""Find all items appearing more than once. Preserve first-occurrence order. O(n) time."""
counts = Counter(lst) # O(n) — one pass to count everything
seen = set()
result = []
for item in lst: # O(n) — one pass to collect in order
if counts[item] > 1 and item not in seen: # O(1) dict + O(1) set
result.append(item)
seen.add(item)
return result
names = ["alice", "bob", "charlie", "alice", "dave",
"charlie", "eve", "bob", "alice"]
print(f"Duplicates: {find_duplicates(names)}")
print(f"Time complexity: O(n)")
The naive O(n squared) version and why it fails:
# SLOW — do not use at scale
def find_dups_naive(lst):
result = []
for i, item in enumerate(lst):
if item in lst[i+1:]: # O(n) slice + O(n) scan per iteration
if item not in result: # O(n) scan of result list
result.append(item)
return result
Complexity comparison:
| Operation | Naive | Refactored |
|---|---|---|
| Count occurrences | O(n) per item x n items = O(n squared) | Counter: O(n) single pass |
| Check "already added" | item not in result — O(n) list scan | item not in seen — O(1) set check |
| Total | O(n squared) | O(n) |
For n = 100,000 items: naive does ~10 billion operations; refactored does ~200,000.
Expected Output
Duplicates: ['alice', 'charlie', 'bob']\nTime complexity: O(n)Hints
Hint 1: The bottleneck in the naive version is `item in lst[i+1:]` inside a loop — that is O(n) per iteration, O(n squared) total.
Hint 2: Use `Counter` for a single O(n) pass to count occurrences, then collect items with count > 1 in a second O(n) pass. Use a `seen` set to preserve first-occurrence order without duplicates.
Given a text document, find the top 3 most frequent words (excluding stop words). Use Counter.most_common() instead of sorting the entire frequency table.
from collections import Counter
import re
text = """
Python is the language of data science. Python handles data processing,
data analysis, and data visualization. Engineers write Python code to
build systems. Clean code in Python makes Python the top choice.
Python code is readable code.
"""
stop_words = {"is", "the", "of", "and", "to", "in", "a", "an", "it", "for"}
# Step 1: Tokenize and filter
words = re.findall(r'\b\w+\b', text.lower())
filtered = [w for w in words if w not in stop_words]
# Step 2: Count and get top 3
counts = Counter(filtered)
top_3 = counts.most_common(3)
print(f"Top 3 words: {top_3}")
print(f"Used heapq internally: O(k + n*log(k))")Solution
from collections import Counter
import re
text = """
Python is the language of data science. Python handles data processing,
data analysis, and data visualization. Engineers write Python code to
build systems. Clean code in Python makes Python the top choice.
Python code is readable code.
"""
stop_words = {"is", "the", "of", "and", "to", "in", "a", "an", "it", "for"}
words = re.findall(r'\b\w+\b', text.lower())
filtered = [w for w in words if w not in stop_words]
counts = Counter(filtered)
top_3 = counts.most_common(3)
print(f"Top 3 words: {top_3}")
print(f"Used heapq internally: O(k + n*log(k))")
Why most_common(k) beats sorting:
Approach 1 — Full sort:
sorted(counts.items(), key=lambda x: x[1], reverse=True)[:3]
Time: O(m * log(m)) where m = unique words
Sorts ALL items just to get top 3
Approach 2 — most_common(k):
counts.most_common(3)
Time: O(m + 3 * log(m)) ≈ O(m)
Uses heapq.nlargest internally — maintains a heap of size k
For m = 100,000 unique words:
Full sort: 100,000 * 17 ≈ 1.7M comparisons
most_common: 100,000 + 3*17 ≈ 100,051 operations
Speedup: ~17x
Common mistake: Using sorted() when you only need the top few items. most_common(k) is specifically optimized for this pattern.
Expected Output
Top 3 words: [('python', 6), ('data', 4), ('code', 3)]\nUsed heapq internally: O(k + n*log(k))Hints
Hint 1: `Counter.most_common(k)` uses `heapq.nlargest` internally — it does NOT sort the entire dictionary.
Hint 2: The complexity is O(n) to build the Counter + O(k + m*log(k)) for `most_common(k)` where m is the number of unique items. This beats O(m*log(m)) from sorting all items.
Implement both a FIFO queue and a LIFO stack using deque. Process three tasks through each and show the different ordering. Explain why list.pop(0) is the wrong choice for a queue.
from collections import deque
tasks = ["task_1", "task_2", "task_3"]
# FIFO Queue — first in, first out
queue = deque()
for t in tasks:
queue.append(t)
queue_order = []
while queue:
queue_order.append(queue.popleft()) # O(1) — remove from left
# LIFO Stack — last in, first out
stack = deque()
for t in tasks:
stack.append(t)
stack_order = []
while stack:
stack_order.append(stack.pop()) # O(1) — remove from right
print(f"Queue (FIFO): {', '.join(queue_order)}")
print(f"Stack (LIFO): {', '.join(stack_order)}")
print(f"Deque is O(1) for both ends")Solution
from collections import deque
tasks = ["task_1", "task_2", "task_3"]
# FIFO Queue — first in, first out
queue = deque()
for t in tasks:
queue.append(t)
queue_order = []
while queue:
queue_order.append(queue.popleft())
# LIFO Stack — last in, first out
stack = deque()
for t in tasks:
stack.append(t)
stack_order = []
while stack:
stack_order.append(stack.pop())
print(f"Queue (FIFO): {', '.join(queue_order)}")
print(f"Stack (LIFO): {', '.join(stack_order)}")
print(f"Deque is O(1) for both ends")
Why deque beats list for queues:
Operation | list | deque
-----------------+-------------+---------
append (right) | O(1) amort | O(1)
pop (right) | O(1) | O(1)
pop(0) / popleft | O(n) !! | O(1)
insert(0, x) | O(n) !! | O(1)
list.pop(0)shifts every remaining element one position left — O(n) per call.- Processing n items through a list-based queue: O(n squared) total.
deque.popleft()is O(1) — the deque is implemented as a doubly-linked list of fixed-size blocks.- Rule: Use
dequeany time you need to add or remove from both ends.
Expected Output
Queue (FIFO): task_1, task_2, task_3\nStack (LIFO): task_3, task_2, task_1\nDeque is O(1) for both endsHints
Hint 1: A queue (FIFO) processes items in the order they arrived: `append()` to add, `popleft()` to remove. A stack (LIFO) processes the most recent item first: `append()` to add, `pop()` to remove.
Hint 2: Using a list for a queue is a performance trap: `list.pop(0)` is O(n) because it shifts all elements. `deque.popleft()` is O(1).
Given a list of numbers and a target sum, find two numbers that add up to the target. Return both the values and their indices. Use a dict for O(n) time instead of the O(n squared) brute-force approach.
def two_sum(nums, target):
"""Find two numbers that sum to target. Return (val1, val2, idx1, idx2) or None."""
seen = {} # value -> index
for i, num in enumerate(nums):
complement = target - num
if complement in seen: # O(1) dict lookup
j = seen[complement]
return (complement, num, j, i)
seen[num] = i # Store for future lookups
return None
# Test
numbers = [11, 2, 15, 7, 3]
target = 9
result = two_sum(numbers, target)
if result:
val1, val2, idx1, idx2 = result
print(f"Pair found: ({val1}, {val2}) at indices ({idx1}, {idx2})")
print(f"Time complexity: O(n)")Solution
def two_sum(nums, target):
"""Find two numbers that sum to target. Return (val1, val2, idx1, idx2) or None."""
seen = {}
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
j = seen[complement]
return (complement, num, j, i)
seen[num] = i
return None
numbers = [11, 2, 15, 7, 3]
target = 9
result = two_sum(numbers, target)
if result:
val1, val2, idx1, idx2 = result
print(f"Pair found: ({val1}, {val2}) at indices ({idx1}, {idx2})")
print(f"Time complexity: O(n)")
Brute-force vs dict approach:
Brute-force:
for i in range(len(nums)):
for j in range(i+1, len(nums)):
if nums[i] + nums[j] == target: ...
Time: O(n^2) — checks every pair
Space: O(1)
Dict approach:
For each num, ask: "Is (target - num) in my dict?"
Time: O(n) — single pass, O(1) lookup each
Space: O(n) — stores up to n entries
This is the classic space-time tradeoff: spend O(n) extra memory to reduce time from O(n squared) to O(n). The dict acts as a lookup table that turns a nested search into a single-pass scan.
Interview note: This is the most-asked coding interview question (LeetCode #1). The interviewer wants to see you reach for a hash map, not nested loops.
Expected Output
Pair found: (2, 7) at indices (1, 3)\nTime complexity: O(n)Hints
Hint 1: The brute-force approach checks every pair — O(n squared). The key insight: for each number `x`, you need to find if `target - x` exists somewhere in the array.
Hint 2: Build a dict mapping each number to its index as you iterate. For each new number, check if `target - num` is already in the dict. This gives O(1) lookup per element.
Hard
Implement an LRU (Least Recently Used) cache with O(1) get and O(1) put using OrderedDict. The cache should have a fixed capacity and evict the least recently used item when full. After the operations below, "b" should be evicted because it was least recently used when "d" was inserted.
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self._cache = OrderedDict()
def get(self, key):
if key not in self._cache:
return None
self._cache.move_to_end(key) # Mark as most recently used
return self._cache[key]
def put(self, key, value):
if key in self._cache:
self._cache.move_to_end(key)
self._cache[key] = value
if len(self._cache) > self.capacity:
self._cache.popitem(last=False) # Evict least recently used
def items(self):
return list(self._cache.items())
# Test sequence
cache = LRUCache(capacity=3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
# State: a -> b -> c (a is least recent)
cache.get("a")
# State: b -> c -> a (b is now least recent)
cache.put("d", 4)
# Capacity exceeded — evict "b" (least recent)
# State: c -> a -> d
print(f"cache.get('a'): {cache.get('a')}")
print(f"cache.get('b'): {cache.get('b')}")
print(f"cache.get('c'): {cache.get('c')}")
print(f"cache.get('d'): {cache.get('d')}")
print(f"Final state: {cache.items()}")Solution
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self._cache = OrderedDict()
def get(self, key):
if key not in self._cache:
return None
self._cache.move_to_end(key)
return self._cache[key]
def put(self, key, value):
if key in self._cache:
self._cache.move_to_end(key)
self._cache[key] = value
if len(self._cache) > self.capacity:
self._cache.popitem(last=False)
def items(self):
return list(self._cache.items())
cache = LRUCache(capacity=3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
cache.get("a")
cache.put("d", 4)
print(f"cache.get('a'): {cache.get('a')}")
print(f"cache.get('b'): {cache.get('b')}")
print(f"cache.get('c'): {cache.get('c')}")
print(f"cache.get('d'): {cache.get('d')}")
print(f"Final state: {cache.items()}")
Why OrderedDict and not dict + list:
dict + list approach:
get: dict[key] O(1) + list.remove(key) O(n) + list.append(key) O(1) = O(n)
put: same O(n) — list.remove is the bottleneck
OrderedDict approach:
get: dict lookup O(1) + move_to_end O(1) = O(1)
put: dict insert O(1) + move_to_end O(1) + popitem O(1) = O(1)
How OrderedDict achieves O(1) for move_to_end: Internally, it maintains a doubly-linked list alongside the hash table. Each entry has prev/next pointers. move_to_end unlinks the node and relinks it at the tail — pure pointer manipulation, no scanning.
Production note: In real applications, consider functools.lru_cache for function memoization or cachetools.LRUCache for a production-tested implementation. The manual version above is what interviewers expect you to build.
Expected Output
cache.get('a'): 1\ncache.get('b'): None\ncache.get('c'): 3\ncache.get('d'): 4\nFinal state: [('c', 3), ('a', 1), ('d', 4)]Hints
Hint 1: `OrderedDict.move_to_end(key)` moves a key to the end (most recent) in O(1). This is the critical operation that makes LRU possible without a separate linked list.
Hint 2: On `put`: if the key exists, update its value and move to end. If inserting a new key exceeds capacity, call `popitem(last=False)` to evict the least recently used (leftmost) item in O(1).
Build a three-layer configuration system using ChainMap: request overrides (highest priority), user settings, and global defaults (lowest priority). Demonstrate that changes to underlying dicts are immediately visible without rebuilding the ChainMap.
from collections import ChainMap
# Layer 3 (lowest priority): global defaults
defaults = {
"theme": "light",
"language": "en",
"page_size": 20,
"notifications": True,
"timezone": "UTC",
"currency": "USD",
}
# Layer 2: user settings (from DB)
user_settings = {"theme": "dark", "page_size": 50}
# Layer 1 (highest priority): per-request overrides
request_overrides = {}
# Build the ChainMap — searches left to right
config = ChainMap(request_overrides, user_settings, defaults)
# Test priority resolution
print(f"theme: {config['theme']}") # dark — user wins over default
print(f"language: {config['language']}") # en — falls through to default
# Test request override
request_overrides["page_size"] = 100
print(f"page_size (with override): {config['page_size']}") # 100 — override wins
# Clear overrides
request_overrides.clear()
print(f"page_size (after clear): {config['page_size']}") # 50 — back to user setting
# Live update — no rebuild needed
user_settings["notifications"] = False
print(f"notifications: {config['notifications']}") # False — immediate
print(f"Full config keys: {len(config)}")Solution
from collections import ChainMap
defaults = {
"theme": "light",
"language": "en",
"page_size": 20,
"notifications": True,
"timezone": "UTC",
"currency": "USD",
}
user_settings = {"theme": "dark", "page_size": 50}
request_overrides = {}
config = ChainMap(request_overrides, user_settings, defaults)
print(f"theme: {config['theme']}")
print(f"language: {config['language']}")
request_overrides["page_size"] = 100
print(f"page_size (with override): {config['page_size']}")
request_overrides.clear()
print(f"page_size (after clear): {config['page_size']}")
user_settings["notifications"] = False
print(f"notifications: {config['notifications']}")
print(f"Full config keys: {len(config)}")
Why ChainMap beats dict merging:
Dict merge approach:
config = {**defaults, **user_settings, **request_overrides}
- Creates a NEW dict with all keys copied — O(n) time + O(n) space
- NOT live: changes to user_settings are NOT reflected
- Must rebuild the merged dict every time any layer changes
ChainMap approach:
config = ChainMap(request_overrides, user_settings, defaults)
- Holds REFERENCES to the original dicts — O(1) construction
- LIVE: updating user_settings["theme"] is immediately visible via config
- No rebuild needed — just modify the underlying dict
- Lookup cost: O(k) where k = number of layers (typically 2-4)
When to use ChainMap:
- Layered configuration (env vars, config file, defaults)
- Scope chains (local, enclosing, global, builtin — Python's own LEGB rule)
- Template rendering with nested contexts
- Any scenario where you need priority-based lookup across multiple dicts that may change independently
Expected Output
theme: dark\nlanguage: en\npage_size (with override): 100\npage_size (after clear): 50\nnotifications: False\nFull config keys: 6Hints
Hint 1: `ChainMap` takes multiple dicts and searches them in order — first dict with the key wins. It does NOT merge or copy the dicts.
Hint 2: Because ChainMap holds references (not copies), updating the underlying dicts is immediately visible through the ChainMap. This makes it ideal for layered configuration where different layers can change independently.
Design a product analytics tracker that supports four operations efficiently. For each operation, choose the optimal data structure. Implement all four:
- Top-K categories — track which product categories are browsed most (use
Counter) - Unique products viewed — count distinct products a user has seen (use
set) - Recent views — keep the last 3 products viewed (use
deque) - Category index — group products by category for recommendations (use
defaultdict)
from collections import Counter, defaultdict, deque
class ProductTracker:
def __init__(self, recent_size=3):
self.category_counts = Counter() # Top-K categories
self.unique_products = set() # Unique product tracking
self.recent_views = deque(maxlen=recent_size) # Sliding recency window
self.category_index = defaultdict(list) # Product grouping by category
def track_view(self, product_id, category):
"""Record a product view — updates all four structures."""
self.category_counts[category] += 1
self.unique_products.add(product_id)
self.recent_views.append(product_id)
if product_id not in self.category_index[category]:
self.category_index[category].append(product_id)
def top_categories(self, k=3):
return self.category_counts.most_common(k)
def unique_count(self):
return len(self.unique_products)
def get_recent(self):
return list(self.recent_views)
def recommend_from_category(self, category):
return self.category_index.get(category, [])
# Simulate user session
tracker = ProductTracker(recent_size=3)
views = [
("prod_1", "electronics"), ("prod_2", "electronics"),
("prod_3", "books"), ("prod_4", "clothing"),
("prod_5", "electronics"), ("prod_1", "electronics"), # repeat
("prod_6", "books"), ("prod_7", "books"),
("prod_8", "electronics"), ("prod_9", "clothing"),
]
for pid, cat in views:
tracker.track_view(pid, cat)
print(f"Top 3 categories: {tracker.top_categories(3)}")
print(f"Unique products viewed: {tracker.unique_count()}")
print(f"Recent views: {tracker.get_recent()}")
print(f"Recommendations for electronics: {tracker.recommend_from_category('electronics')}")Solution
from collections import Counter, defaultdict, deque
class ProductTracker:
def __init__(self, recent_size=3):
self.category_counts = Counter()
self.unique_products = set()
self.recent_views = deque(maxlen=recent_size)
self.category_index = defaultdict(list)
def track_view(self, product_id, category):
self.category_counts[category] += 1
self.unique_products.add(product_id)
self.recent_views.append(product_id)
if product_id not in self.category_index[category]:
self.category_index[category].append(product_id)
def top_categories(self, k=3):
return self.category_counts.most_common(k)
def unique_count(self):
return len(self.unique_products)
def get_recent(self):
return list(self.recent_views)
def recommend_from_category(self, category):
return self.category_index.get(category, [])
tracker = ProductTracker(recent_size=3)
views = [
("prod_1", "electronics"), ("prod_2", "electronics"),
("prod_3", "books"), ("prod_4", "clothing"),
("prod_5", "electronics"), ("prod_1", "electronics"),
("prod_6", "books"), ("prod_7", "books"),
("prod_8", "electronics"), ("prod_9", "clothing"),
]
for pid, cat in views:
tracker.track_view(pid, cat)
print(f"Top 3 categories: {tracker.top_categories(3)}")
print(f"Unique products viewed: {tracker.unique_count()}")
print(f"Recent views: {tracker.get_recent()}")
print(f"Recommendations for electronics: {tracker.recommend_from_category('electronics')}")
Data structure selection rationale:
Feature | Structure | Why
-----------------------+--------------------+----------------------------------
Top-K categories | Counter | O(1) increment, O(k + m*log(k)) top-k
Unique product count | set | O(1) add + dedup, O(1) len()
Recent N views | deque(maxlen=N) | O(1) append, auto-evicts oldest
Category → products | defaultdict(list) | O(1) append, no KeyError on new key
The engineering lesson: Real systems rarely use a single data structure. Each sub-problem has its own access pattern, and the right design composes multiple structures — each chosen for its O(1) operation on the critical path. This tracker handles 10 views with 4 constant-time updates each = 40 total O(1) operations, regardless of data size.
What would go wrong with a single list?
- Top-K: scan entire list and count — O(n squared)
- Unique count:
len(set(all_products))every time — O(n) - Recent N: slice
[-3:]— O(1) but only if you keep everything in memory - Category lookup: filter entire list — O(n)
Expected Output
Top 3 categories: [('electronics', 5), ('books', 3), ('clothing', 2)]\nUnique products viewed: 7\nRecent views: ['prod_8', 'prod_7', 'prod_6']\nRecommendations for electronics: ['prod_2', 'prod_5', 'prod_8']Hints
Hint 1: Think about what operation each feature needs: counting uses `Counter`, uniqueness uses `set`, recency uses `deque(maxlen=k)`, grouping uses `defaultdict(list)`.
Hint 2: The key engineering skill is selecting the right structure for each sub-problem and composing them. No single data structure solves everything — the art is in the combination.
