Counter and Defaultdict - Smart Dictionaries That Handle Edge Cases
Reading time: ~18 minutes | Level: Foundation → Engineering
Here is a bug that ships to production regularly.
# Grouping users by country
users = [
{"name": "Alice", "country": "US"},
{"name": "Bob", "country": "UK"},
{"name": "Carol", "country": "US"},
]
groups = {}
for user in users:
groups[user["country"]].append(user["name"])
print(groups)
Run this and you get:
KeyError: 'US'
The first time Python sees "US", groups["US"] raises a KeyError because the key does not exist yet. Every developer who writes grouping code hits this at least once. The standard fix adds two lines of boilerplate per grouping operation - check if key exists, initialize, then append.
The correct tool eliminates this entirely and is faster in the process.
from collections import defaultdict
groups = defaultdict(list)
for user in users:
groups[user["country"]].append(user["name"])
print(dict(groups))
# {'US': ['Alice', 'Carol'], 'UK': ['Bob']}
No KeyError. No initialization check. No extra lines. This is what defaultdict and Counter are - purpose-built solutions to problems that recur constantly in real data pipelines.
What You Will Learn
- How
defaultdicteliminatesKeyErrorwith a default factory function - and what the factory actually does internally - The three core patterns:
defaultdict(list),defaultdict(int),defaultdict(set)- when and why each - How
defaultdictcompares todict.setdefault()anddict.get(k, default)- the precise tradeoffs - How
defaultdict(list)implements groupby-style aggregation in one pass - How
Counterworks as a specialized dict optimized for frequency counting - Why
Counter.most_common(n)usesheapqinternally - and what that means for performance - Counter arithmetic:
+,-,&,|operators for combining frequency data - The critical difference between
Counter.subtract()and the-operator - When NOT to use
defaultdict: how it can hide real bugs by silently returning defaults - Real-world patterns: word frequency, inventory management, vote counting, A/B test aggregation
Prerequisites
- Python dictionaries and hash tables (Lesson 06 - Dictionaries Deep Dive)
- Understanding of Python callables - functions are objects
- Basic Big-O notation
- Lesson 13 (deque and collections module overview) for context on the
collectionsmodule
Part 1 - defaultdict: Eliminating the KeyError Pattern
What defaultdict Does Internally
defaultdict is a subclass of dict. It adds a single attribute: default_factory - a callable that is invoked with no arguments whenever a missing key is accessed.
The CPython implementation in _collectionsmodule.c overrides __missing__:
# Pseudocode - what CPython does when you access a missing key
class defaultdict(dict):
def __init__(self, default_factory=None, *args, **kwargs):
super().__init__(*args, **kwargs)
self.default_factory = default_factory # store the callable
def __missing__(self, key):
if self.default_factory is None:
raise KeyError(key) # Behaves like plain dict
value = self.default_factory() # Call with NO arguments
self[key] = value # Store the result
self.key = value
return value
Key behavior: self.default_factory() is called with no arguments, the result is stored under the missing key, and then returned. The key is now present in the dict for all future accesses.
from collections import defaultdict
# The factory is a callable - called with no arguments
dd = defaultdict(list)
print(dd.default_factory) # <class 'list'>
# Access a missing key - factory is called
value = dd['new_key']
print(value) # [] ← list() was called
print(dd) # defaultdict(<class 'list'>, {'new_key': []})
print('new_key' in dd) # True ← key was stored!
# Second access - key now exists, factory NOT called again
value2 = dd['new_key']
print(value2 is value) # True ← same list object returned
The factory is called once per missing key, then the result is cached. This is why dd['key'].append(x) works: the first access creates a [] and stores it; subsequent accesses return the same list.
The Three Core Patterns
Pattern 1: defaultdict(list) - Grouping / One-to-Many Mapping
from collections import defaultdict
# Group transactions by user_id
transactions = [
(101, 250.00, "purchase"),
(202, 89.99, "purchase"),
(101, 15.00, "refund"),
(303, 120.00, "purchase"),
(202, 45.00, "purchase"),
(101, 300.00, "purchase"),
]
by_user = defaultdict(list)
for user_id, amount, txn_type in transactions:
by_user[user_id].append({'amount': amount, 'type': txn_type})
for user_id, txns in sorted(by_user.items()):
total = sum(t['amount'] for t in txns if t['type'] == 'purchase')
print(f"User {user_id}: {len(txns)} transactions, ${total:.2f} purchases")
# Output:
# User 101: 3 transactions, $550.00 purchases
# User 202: 2 transactions, $134.99 purchases
# User 303: 1 transactions, $120.00 purchases
Pattern 2: defaultdict(int) - Counting
from collections import defaultdict
# Count word frequencies - manual version
text = "the quick brown fox jumps over the lazy dog the fox"
words = text.split()
freq = defaultdict(int)
for word in words:
freq[word] += 1 # int() returns 0; 0 + 1 = 1 for first occurrence
print(sorted(freq.items(), key=lambda x: -x[1]))
# [('the', 3), ('fox', 2), ('quick', 1), ('brown', 1), ...]
Note: for pure counting, Counter is better (more features, often faster). defaultdict(int) is useful when you're doing arithmetic beyond simple counting.
Pattern 3: defaultdict(set) - Unique Values per Key
from collections import defaultdict
# Find all unique tags used by each author
blog_posts = [
("Alice", "python"),
("Bob", "python"),
("Alice", "data-science"),
("Alice", "python"), # Duplicate - set handles it
("Carol", "devops"),
("Bob", "data-science"),
]
author_tags = defaultdict(set)
for author, tag in blog_posts:
author_tags[author].add(tag)
for author, tags in sorted(author_tags.items()):
print(f"{author}: {sorted(tags)}")
# Output:
# Alice: ['data-science', 'python']
# Bob: ['data-science', 'python']
# Carol: ['devops']
Part 2 - defaultdict vs setdefault vs get
These three approaches solve similar problems but have distinct semantics. Choosing the wrong one creates subtle bugs.
Side-by-Side Comparison
data = [('a', 1), ('b', 2), ('a', 3), ('c', 4)]
# Method 1: dict.get() with explicit fallback
groups_get = {}
for k, v in data:
groups_get[k] = groups_get.get(k, [])
groups_get[k].append(v)
# Output: {'a': [1, 3], 'b': [2], 'c': [4]}
# Problem: get() returns a NEW [] but does not store it.
# The assignment groups_get[k] = groups_get.get(k, []) is needed each time.
# Easy to write incorrectly as: groups_get.get(k, []).append(v) ← BUG (list lost!)
# Method 2: dict.setdefault()
groups_sd = {}
for k, v in data:
groups_sd.setdefault(k, []).append(v)
# Output: {'a': [1, 3], 'b': [2], 'c': [4]}
# setdefault stores the default AND returns it - correct and concise.
# But: always constructs the default object even if key exists (minor waste).
# Method 3: defaultdict
from collections import defaultdict
groups_dd = defaultdict(list)
for k, v in data:
groups_dd[k].append(v)
# Output: {'a': [1, 3], 'b': [2], 'c': [4]}
# Cleanest. Factory called only when key is missing.
When to Use Each
dict.get(k, default):
✓ One-off read with fallback - no storage needed
✓ When you DON'T want to create the key if missing
✗ Grouping/mutating pattern (must reassign; easy to forget)
dict.setdefault(k, default):
✓ When you want get-or-create in a plain dict (no import needed)
✓ Backward compatible code, third-party API that expects dict
✗ Slightly wasteful - default object created even if key exists
✗ Less readable for grouping patterns
defaultdict(factory):
✓ Grouping, aggregation, adjacency lists - the idiomatic choice
✓ Factory only called when needed
✗ Accessing a missing key CREATES it (can hide bugs - see Part 4)
✗ Type is defaultdict, not dict (may matter for type-strict APIs)
The get() Bug That Bites Beginners
# Common bug: using get() for mutable accumulation
data = [('a', 1), ('b', 2), ('a', 3)]
groups = {}
for k, v in data:
groups.get(k, []).append(v) # BUG: list returned but never stored!
print(groups) # {} - empty! All appends went to throwaway lists.
# The fix - must assign:
groups = {}
for k, v in data:
lst = groups.get(k, [])
lst.append(v)
groups[k] = lst # Now store it
# Or simpler: use setdefault
groups = {}
for k, v in data:
groups.setdefault(k, []).append(v) # setdefault returns stored reference
Part 3 - defaultdict for Groupby-Style Aggregation
Building an Adjacency List for Graph Algorithms
from collections import defaultdict
def build_adjacency_list(edges):
"""
Build undirected graph from edge list.
defaultdict(list) is the standard representation.
"""
graph = defaultdict(list)
for u, v in edges:
graph[u].append(v)
graph[v].append(u) # Undirected
return graph
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
graph = build_adjacency_list(edges)
print(dict(graph))
# {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
# BFS using the adjacency list
def bfs(graph, start):
from collections import deque
visited = {start}
queue = deque([start])
order = []
while queue:
node = queue.popleft()
order.append(node)
for neighbor in graph[node]:
if neighbor not in visited:
visited.add(neighbor)
queue.append(neighbor)
return order
print(bfs(graph, 1)) # [1, 2, 3, 4, 5]
Multi-Level Grouping
from collections import defaultdict
# Group sales by (region, product_category)
sales = [
("West", "Electronics", 1500),
("East", "Clothing", 800),
("West", "Clothing", 600),
("East", "Electronics", 2100),
("West", "Electronics", 900),
("East", "Clothing", 400),
]
# Nested defaultdict - group by region, then by category
by_region_category = defaultdict(lambda: defaultdict(list))
for region, category, amount in sales:
by_region_category[region][category].append(amount)
for region, categories in sorted(by_region_category.items()):
for category, amounts in sorted(categories.items()):
total = sum(amounts)
print(f"{region} / {category}: ${total:,} ({len(amounts)} sales)")
# Output:
# East / Clothing: $1,200 (2 sales)
# East / Electronics: $2,100 (1 sales)
# West / Clothing: $600 (1 sales)
# West / Electronics: $2,400 (2 sales)
:::tip lambda: defaultdict(list) for Nested Groups
defaultdict(lambda: defaultdict(list)) creates a two-level grouping structure. The outer defaultdict factory is a lambda that returns a new defaultdict(list) each time a missing region is accessed. This pattern handles arbitrary nesting but note that the resulting object is not directly serializable to JSON - convert with json.dumps(dict(dict(d) for d in ...)) or use a recursive conversion.
:::
Part 4 - When NOT to Use defaultdict
The Silent Bug Problem
defaultdict auto-creates keys on access. This can hide real bugs:
from collections import defaultdict
config = defaultdict(str)
config['host'] = 'db.prod.internal'
config['port'] = '5432'
# Typo in key name - no error raised!
hostname = config['hsot'] # 'hsot' - typo!
print(hostname) # '' ← empty string, not an error
print('hsot' in config) # True ← typo key now exists in config!
# With a plain dict, this would be an immediate KeyError - easy to catch
plain_config = {'host': 'db.prod.internal', 'port': '5432'}
try:
hostname = plain_config['hsot']
except KeyError as e:
print(f"Bug caught immediately: {e}") # Bug caught immediately: 'hsot'
The Rule: defaultdict for Accumulation, dict for Lookup
# RIGHT: defaultdict for accumulation (you WANT missing keys to initialize)
from collections import defaultdict
index = defaultdict(list)
for doc_id, word in document_word_pairs:
index[word].append(doc_id) # Always appending - missing = new list
# WRONG: defaultdict for config/record lookup (missing key = bug, not default)
settings = defaultdict(bool)
settings['feature_x_enabled'] = True
# This silently returns False for a typo instead of raising KeyError:
if settings['feature_X_enabled']: # Capital X - typo!
enable_feature_x() # Never called - bug hidden
# RIGHT for config: use plain dict, or dict.get() with an explicit default
settings = {'feature_x_enabled': True}
if settings.get('feature_x_enabled', False):
enable_feature_x()
:::danger defaultdict Hides KeyError
KeyError is often a signal of a real bug - a wrong key, a typo, or a missing configuration. defaultdict silences it. Use defaultdict only for accumulation patterns (grouping, counting, adjacency lists) where a missing key genuinely means "start with an empty container." For lookup patterns (config, records, caches), use plain dict so bugs are caught immediately.
:::
Part 5 - collections.Counter: Frequency Counting as a First-Class Operation
Counter is a dict Subclass
Counter inherits from dict. Every key maps to its count (an integer). Accessing a missing key returns 0 - not a KeyError and not by creating the key permanently (unlike defaultdict):
from collections import Counter
# Create from iterable
words = "the quick brown fox jumps over the lazy dog the fox".split()
freq = Counter(words)
print(freq)
# Counter({'the': 3, 'fox': 2, 'quick': 1, 'brown': 1, ...})
print(freq['the']) # 3
print(freq['missing']) # 0 ← no KeyError, key NOT added to dict
print('missing' in freq) # False ← key was not stored
# Create from dict
inventory = Counter({'apple': 50, 'banana': 20, 'cherry': 5})
# Create from keyword arguments
votes = Counter(alice=120, bob=95, carol=87)
# Create empty, then update
c = Counter()
c.update("abracadabra") # Can update with string (counts each char)
print(c)
# Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
most_common(n) Uses heapq Internally
Counter.most_common(n) returns the n most common elements and their counts, from most to least frequent.
freq = Counter("abracadabra")
# All elements sorted by count (most common first)
print(freq.most_common())
# [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]
# Top 2 only
print(freq.most_common(2))
# [('a', 5), ('b', 2)]
Internally, when n is specified and n < len(counter) / 2, CPython uses heapq.nlargest - which runs in O(k + n log k) time where k is the total number of unique elements, n is how many you want. This is faster than sorting the full dict (O(k log k)) when you only need a small fraction of the top elements.
most_common(n) complexity:
n == None : O(k log k) - full sort of all k unique elements
n < k : O(k + n log k) - heapq.nlargest, faster for small n
n == 1 : O(k) - single linear scan for maximum
Where k = number of unique elements (not total elements)
import heapq
# What most_common() does internally for n < len(counter):
def most_common_manual(counter, n):
if n is None:
return sorted(counter.items(), key=lambda x: -x[1])
return heapq.nlargest(n, counter.items(), key=lambda x: x[1])
freq = Counter("abracadabra")
print(most_common_manual(freq, 2))
# [('a', 5), ('b', 2)]
Real-World: Word Frequency Analysis
from collections import Counter
import re
def top_words(text: str, n: int = 10, min_length: int = 4):
"""
Find top N words by frequency, excluding short words.
"""
# Tokenize: lowercase, keep only alphabetic words
words = re.findall(r'\b[a-z]{' + str(min_length) + r',}\b', text.lower())
counter = Counter(words)
return counter.most_common(n)
sample = """
Python is a high-level programming language known for its readability.
Python emphasizes code readability and simplicity. The Python community
values readable code over clever code. Many data science libraries
are written in Python because Python makes data manipulation easy.
"""
print(top_words(sample, n=5))
# [('python', 4), ('code', 3), ('readable', 2), ('readability', 2), ('data', 2)]
Part 6 - Counter Arithmetic: Operating on Frequency Data
Counter supports four arithmetic operators for combining frequency counts. This makes it a natural multiset - a generalization of set where elements can appear multiple times.
The Four Operators
from collections import Counter
inventory_A = Counter({'apples': 10, 'oranges': 5, 'bananas': 8})
inventory_B = Counter({'apples': 3, 'oranges': 7, 'grapes': 4})
# + : Combine counts (union-like, keeps all elements)
combined = inventory_A + inventory_B
print(combined)
# Counter({'apples': 13, 'bananas': 8, 'oranges': 12, 'grapes': 4})
# - : Subtract counts, drop zero/negative (difference)
after_sale = inventory_A - inventory_B
print(after_sale)
# Counter({'apples': 7, 'bananas': 8})
# oranges: 5-7 = -2 → dropped (negative → not included)
# grapes: 0-4 = -4 → dropped
# & : Intersection - minimum of corresponding counts
minimum_stock = inventory_A & inventory_B
print(minimum_stock)
# Counter({'apples': 3, 'oranges': 5})
# Only elements in BOTH, with the lower count
# | : Union - maximum of corresponding counts
maximum_stock = inventory_A | inventory_B
print(maximum_stock)
# Counter({'apples': 10, 'oranges': 7, 'bananas': 8, 'grapes': 4})
subtract() vs the - Operator
This is one of the most important distinctions in Counter:
from collections import Counter
# - operator: drops zero and negative counts
a = Counter({'x': 3, 'y': 1, 'z': 5})
b = Counter({'x': 5, 'y': 1, 'z': 2})
result_minus = a - b
print(result_minus)
# Counter({'z': 3})
# x: 3-5 = -2 → dropped
# y: 1-1 = 0 → dropped
# z: 5-2 = 3 → kept
# subtract(): KEEPS zero and negative counts
a_copy = Counter({'x': 3, 'y': 1, 'z': 5})
a_copy.subtract(b)
print(a_copy)
# Counter({'z': 3, 'y': 0, 'x': -2})
# All results kept, including zero and negative
When subtract() Matters
from collections import Counter
# Tracking inventory changes - need to know what went negative (backorders)
current_stock = Counter({'widget_A': 10, 'widget_B': 5, 'widget_C': 3})
order = Counter({'widget_A': 7, 'widget_B': 8, 'widget_C': 1})
# Use subtract to see actual delta including backorders
current_stock.subtract(order)
print(current_stock)
# Counter({'widget_A': 3, 'widget_C': 2, 'widget_B': -3})
# Separate fulfilled from backlogged
fulfilled = {k: v for k, v in current_stock.items() if v >= 0}
backorders = {k: -v for k, v in current_stock.items() if v < 0}
print("Fulfilled:", fulfilled) # {'widget_A': 3, 'widget_C': 2}
print("Backorders:", backorders) # {'widget_B': 3}
# With - operator, backorders disappear silently:
stock2 = Counter({'widget_A': 10, 'widget_B': 5, 'widget_C': 3})
print(stock2 - order)
# Counter({'widget_A': 3, 'widget_C': 2})
# widget_B: -3 silently dropped - backorder information lost!
:::warning subtract() vs - : Critical Difference
Use the - operator when you want only positive-count results (clean multiset difference). Use subtract() when you need to preserve zero and negative counts - inventory deficits, statistical residuals, debt tracking.
:::
Part 7 - Counter as a Multiset
A multiset is a set that allows duplicate elements. Counter is Python's multiset implementation:
from collections import Counter
# A/B test result aggregation
# Group A: 120 conversions, 400 non-conversions
# Group B: 95 conversions, 450 non-conversions
group_a = Counter(conversion=120, no_conversion=400)
group_b = Counter(conversion=95, no_conversion=450)
total = group_a + group_b
print(total)
# Counter({'no_conversion': 850, 'conversion': 215})
conversion_rate_a = group_a['conversion'] / sum(group_a.values())
conversion_rate_b = group_b['conversion'] / sum(group_b.values())
print(f"Group A conversion: {conversion_rate_a:.1%}") # 23.1%
print(f"Group B conversion: {conversion_rate_b:.1%}") # 17.4%
# Elements() - iterate over elements with repetition (like expanding the multiset)
small = Counter({'a': 3, 'b': 2})
print(list(small.elements())) # ['a', 'a', 'a', 'b', 'b']
# Useful for reconstructing a dataset from a frequency count
import random
population = list(small.elements())
sample = random.choices(population, k=10) # Sample from the distribution
Part 8 - Production Examples
Example 1: Vote Counting System
from collections import Counter
from typing import List, Tuple
def tally_votes(votes: List[str]) -> dict:
"""
Tally election votes with full results.
O(n) time where n = total votes.
"""
tally = Counter(votes)
total = sum(tally.values())
results = []
for candidate, count in tally.most_common():
pct = count / total * 100
results.append({
'candidate': candidate,
'votes': count,
'percentage': round(pct, 2),
})
return {
'results': results,
'total_votes': total,
'winner': tally.most_common(1)[0][0],
}
ballots = (
['Alice'] * 4200 +
['Bob'] * 3800 +
['Carol'] * 1500 +
['Dave'] * 500
)
import random
random.shuffle(ballots)
report = tally_votes(ballots)
print(f"Winner: {report['winner']}")
for r in report['results']:
bar = '█' * (r['votes'] // 200)
print(f" {r['candidate']:10s}: {r['votes']:5d} ({r['percentage']:5.1f}%) {bar}")
# Output:
# Winner: Alice
# Alice : 4200 (42.0%) █████████████████████
# Bob : 3800 (38.0%) ███████████████████
# Carol : 1500 (15.0%) ███████
# Dave : 500 ( 5.0%) ██
Example 2: Log Aggregation Pipeline
from collections import Counter, defaultdict
import re
from datetime import datetime
def analyze_access_log(log_lines: list) -> dict:
"""
Parse and aggregate web server access log.
Returns status code counts, top URLs, and error breakdown.
"""
# Counters for different dimensions
status_counts = Counter()
url_counts = Counter()
error_urls = defaultdict(Counter) # error_urls[status_code][url] = count
# Regex for common log format: IP - - [date] "METHOD URL HTTP" STATUS bytes
pattern = re.compile(r'"(?:GET|POST|PUT|DELETE) (/\S*) HTTP/[\d.]+" (\d{3})')
for line in log_lines:
match = pattern.search(line)
if not match:
continue
url, status = match.group(1), match.group(2)
status_counts[status] += 1
url_counts[url] += 1
if status.startswith('4') or status.startswith('5'):
error_urls[status][url] += 1
return {
'total_requests': sum(status_counts.values()),
'status_breakdown': dict(status_counts.most_common()),
'top_10_urls': url_counts.most_common(10),
'error_breakdown': {s: dict(urls.most_common(5)) for s, urls in error_urls.items()},
}
# Simulated log lines
sample_logs = [
'1.2.3.4 - - [01/Mar/2026] "GET /api/users HTTP/1.1" 200 1234',
'5.6.7.8 - - [01/Mar/2026] "GET /api/courses HTTP/1.1" 200 4567',
'9.10.11.12 - - [01/Mar/2026] "GET /api/users HTTP/1.1" 404 89',
'1.2.3.4 - - [01/Mar/2026] "POST /api/login HTTP/1.1" 200 256',
'13.14.15.16 - - [01/Mar/2026] "GET /api/users HTTP/1.1" 500 0',
]
report = analyze_access_log(sample_logs)
print(f"Total requests: {report['total_requests']}")
print(f"Status breakdown: {report['status_breakdown']}")
print(f"Top URLs: {report['top_10_urls']}")
Example 3: Inventory Management with Counter Arithmetic
from collections import Counter
class Inventory:
"""
Warehouse inventory tracker using Counter.
Supports receiving stock, fulfilling orders, and backorder detection.
"""
def __init__(self):
self._stock = Counter()
def receive(self, shipment: dict):
"""Add stock from a shipment."""
self._stock.update(shipment)
print(f"Received: {shipment}")
def fulfill_order(self, order: dict) -> dict:
"""
Attempt to fulfill an order.
Returns fulfilled items and backorders separately.
"""
order_counter = Counter(order)
before = Counter(self._stock)
self._stock.subtract(order_counter)
fulfilled = {}
backorders = {}
for item, qty in order.items():
available_before = before[item]
if available_before >= qty:
fulfilled[item] = qty
elif available_before > 0:
fulfilled[item] = available_before
backorders[item] = qty - available_before
else:
backorders[item] = qty
# Keep stock non-negative (remove backorder debt from stock)
if self._stock[item] < 0:
self._stock[item] = 0
return {'fulfilled': fulfilled, 'backorders': backorders}
def stock_report(self):
return dict(self._stock.most_common())
inv = Inventory()
inv.receive({'widget': 100, 'gadget': 50, 'doohickey': 10})
inv.receive({'widget': 20, 'sprocket': 30})
result = inv.fulfill_order({'widget': 80, 'gadget': 60, 'sprocket': 15})
print("Fulfilled:", result['fulfilled'])
# Fulfilled: {'widget': 80, 'gadget': 50, 'sprocket': 15}
print("Backorders:", result['backorders'])
# Backorders: {'gadget': 10}
print("Remaining stock:", inv.stock_report())
# Remaining stock: {'widget': 40, 'sprocket': 15, 'doohickey': 10, 'gadget': 0}
Interview Questions
Q1: What is defaultdict and how does it differ from a plain dict?
Answer: defaultdict is a subclass of dict that adds a default_factory callable. When a missing key is accessed, defaultdict calls default_factory() with no arguments, stores the result under that key, and returns it - instead of raising KeyError. A plain dict raises KeyError on missing key access. The key difference: defaultdict permanently stores the default value (the key is created), while dict.get(k, default) returns a fallback without storing it. Use defaultdict for accumulation patterns (grouping, counting), not for record lookup where KeyError should signal a bug.
Q2: Explain Counter.most_common(n) - what does it return, and what algorithm does it use internally?
Answer: most_common(n) returns a list of (element, count) tuples, ordered from most to least frequent, limited to the top n elements. When n is specified and less than the total unique elements, CPython uses heapq.nlargest(n, counter.items(), key=lambda x: x[1]) - which runs in O(k + n log k) where k is the number of unique elements. This is faster than a full sort O(k log k) when you only need a small fraction. When n is None (all elements), it uses a full sort. When n=1, an O(k) linear scan is used.
Q3: What is the difference between Counter.subtract() and the - operator?
Answer: The - operator computes a multiset difference that only keeps positive counts - zero and negative results are discarded. Counter({'a': 3}) - Counter({'a': 5}) returns Counter() because 3-5=-2 is negative and dropped. Counter.subtract() computes the same arithmetic but keeps all results including zero and negative - Counter({'a': 3}).subtract(Counter({'a': 5})) leaves Counter({'a': -2}). Use - when you want clean set-difference semantics. Use subtract() when the negative count carries meaning (e.g., backorders, deficits, debt).
Q4: When should you NOT use defaultdict?
Answer: When a KeyError on missing key access should be treated as a bug. For example: reading configuration values (a missing config key is a misconfiguration, not a "use default" signal), accessing records by ID (a missing ID means the record doesn't exist), or any lookup where missing means wrong input. defaultdict silently creates keys on access - a typo in a key name (config['hsot'] instead of config['host']) returns an empty default and writes the typo key into the dict, potentially masking the bug for a long time. Use defaultdict only when missing = "start with an empty container," not when missing = "this is an error."
Q5: How would you use Counter and defaultdict together to implement word co-occurrence counting?
Answer: Use defaultdict(Counter) - each word maps to a Counter of words that appear near it.
from collections import defaultdict, Counter
def build_cooccurrence(sentences: list, window: int = 2) -> dict:
cooccur = defaultdict(Counter)
for sentence in sentences:
words = sentence.lower().split()
for i, word in enumerate(words):
for j in range(max(0, i - window), min(len(words), i + window + 1)):
if i != j:
cooccur[word][words[j]] += 1
return cooccur
sentences = ["the cat sat on the mat", "the cat ate the rat"]
result = build_cooccurrence(sentences)
print(result['cat'].most_common(3))
# [('the', 4), ('sat', 1), ('ate', 1)]
Q6: What is the time and space complexity of building a Counter from a list of n elements with k unique values?
Answer: Time: O(n) - each element requires one dict lookup and one increment, both O(1) average due to hash table. Space: O(k) - the Counter stores at most k key-value pairs (one per unique element), regardless of n. For example, counting character frequencies in a 1 MB string still uses O(1) space (at most 128 ASCII or ~150k Unicode code points) regardless of string length. The most_common() call adds O(k log k) time for the sort. This makes Counter very memory-efficient for highly repetitive data.
Practice Challenges
Level 1 - Predict the Output
What does this print?
from collections import Counter
a = Counter("aabbcc")
b = Counter("bccdd")
print(a + b)
print(a - b)
print(a & b)
print(a | b)
Solution
Output:
Counter({'c': 4, 'b': 3, 'a': 2, 'd': 2})
Counter({'a': 2, 'b': 1})
Counter({'b': 1, 'c': 2})
Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2})
a = {'a':2, 'b':2, 'c':2},b = {'b':1, 'c':2, 'd':2}a + b: add all counts:{'a':2, 'b':3, 'c':4, 'd':2}a - b: subtract, drop ≤0:a='a':2 (stays), 'b':2-1=1 (stays), 'c':2-2=0 (dropped), d:0-2=-2 (dropped)→{'a':2, 'b':1}a & b: minimum per key, only keys in both:'b':min(2,1)=1, 'c':min(2,2)=2→{'b':1, 'c':2}a | b: maximum per key, all keys:'a':2, 'b':max(2,1)=2, 'c':2, 'd':2
Level 2 - Implement a Streaming Top-K Monitor
Build a class that processes a stream of events and always reports the top K most frequent event types. It must support:
record(event_type)- O(1) amortizedtop_k()- returns the k most frequent types seen so far
monitor = TopKMonitor(k=3)
for event in ["login", "click", "login", "purchase", "click", "login", "view", "click"]:
monitor.record(event)
print(monitor.top_k())
# [('login', 3), ('click', 3), ('purchase', 1)]
# (or any top 3 by count - ties resolved by arrival order)
Solution
from collections import Counter
class TopKMonitor:
"""
Tracks top-k most frequent events in a stream.
record(): O(1) amortized - Counter update is O(1) average
top_k(): O(n + k log n) where n = unique event types - uses heapq internally
Space: O(n) where n = unique event types seen
"""
def __init__(self, k: int):
self.k = k
self._counts = Counter()
def record(self, event_type: str):
self._counts[event_type] += 1 # O(1) average
def top_k(self):
return self._counts.most_common(self.k)
monitor = TopKMonitor(k=3)
events = ["login", "click", "login", "purchase", "click", "login", "view", "click"]
for event in events:
monitor.record(event)
print(monitor.top_k())
# [('login', 3), ('click', 3), ('purchase', 1)]
Why this works: Counter is a dict subclass - counter[key] += 1 is O(1) average. most_common(k) uses heapq.nlargest - O(n + k log n) where n is unique event types. For a monitor where k is small (e.g., top 10) but event types might be thousands, this is significantly faster than a full sort.
Extension: For a true streaming system, you'd want a sliding window (last T seconds). That requires a different approach: a deque(maxlen=T) to store recent events, and rebuilding the counter periodically or using a decrement-on-evict strategy.
Level 3 - Build a Dependency Graph Resolver
You have a list of tasks where each task may depend on other tasks. Build a system using defaultdict that:
- Builds a dependency graph
- Detects circular dependencies
- Returns a valid execution order (topological sort)
tasks = {
'deploy': ['test', 'build'],
'test': ['build', 'lint'],
'build': ['lint'],
'lint': [],
'notify': ['deploy'],
}
resolver = DependencyResolver(tasks)
print(resolver.execution_order())
# ['lint', 'build', 'test', 'deploy', 'notify']
# (lint has no deps → runs first; notify depends on deploy → runs last)
# Add a circular dependency:
tasks['lint'] = ['notify'] # lint → notify → deploy → test → lint → circular!
resolver2 = DependencyResolver(tasks)
print(resolver2.execution_order()) # Raises: CircularDependencyError
Solution
from collections import defaultdict, deque
class CircularDependencyError(Exception):
pass
class DependencyResolver:
"""
Topological sort using Kahn's algorithm (BFS-based).
Uses defaultdict for the graph and in-degree tracking.
Time: O(V + E) where V = tasks, E = total dependencies
Space: O(V + E) for the graph and in-degree table
"""
def __init__(self, tasks: dict):
self.tasks = tasks
# Build adjacency list: dependency → [tasks that depend on it]
self.graph = defaultdict(list) # reverse edges for topological sort
self.in_degree = defaultdict(int) # number of unresolved deps per task
# Initialize all tasks (including those with no deps)
for task in tasks:
self.in_degree[task] # Access to initialize to 0 via defaultdict
for task, deps in tasks.items():
for dep in deps:
self.graph[dep].append(task) # dep must run before task
self.in_degree[task] += 1
def execution_order(self) -> list:
"""
Kahn's algorithm: process nodes with in_degree=0 first.
If we can't process all nodes → circular dependency.
"""
# Start with all tasks that have no dependencies
queue = deque(
task for task, degree in self.in_degree.items() if degree == 0
)
order = []
while queue:
task = queue.popleft()
order.append(task)
# For each task that depended on this one:
for dependent in self.graph[task]:
self.in_degree[dependent] -= 1
if self.in_degree[dependent] == 0:
queue.append(dependent)
if len(order) != len(self.tasks):
remaining = [t for t in self.tasks if t not in order]
raise CircularDependencyError(
f"Circular dependency detected involving: {remaining}"
)
return order
# Test valid dependency graph
tasks = {
'deploy': ['test', 'build'],
'test': ['build', 'lint'],
'build': ['lint'],
'lint': [],
'notify': ['deploy'],
}
resolver = DependencyResolver(tasks)
print(resolver.execution_order())
# ['lint', 'build', 'test', 'deploy', 'notify']
# Test circular dependency
tasks_circular = {
'a': ['b'],
'b': ['c'],
'c': ['a'], # Circular: a → b → c → a
}
resolver2 = DependencyResolver(tasks_circular)
try:
resolver2.execution_order()
except CircularDependencyError as e:
print(e) # Circular dependency detected involving: ['a', 'b', 'c']
Key design decisions:
defaultdict(list)for the reverse adjacency graph - no initialization needed per nodedefaultdict(int)for in-degree counts - missing key auto-initializes to 0dequefor the BFS queue - O(1) popleft, appropriate for BFS (see Lesson 13)- Kahn's algorithm naturally detects cycles: if the output has fewer nodes than the input, a cycle exists among the unprocessed nodes
Quick Reference
| Tool | Key Pattern | Code | When to Use |
|---|---|---|---|
defaultdict(list) | Group items by key | dd[k].append(v) | One-to-many mapping, adjacency lists |
defaultdict(int) | Count with arithmetic | dd[k] += amount | Sums, deltas (use Counter for plain counts) |
defaultdict(set) | Unique values per key | dd[k].add(v) | Deduplicated grouping |
defaultdict(Counter) | Nested frequency counts | dd[k][v] += 1 | Co-occurrence, multi-dim counts |
dict.get(k, default) | Read with fallback | d.get(k, 0) | Single-read, don't create missing key |
dict.setdefault(k, []) | Get-or-create in plain dict | .setdefault(k, []).append(v) | When you need a plain dict |
Counter(iterable) | Count all elements | Counter(words) | Frequency analysis |
Counter.most_common(n) | Top N elements | .most_common(5) | Leaderboards, histograms |
Counter + Counter | Combine counts | a + b | Union of frequency data |
Counter - Counter | Remove counts (drop ≤0) | a - b | Clean multiset difference |
Counter.subtract() | Subtract (keep negatives) | a.subtract(b) | Deficit/backorder tracking |
Counter & Counter | Minimum counts | a & b | Common elements, minimum stock |
| `Counter | Counter` | Maximum counts | `a |
Counter.elements() | Expand to multiset | list(c.elements()) | Reconstruct a dataset from frequencies |
Key Takeaways
defaultdicteliminates theKeyErrorpattern for accumulation by calling a factory callable when a key is missing - uselistfor grouping,intfor counting,setfor unique-per-key, but never use it for configuration or record lookup where missing should be an errordict.get(k, default)returns a fallback without storing it;setdefault(k, default)returns and stores it;defaultdictautomatically stores the factory result - know which semantics you need before choosingCounteris a dict subclass that returns0for missing keys (without storing them), supportsmost_common(n)viaheapq.nlargestfor O(k + n log k) top-k retrieval, and behaves as a mathematical multiset with+,-,&,|operatorsCounter.subtract()preserves zero and negative counts; the-operator drops them - this distinction matters for inventory deficits, statistical residuals, and debt tracking- Both
Counteranddefaultdictare O(1) average for inserts and lookups - the hash table underneath is the same asdict, so the overhead is negligible at any realistic scale defaultdictcan hide bugs by silently returning defaults for typo'd keys - always audit whether a missing key is "expected" or "a mistake" before choosingdefaultdictover plaindict- For the groupby pattern (
defaultdict(list)), one pass over the data is sufficient - no sorting required and no intermediate storage
