Counter and Defaultdict - Smart Dictionaries That Handle Edge Cases

Reading time: ~18 minutes | Level: Foundation → Engineering

Here is a bug that ships to production regularly.

# Grouping users by country
users = [
    {"name": "Alice", "country": "US"},
    {"name": "Bob",   "country": "UK"},
    {"name": "Carol", "country": "US"},
]

groups = {}
for user in users:
    groups[user["country"]].append(user["name"])

print(groups)

Run this and you get:

KeyError: 'US'

The first time Python sees "US", groups["US"] raises a KeyError because the key does not exist yet. Every developer who writes grouping code hits this at least once. The standard fix adds two lines of boilerplate per grouping operation - check if key exists, initialize, then append.

The correct tool eliminates this entirely and is faster in the process.

from collections import defaultdict

groups = defaultdict(list)
for user in users:
    groups[user["country"]].append(user["name"])

print(dict(groups))
# {'US': ['Alice', 'Carol'], 'UK': ['Bob']}

No KeyError. No initialization check. No extra lines. This is what defaultdict and Counter are - purpose-built solutions to problems that recur constantly in real data pipelines.

What You Will Learn

How defaultdict eliminates KeyError with a default factory function - and what the factory actually does internally
The three core patterns: defaultdict(list), defaultdict(int), defaultdict(set) - when and why each
How defaultdict compares to dict.setdefault() and dict.get(k, default) - the precise tradeoffs
How defaultdict(list) implements groupby-style aggregation in one pass
How Counter works as a specialized dict optimized for frequency counting
Why Counter.most_common(n) uses heapq internally - and what that means for performance
Counter arithmetic: +, -, &, | operators for combining frequency data
The critical difference between Counter.subtract() and the - operator
When NOT to use defaultdict: how it can hide real bugs by silently returning defaults
Real-world patterns: word frequency, inventory management, vote counting, A/B test aggregation

Prerequisites

Python dictionaries and hash tables (Lesson 06 - Dictionaries Deep Dive)
Understanding of Python callables - functions are objects
Basic Big-O notation
Lesson 13 (deque and collections module overview) for context on the collections module

Part 1 - defaultdict: Eliminating the KeyError Pattern

What defaultdict Does Internally

defaultdict is a subclass of dict. It adds a single attribute: default_factory - a callable that is invoked with no arguments whenever a missing key is accessed.

The CPython implementation in _collectionsmodule.c overrides __missing__:

# Pseudocode - what CPython does when you access a missing key
class defaultdict(dict):
    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory   # store the callable

    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)    # Behaves like plain dict
        value = self.default_factory()           # Call with NO arguments
        self[key] = value                        # Store the result
        self.key = value
        return value

Key behavior: self.default_factory() is called with no arguments, the result is stored under the missing key, and then returned. The key is now present in the dict for all future accesses.

from collections import defaultdict

# The factory is a callable - called with no arguments
dd = defaultdict(list)
print(dd.default_factory)   # <class 'list'>

# Access a missing key - factory is called
value = dd['new_key']
print(value)              # []   ← list() was called
print(dd)                 # defaultdict(<class 'list'>, {'new_key': []})
print('new_key' in dd)    # True ← key was stored!

# Second access - key now exists, factory NOT called again
value2 = dd['new_key']
print(value2 is value)    # True ← same list object returned

The factory is called once per missing key, then the result is cached. This is why dd['key'].append(x) works: the first access creates a [] and stores it; subsequent accesses return the same list.

The Three Core Patterns

Pattern 1: defaultdict(list) - Grouping / One-to-Many Mapping

from collections import defaultdict

# Group transactions by user_id
transactions = [
    (101, 250.00, "purchase"),
    (202, 89.99,  "purchase"),
    (101, 15.00,  "refund"),
    (303, 120.00, "purchase"),
    (202, 45.00,  "purchase"),
    (101, 300.00, "purchase"),
]

by_user = defaultdict(list)
for user_id, amount, txn_type in transactions:
    by_user[user_id].append({'amount': amount, 'type': txn_type})

for user_id, txns in sorted(by_user.items()):
    total = sum(t['amount'] for t in txns if t['type'] == 'purchase')
    print(f"User {user_id}: {len(txns)} transactions, ${total:.2f} purchases")

# Output:
# User 101: 3 transactions, $550.00 purchases
# User 202: 2 transactions, $134.99 purchases
# User 303: 1 transactions, $120.00 purchases

Pattern 2: defaultdict(int) - Counting

from collections import defaultdict

# Count word frequencies - manual version
text = "the quick brown fox jumps over the lazy dog the fox"
words = text.split()

freq = defaultdict(int)
for word in words:
    freq[word] += 1    # int() returns 0; 0 + 1 = 1 for first occurrence

print(sorted(freq.items(), key=lambda x: -x[1]))
# [('the', 3), ('fox', 2), ('quick', 1), ('brown', 1), ...]

Note: for pure counting, Counter is better (more features, often faster). defaultdict(int) is useful when you're doing arithmetic beyond simple counting.

Pattern 3: defaultdict(set) - Unique Values per Key

from collections import defaultdict

# Find all unique tags used by each author
blog_posts = [
    ("Alice", "python"),
    ("Bob",   "python"),
    ("Alice", "data-science"),
    ("Alice", "python"),   # Duplicate - set handles it
    ("Carol", "devops"),
    ("Bob",   "data-science"),
]

author_tags = defaultdict(set)
for author, tag in blog_posts:
    author_tags[author].add(tag)

for author, tags in sorted(author_tags.items()):
    print(f"{author}: {sorted(tags)}")

# Output:
# Alice: ['data-science', 'python']
# Bob: ['data-science', 'python']
# Carol: ['devops']

Part 2 - defaultdict vs setdefault vs get

These three approaches solve similar problems but have distinct semantics. Choosing the wrong one creates subtle bugs.

Side-by-Side Comparison

data = [('a', 1), ('b', 2), ('a', 3), ('c', 4)]

# Method 1: dict.get() with explicit fallback
groups_get = {}
for k, v in data:
    groups_get[k] = groups_get.get(k, [])
    groups_get[k].append(v)
# Output: {'a': [1, 3], 'b': [2], 'c': [4]}
# Problem: get() returns a NEW [] but does not store it.
# The assignment groups_get[k] = groups_get.get(k, []) is needed each time.
# Easy to write incorrectly as: groups_get.get(k, []).append(v)  ← BUG (list lost!)

# Method 2: dict.setdefault()
groups_sd = {}
for k, v in data:
    groups_sd.setdefault(k, []).append(v)
# Output: {'a': [1, 3], 'b': [2], 'c': [4]}
# setdefault stores the default AND returns it - correct and concise.
# But: always constructs the default object even if key exists (minor waste).

# Method 3: defaultdict
from collections import defaultdict
groups_dd = defaultdict(list)
for k, v in data:
    groups_dd[k].append(v)
# Output: {'a': [1, 3], 'b': [2], 'c': [4]}
# Cleanest. Factory called only when key is missing.

When to Use Each

dict.get(k, default):
  ✓ One-off read with fallback - no storage needed
  ✓ When you DON'T want to create the key if missing
  ✗ Grouping/mutating pattern (must reassign; easy to forget)

dict.setdefault(k, default):
  ✓ When you want get-or-create in a plain dict (no import needed)
  ✓ Backward compatible code, third-party API that expects dict
  ✗ Slightly wasteful - default object created even if key exists
  ✗ Less readable for grouping patterns

defaultdict(factory):
  ✓ Grouping, aggregation, adjacency lists - the idiomatic choice
  ✓ Factory only called when needed
  ✗ Accessing a missing key CREATES it (can hide bugs - see Part 4)
  ✗ Type is defaultdict, not dict (may matter for type-strict APIs)

The get() Bug That Bites Beginners

# Common bug: using get() for mutable accumulation
data = [('a', 1), ('b', 2), ('a', 3)]
groups = {}
for k, v in data:
    groups.get(k, []).append(v)   # BUG: list returned but never stored!

print(groups)   # {} - empty! All appends went to throwaway lists.

# The fix - must assign:
groups = {}
for k, v in data:
    lst = groups.get(k, [])
    lst.append(v)
    groups[k] = lst   # Now store it

# Or simpler: use setdefault
groups = {}
for k, v in data:
    groups.setdefault(k, []).append(v)  # setdefault returns stored reference

Part 3 - defaultdict for Groupby-Style Aggregation

Building an Adjacency List for Graph Algorithms

from collections import defaultdict

def build_adjacency_list(edges):
    """
    Build undirected graph from edge list.
    defaultdict(list) is the standard representation.
    """
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)   # Undirected
    return graph

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
graph = build_adjacency_list(edges)

print(dict(graph))
# {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}

# BFS using the adjacency list
def bfs(graph, start):
    from collections import deque
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs(graph, 1))  # [1, 2, 3, 4, 5]

Multi-Level Grouping

from collections import defaultdict

# Group sales by (region, product_category)
sales = [
    ("West", "Electronics", 1500),
    ("East", "Clothing",    800),
    ("West", "Clothing",    600),
    ("East", "Electronics", 2100),
    ("West", "Electronics", 900),
    ("East", "Clothing",    400),
]

# Nested defaultdict - group by region, then by category
by_region_category = defaultdict(lambda: defaultdict(list))

for region, category, amount in sales:
    by_region_category[region][category].append(amount)

for region, categories in sorted(by_region_category.items()):
    for category, amounts in sorted(categories.items()):
        total = sum(amounts)
        print(f"{region} / {category}: ${total:,} ({len(amounts)} sales)")

# Output:
# East / Clothing: $1,200 (2 sales)
# East / Electronics: $2,100 (1 sales)
# West / Clothing: $600 (1 sales)
# West / Electronics: $2,400 (2 sales)

:::tip lambda: defaultdict(list) for Nested Groups defaultdict(lambda: defaultdict(list)) creates a two-level grouping structure. The outer defaultdict factory is a lambda that returns a new defaultdict(list) each time a missing region is accessed. This pattern handles arbitrary nesting but note that the resulting object is not directly serializable to JSON - convert with json.dumps(dict(dict(d) for d in ...)) or use a recursive conversion. :::

Part 4 - When NOT to Use defaultdict

The Silent Bug Problem

defaultdict auto-creates keys on access. This can hide real bugs:

from collections import defaultdict

config = defaultdict(str)
config['host'] = 'db.prod.internal'
config['port'] = '5432'

# Typo in key name - no error raised!
hostname = config['hsot']    # 'hsot' - typo!
print(hostname)              # ''   ← empty string, not an error
print('hsot' in config)      # True ← typo key now exists in config!

# With a plain dict, this would be an immediate KeyError - easy to catch
plain_config = {'host': 'db.prod.internal', 'port': '5432'}
try:
    hostname = plain_config['hsot']
except KeyError as e:
    print(f"Bug caught immediately: {e}")   # Bug caught immediately: 'hsot'

The Rule: defaultdict for Accumulation, dict for Lookup

# RIGHT: defaultdict for accumulation (you WANT missing keys to initialize)
from collections import defaultdict

index = defaultdict(list)
for doc_id, word in document_word_pairs:
    index[word].append(doc_id)   # Always appending - missing = new list

# WRONG: defaultdict for config/record lookup (missing key = bug, not default)
settings = defaultdict(bool)
settings['feature_x_enabled'] = True

# This silently returns False for a typo instead of raising KeyError:
if settings['feature_X_enabled']:   # Capital X - typo!
    enable_feature_x()              # Never called - bug hidden

# RIGHT for config: use plain dict, or dict.get() with an explicit default
settings = {'feature_x_enabled': True}
if settings.get('feature_x_enabled', False):
    enable_feature_x()

:::danger defaultdict Hides KeyError KeyError is often a signal of a real bug - a wrong key, a typo, or a missing configuration. defaultdict silences it. Use defaultdict only for accumulation patterns (grouping, counting, adjacency lists) where a missing key genuinely means "start with an empty container." For lookup patterns (config, records, caches), use plain dict so bugs are caught immediately. :::

Part 5 - collections.Counter: Frequency Counting as a First-Class Operation

Counter is a dict Subclass

Counter inherits from dict. Every key maps to its count (an integer). Accessing a missing key returns 0 - not a KeyError and not by creating the key permanently (unlike defaultdict):

from collections import Counter

# Create from iterable
words = "the quick brown fox jumps over the lazy dog the fox".split()
freq = Counter(words)

print(freq)
# Counter({'the': 3, 'fox': 2, 'quick': 1, 'brown': 1, ...})

print(freq['the'])        # 3
print(freq['missing'])    # 0 ← no KeyError, key NOT added to dict
print('missing' in freq)  # False ← key was not stored

# Create from dict
inventory = Counter({'apple': 50, 'banana': 20, 'cherry': 5})

# Create from keyword arguments
votes = Counter(alice=120, bob=95, carol=87)

# Create empty, then update
c = Counter()
c.update("abracadabra")   # Can update with string (counts each char)
print(c)
# Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})

most_common(n) Uses heapq Internally

Counter.most_common(n) returns the n most common elements and their counts, from most to least frequent.

freq = Counter("abracadabra")

# All elements sorted by count (most common first)
print(freq.most_common())
# [('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]

# Top 2 only
print(freq.most_common(2))
# [('a', 5), ('b', 2)]

Internally, when n is specified and n < len(counter) / 2, CPython uses heapq.nlargest - which runs in O(k + n log k) time where k is the total number of unique elements, n is how many you want. This is faster than sorting the full dict (O(k log k)) when you only need a small fraction of the top elements.

most_common(n) complexity:
  n == None : O(k log k)    - full sort of all k unique elements
  n < k     : O(k + n log k) - heapq.nlargest, faster for small n
  n == 1    : O(k)           - single linear scan for maximum

Where k = number of unique elements (not total elements)

import heapq

# What most_common() does internally for n < len(counter):
def most_common_manual(counter, n):
    if n is None:
        return sorted(counter.items(), key=lambda x: -x[1])
    return heapq.nlargest(n, counter.items(), key=lambda x: x[1])

freq = Counter("abracadabra")
print(most_common_manual(freq, 2))
# [('a', 5), ('b', 2)]

Real-World: Word Frequency Analysis

from collections import Counter
import re

def top_words(text: str, n: int = 10, min_length: int = 4):
    """
    Find top N words by frequency, excluding short words.
    """
    # Tokenize: lowercase, keep only alphabetic words
    words = re.findall(r'\b[a-z]{' + str(min_length) + r',}\b', text.lower())
    counter = Counter(words)
    return counter.most_common(n)

sample = """
Python is a high-level programming language known for its readability.
Python emphasizes code readability and simplicity. The Python community
values readable code over clever code. Many data science libraries
are written in Python because Python makes data manipulation easy.
"""

print(top_words(sample, n=5))
# [('python', 4), ('code', 3), ('readable', 2), ('readability', 2), ('data', 2)]

Part 6 - Counter Arithmetic: Operating on Frequency Data

Counter supports four arithmetic operators for combining frequency counts. This makes it a natural multiset - a generalization of set where elements can appear multiple times.

The Four Operators

from collections import Counter

inventory_A = Counter({'apples': 10, 'oranges': 5, 'bananas': 8})
inventory_B = Counter({'apples': 3, 'oranges': 7, 'grapes': 4})

# + : Combine counts (union-like, keeps all elements)
combined = inventory_A + inventory_B
print(combined)
# Counter({'apples': 13, 'bananas': 8, 'oranges': 12, 'grapes': 4})

# - : Subtract counts, drop zero/negative (difference)
after_sale = inventory_A - inventory_B
print(after_sale)
# Counter({'apples': 7, 'bananas': 8})
# oranges: 5-7 = -2 → dropped (negative → not included)
# grapes: 0-4 = -4 → dropped

# & : Intersection - minimum of corresponding counts
minimum_stock = inventory_A & inventory_B
print(minimum_stock)
# Counter({'apples': 3, 'oranges': 5})
# Only elements in BOTH, with the lower count

# | : Union - maximum of corresponding counts
maximum_stock = inventory_A | inventory_B
print(maximum_stock)
# Counter({'apples': 10, 'oranges': 7, 'bananas': 8, 'grapes': 4})

subtract() vs the - Operator

This is one of the most important distinctions in Counter:

from collections import Counter

# - operator: drops zero and negative counts
a = Counter({'x': 3, 'y': 1, 'z': 5})
b = Counter({'x': 5, 'y': 1, 'z': 2})

result_minus = a - b
print(result_minus)
# Counter({'z': 3})
# x: 3-5 = -2 → dropped
# y: 1-1 = 0  → dropped
# z: 5-2 = 3  → kept

# subtract(): KEEPS zero and negative counts
a_copy = Counter({'x': 3, 'y': 1, 'z': 5})
a_copy.subtract(b)
print(a_copy)
# Counter({'z': 3, 'y': 0, 'x': -2})
# All results kept, including zero and negative

When subtract() Matters

from collections import Counter

# Tracking inventory changes - need to know what went negative (backorders)
current_stock = Counter({'widget_A': 10, 'widget_B': 5, 'widget_C': 3})
order = Counter({'widget_A': 7, 'widget_B': 8, 'widget_C': 1})

# Use subtract to see actual delta including backorders
current_stock.subtract(order)
print(current_stock)
# Counter({'widget_A': 3, 'widget_C': 2, 'widget_B': -3})

# Separate fulfilled from backlogged
fulfilled = {k: v for k, v in current_stock.items() if v >= 0}
backorders = {k: -v for k, v in current_stock.items() if v < 0}

print("Fulfilled:", fulfilled)    # {'widget_A': 3, 'widget_C': 2}
print("Backorders:", backorders)  # {'widget_B': 3}

# With - operator, backorders disappear silently:
stock2 = Counter({'widget_A': 10, 'widget_B': 5, 'widget_C': 3})
print(stock2 - order)
# Counter({'widget_A': 3, 'widget_C': 2})
# widget_B: -3 silently dropped - backorder information lost!

:::warning subtract() vs - : Critical Difference Use the - operator when you want only positive-count results (clean multiset difference). Use subtract() when you need to preserve zero and negative counts - inventory deficits, statistical residuals, debt tracking. :::

Part 7 - Counter as a Multiset

A multiset is a set that allows duplicate elements. Counter is Python's multiset implementation:

from collections import Counter

# A/B test result aggregation
# Group A: 120 conversions, 400 non-conversions
# Group B: 95 conversions, 450 non-conversions
group_a = Counter(conversion=120, no_conversion=400)
group_b = Counter(conversion=95, no_conversion=450)

total = group_a + group_b
print(total)
# Counter({'no_conversion': 850, 'conversion': 215})

conversion_rate_a = group_a['conversion'] / sum(group_a.values())
conversion_rate_b = group_b['conversion'] / sum(group_b.values())
print(f"Group A conversion: {conversion_rate_a:.1%}")   # 23.1%
print(f"Group B conversion: {conversion_rate_b:.1%}")   # 17.4%

# Elements() - iterate over elements with repetition (like expanding the multiset)
small = Counter({'a': 3, 'b': 2})
print(list(small.elements()))   # ['a', 'a', 'a', 'b', 'b']

# Useful for reconstructing a dataset from a frequency count
import random
population = list(small.elements())
sample = random.choices(population, k=10)   # Sample from the distribution

Part 8 - Production Examples

Example 1: Vote Counting System

from collections import Counter
from typing import List, Tuple

def tally_votes(votes: List[str]) -> dict:
    """
    Tally election votes with full results.
    O(n) time where n = total votes.
    """
    tally = Counter(votes)
    total = sum(tally.values())

    results = []
    for candidate, count in tally.most_common():
        pct = count / total * 100
        results.append({
            'candidate': candidate,
            'votes': count,
            'percentage': round(pct, 2),
        })

    return {
        'results': results,
        'total_votes': total,
        'winner': tally.most_common(1)[0][0],
    }

ballots = (
    ['Alice'] * 4200 +
    ['Bob'] * 3800 +
    ['Carol'] * 1500 +
    ['Dave'] * 500
)

import random
random.shuffle(ballots)
report = tally_votes(ballots)

print(f"Winner: {report['winner']}")
for r in report['results']:
    bar = '█' * (r['votes'] // 200)
    print(f"  {r['candidate']:10s}: {r['votes']:5d} ({r['percentage']:5.1f}%) {bar}")

# Output:
# Winner: Alice
#   Alice     :  4200 (42.0%) █████████████████████
#   Bob       :  3800 (38.0%) ███████████████████
#   Carol     :  1500 (15.0%) ███████
#   Dave      :   500 ( 5.0%) ██

Example 2: Log Aggregation Pipeline

from collections import Counter, defaultdict
import re
from datetime import datetime

def analyze_access_log(log_lines: list) -> dict:
    """
    Parse and aggregate web server access log.
    Returns status code counts, top URLs, and error breakdown.
    """
    # Counters for different dimensions
    status_counts = Counter()
    url_counts = Counter()
    error_urls = defaultdict(Counter)   # error_urls[status_code][url] = count

    # Regex for common log format: IP - - [date] "METHOD URL HTTP" STATUS bytes
    pattern = re.compile(r'"(?:GET|POST|PUT|DELETE) (/\S*) HTTP/[\d.]+" (\d{3})')

    for line in log_lines:
        match = pattern.search(line)
        if not match:
            continue
        url, status = match.group(1), match.group(2)
        status_counts[status] += 1
        url_counts[url] += 1
        if status.startswith('4') or status.startswith('5'):
            error_urls[status][url] += 1

    return {
        'total_requests': sum(status_counts.values()),
        'status_breakdown': dict(status_counts.most_common()),
        'top_10_urls': url_counts.most_common(10),
        'error_breakdown': {s: dict(urls.most_common(5)) for s, urls in error_urls.items()},
    }

# Simulated log lines
sample_logs = [
    '1.2.3.4 - - [01/Mar/2026] "GET /api/users HTTP/1.1" 200 1234',
    '5.6.7.8 - - [01/Mar/2026] "GET /api/courses HTTP/1.1" 200 4567',
    '9.10.11.12 - - [01/Mar/2026] "GET /api/users HTTP/1.1" 404 89',
    '1.2.3.4 - - [01/Mar/2026] "POST /api/login HTTP/1.1" 200 256',
    '13.14.15.16 - - [01/Mar/2026] "GET /api/users HTTP/1.1" 500 0',
]

report = analyze_access_log(sample_logs)
print(f"Total requests: {report['total_requests']}")
print(f"Status breakdown: {report['status_breakdown']}")
print(f"Top URLs: {report['top_10_urls']}")

Example 3: Inventory Management with Counter Arithmetic

from collections import Counter

class Inventory:
    """
    Warehouse inventory tracker using Counter.
    Supports receiving stock, fulfilling orders, and backorder detection.
    """
    def __init__(self):
        self._stock = Counter()

    def receive(self, shipment: dict):
        """Add stock from a shipment."""
        self._stock.update(shipment)
        print(f"Received: {shipment}")

    def fulfill_order(self, order: dict) -> dict:
        """
        Attempt to fulfill an order.
        Returns fulfilled items and backorders separately.
        """
        order_counter = Counter(order)
        before = Counter(self._stock)
        self._stock.subtract(order_counter)

        fulfilled = {}
        backorders = {}

        for item, qty in order.items():
            available_before = before[item]
            if available_before >= qty:
                fulfilled[item] = qty
            elif available_before > 0:
                fulfilled[item] = available_before
                backorders[item] = qty - available_before
            else:
                backorders[item] = qty

            # Keep stock non-negative (remove backorder debt from stock)
            if self._stock[item] < 0:
                self._stock[item] = 0

        return {'fulfilled': fulfilled, 'backorders': backorders}

    def stock_report(self):
        return dict(self._stock.most_common())

inv = Inventory()
inv.receive({'widget': 100, 'gadget': 50, 'doohickey': 10})
inv.receive({'widget': 20, 'sprocket': 30})

result = inv.fulfill_order({'widget': 80, 'gadget': 60, 'sprocket': 15})
print("Fulfilled:", result['fulfilled'])
# Fulfilled: {'widget': 80, 'gadget': 50, 'sprocket': 15}
print("Backorders:", result['backorders'])
# Backorders: {'gadget': 10}
print("Remaining stock:", inv.stock_report())
# Remaining stock: {'widget': 40, 'sprocket': 15, 'doohickey': 10, 'gadget': 0}

Interview Questions

Q1: What is `defaultdict` and how does it differ from a plain `dict`?

Answer: defaultdict is a subclass of dict that adds a default_factory callable. When a missing key is accessed, defaultdict calls default_factory() with no arguments, stores the result under that key, and returns it - instead of raising KeyError. A plain dict raises KeyError on missing key access. The key difference: defaultdict permanently stores the default value (the key is created), while dict.get(k, default) returns a fallback without storing it. Use defaultdict for accumulation patterns (grouping, counting), not for record lookup where KeyError should signal a bug.

Q2: Explain `Counter.most_common(n)` - what does it return, and what algorithm does it use internally?

Answer: most_common(n) returns a list of (element, count) tuples, ordered from most to least frequent, limited to the top n elements. When n is specified and less than the total unique elements, CPython uses heapq.nlargest(n, counter.items(), key=lambda x: x[1]) - which runs in O(k + n log k) where k is the number of unique elements. This is faster than a full sort O(k log k) when you only need a small fraction. When n is None (all elements), it uses a full sort. When n=1, an O(k) linear scan is used.

Q3: What is the difference between `Counter.subtract()` and the `-` operator?

Answer: The - operator computes a multiset difference that only keeps positive counts - zero and negative results are discarded. Counter({'a': 3}) - Counter({'a': 5}) returns Counter() because 3-5=-2 is negative and dropped. Counter.subtract() computes the same arithmetic but keeps all results including zero and negative - Counter({'a': 3}).subtract(Counter({'a': 5})) leaves Counter({'a': -2}). Use - when you want clean set-difference semantics. Use subtract() when the negative count carries meaning (e.g., backorders, deficits, debt).

Q4: When should you NOT use `defaultdict`?

Answer: When a KeyError on missing key access should be treated as a bug. For example: reading configuration values (a missing config key is a misconfiguration, not a "use default" signal), accessing records by ID (a missing ID means the record doesn't exist), or any lookup where missing means wrong input. defaultdict silently creates keys on access - a typo in a key name (config['hsot'] instead of config['host']) returns an empty default and writes the typo key into the dict, potentially masking the bug for a long time. Use defaultdict only when missing = "start with an empty container," not when missing = "this is an error."

Q5: How would you use `Counter` and `defaultdict` together to implement word co-occurrence counting?

Answer: Use defaultdict(Counter) - each word maps to a Counter of words that appear near it.

from collections import defaultdict, Counter

def build_cooccurrence(sentences: list, window: int = 2) -> dict:
    cooccur = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    cooccur[word][words[j]] += 1
    return cooccur

sentences = ["the cat sat on the mat", "the cat ate the rat"]
result = build_cooccurrence(sentences)
print(result['cat'].most_common(3))
# [('the', 4), ('sat', 1), ('ate', 1)]

Q6: What is the time and space complexity of building a Counter from a list of n elements with k unique values?

Answer: Time: O(n) - each element requires one dict lookup and one increment, both O(1) average due to hash table. Space: O(k) - the Counter stores at most k key-value pairs (one per unique element), regardless of n. For example, counting character frequencies in a 1 MB string still uses O(1) space (at most 128 ASCII or ~150k Unicode code points) regardless of string length. The most_common() call adds O(k log k) time for the sort. This makes Counter very memory-efficient for highly repetitive data.

Practice Challenges

Level 1 - Predict the Output

What does this print?

from collections import Counter

a = Counter("aabbcc")
b = Counter("bccdd")

print(a + b)
print(a - b)
print(a & b)
print(a | b)

Solution

Output:

Counter({'c': 4, 'b': 3, 'a': 2, 'd': 2})
Counter({'a': 2, 'b': 1})
Counter({'b': 1, 'c': 2})
Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2})

a = {'a':2, 'b':2, 'c':2}, b = {'b':1, 'c':2, 'd':2}
a + b: add all counts: {'a':2, 'b':3, 'c':4, 'd':2}
a - b: subtract, drop ≤0: a='a':2 (stays), 'b':2-1=1 (stays), 'c':2-2=0 (dropped), d:0-2=-2 (dropped) → {'a':2, 'b':1}
a & b: minimum per key, only keys in both: 'b':min(2,1)=1, 'c':min(2,2)=2 → {'b':1, 'c':2}
a | b: maximum per key, all keys: 'a':2, 'b':max(2,1)=2, 'c':2, 'd':2

Level 2 - Implement a Streaming Top-K Monitor

Build a class that processes a stream of events and always reports the top K most frequent event types. It must support:

record(event_type) - O(1) amortized
top_k() - returns the k most frequent types seen so far

monitor = TopKMonitor(k=3)
for event in ["login", "click", "login", "purchase", "click", "login", "view", "click"]:
    monitor.record(event)

print(monitor.top_k())
# [('login', 3), ('click', 3), ('purchase', 1)]
# (or any top 3 by count - ties resolved by arrival order)

Solution

from collections import Counter

class TopKMonitor:
    """
    Tracks top-k most frequent events in a stream.
    record(): O(1) amortized - Counter update is O(1) average
    top_k(): O(n + k log n) where n = unique event types - uses heapq internally
    Space: O(n) where n = unique event types seen
    """
    def __init__(self, k: int):
        self.k = k
        self._counts = Counter()

    def record(self, event_type: str):
        self._counts[event_type] += 1   # O(1) average

    def top_k(self):
        return self._counts.most_common(self.k)

monitor = TopKMonitor(k=3)
events = ["login", "click", "login", "purchase", "click", "login", "view", "click"]
for event in events:
    monitor.record(event)

print(monitor.top_k())
# [('login', 3), ('click', 3), ('purchase', 1)]

Why this works: Counter is a dict subclass - counter[key] += 1 is O(1) average. most_common(k) uses heapq.nlargest - O(n + k log n) where n is unique event types. For a monitor where k is small (e.g., top 10) but event types might be thousands, this is significantly faster than a full sort.

Extension: For a true streaming system, you'd want a sliding window (last T seconds). That requires a different approach: a deque(maxlen=T) to store recent events, and rebuilding the counter periodically or using a decrement-on-evict strategy.

Level 3 - Build a Dependency Graph Resolver

You have a list of tasks where each task may depend on other tasks. Build a system using defaultdict that:

Builds a dependency graph
Detects circular dependencies
Returns a valid execution order (topological sort)

tasks = {
    'deploy': ['test', 'build'],
    'test': ['build', 'lint'],
    'build': ['lint'],
    'lint': [],
    'notify': ['deploy'],
}

resolver = DependencyResolver(tasks)
print(resolver.execution_order())
# ['lint', 'build', 'test', 'deploy', 'notify']
# (lint has no deps → runs first; notify depends on deploy → runs last)

# Add a circular dependency:
tasks['lint'] = ['notify']   # lint → notify → deploy → test → lint → circular!
resolver2 = DependencyResolver(tasks)
print(resolver2.execution_order())   # Raises: CircularDependencyError

Solution

from collections import defaultdict, deque

class CircularDependencyError(Exception):
    pass

class DependencyResolver:
    """
    Topological sort using Kahn's algorithm (BFS-based).
    Uses defaultdict for the graph and in-degree tracking.

    Time:  O(V + E) where V = tasks, E = total dependencies
    Space: O(V + E) for the graph and in-degree table
    """
    def __init__(self, tasks: dict):
        self.tasks = tasks
        # Build adjacency list: dependency → [tasks that depend on it]
        self.graph = defaultdict(list)    # reverse edges for topological sort
        self.in_degree = defaultdict(int) # number of unresolved deps per task

        # Initialize all tasks (including those with no deps)
        for task in tasks:
            self.in_degree[task]   # Access to initialize to 0 via defaultdict

        for task, deps in tasks.items():
            for dep in deps:
                self.graph[dep].append(task)  # dep must run before task
                self.in_degree[task] += 1

    def execution_order(self) -> list:
        """
        Kahn's algorithm: process nodes with in_degree=0 first.
        If we can't process all nodes → circular dependency.
        """
        # Start with all tasks that have no dependencies
        queue = deque(
            task for task, degree in self.in_degree.items() if degree == 0
        )
        order = []

        while queue:
            task = queue.popleft()
            order.append(task)

            # For each task that depended on this one:
            for dependent in self.graph[task]:
                self.in_degree[dependent] -= 1
                if self.in_degree[dependent] == 0:
                    queue.append(dependent)

        if len(order) != len(self.tasks):
            remaining = [t for t in self.tasks if t not in order]
            raise CircularDependencyError(
                f"Circular dependency detected involving: {remaining}"
            )

        return order

# Test valid dependency graph
tasks = {
    'deploy': ['test', 'build'],
    'test': ['build', 'lint'],
    'build': ['lint'],
    'lint': [],
    'notify': ['deploy'],
}
resolver = DependencyResolver(tasks)
print(resolver.execution_order())
# ['lint', 'build', 'test', 'deploy', 'notify']

# Test circular dependency
tasks_circular = {
    'a': ['b'],
    'b': ['c'],
    'c': ['a'],   # Circular: a → b → c → a
}
resolver2 = DependencyResolver(tasks_circular)
try:
    resolver2.execution_order()
except CircularDependencyError as e:
    print(e)  # Circular dependency detected involving: ['a', 'b', 'c']

Key design decisions:

defaultdict(list) for the reverse adjacency graph - no initialization needed per node
defaultdict(int) for in-degree counts - missing key auto-initializes to 0
deque for the BFS queue - O(1) popleft, appropriate for BFS (see Lesson 13)
Kahn's algorithm naturally detects cycles: if the output has fewer nodes than the input, a cycle exists among the unprocessed nodes

Quick Reference

Tool	Key Pattern	Code	When to Use
`defaultdict(list)`	Group items by key	`dd[k].append(v)`	One-to-many mapping, adjacency lists
`defaultdict(int)`	Count with arithmetic	`dd[k] += amount`	Sums, deltas (use `Counter` for plain counts)
`defaultdict(set)`	Unique values per key	`dd[k].add(v)`	Deduplicated grouping
`defaultdict(Counter)`	Nested frequency counts	`dd[k][v] += 1`	Co-occurrence, multi-dim counts
`dict.get(k, default)`	Read with fallback	`d.get(k, 0)`	Single-read, don't create missing key
`dict.setdefault(k, [])`	Get-or-create in plain dict	`.setdefault(k, []).append(v)`	When you need a plain dict
`Counter(iterable)`	Count all elements	`Counter(words)`	Frequency analysis
`Counter.most_common(n)`	Top N elements	`.most_common(5)`	Leaderboards, histograms
`Counter + Counter`	Combine counts	`a + b`	Union of frequency data
`Counter - Counter`	Remove counts (drop ≤0)	`a - b`	Clean multiset difference
`Counter.subtract()`	Subtract (keep negatives)	`a.subtract(b)`	Deficit/backorder tracking
`Counter & Counter`	Minimum counts	`a & b`	Common elements, minimum stock
`Counter	Counter`	Maximum counts	`a
`Counter.elements()`	Expand to multiset	`list(c.elements())`	Reconstruct a dataset from frequencies

Key Takeaways

defaultdict eliminates the KeyError pattern for accumulation by calling a factory callable when a key is missing - use list for grouping, int for counting, set for unique-per-key, but never use it for configuration or record lookup where missing should be an error
dict.get(k, default) returns a fallback without storing it; setdefault(k, default) returns and stores it; defaultdict automatically stores the factory result - know which semantics you need before choosing
Counter is a dict subclass that returns 0 for missing keys (without storing them), supports most_common(n) via heapq.nlargest for O(k + n log k) top-k retrieval, and behaves as a mathematical multiset with +, -, &, | operators
Counter.subtract() preserves zero and negative counts; the - operator drops them - this distinction matters for inventory deficits, statistical residuals, and debt tracking
Both Counter and defaultdict are O(1) average for inserts and lookups - the hash table underneath is the same as dict, so the overhead is negligible at any realistic scale
defaultdict can hide bugs by silently returning defaults for typo'd keys - always audit whether a missing key is "expected" or "a mistake" before choosing defaultdict over plain dict
For the groupby pattern (defaultdict(list)), one pass over the data is sufficient - no sorting required and no intermediate storage

What You Will Learn​

Prerequisites​

Part 1 - defaultdict: Eliminating the KeyError Pattern​

What defaultdict Does Internally​

The Three Core Patterns​

Part 2 - defaultdict vs setdefault vs get​

Side-by-Side Comparison​

When to Use Each​

The get() Bug That Bites Beginners​

Part 3 - defaultdict for Groupby-Style Aggregation​

Building an Adjacency List for Graph Algorithms​

Multi-Level Grouping​

Part 4 - When NOT to Use defaultdict​

The Silent Bug Problem​

The Rule: defaultdict for Accumulation, dict for Lookup​

Part 5 - collections.Counter: Frequency Counting as a First-Class Operation​

Counter is a dict Subclass​

most_common(n) Uses heapq Internally​

Real-World: Word Frequency Analysis​

Part 6 - Counter Arithmetic: Operating on Frequency Data​

The Four Operators​

subtract() vs the - Operator​

When subtract() Matters​

Part 7 - Counter as a Multiset​

Part 8 - Production Examples​

Example 1: Vote Counting System​

Example 2: Log Aggregation Pipeline​

Example 3: Inventory Management with Counter Arithmetic​

Interview Questions​

Q1: What is defaultdict and how does it differ from a plain dict?​

Q2: Explain Counter.most_common(n) - what does it return, and what algorithm does it use internally?​

Q3: What is the difference between Counter.subtract() and the - operator?​

Q4: When should you NOT use defaultdict?​

Q5: How would you use Counter and defaultdict together to implement word co-occurrence counting?​

Q6: What is the time and space complexity of building a Counter from a list of n elements with k unique values?​

Practice Challenges​

Level 1 - Predict the Output​

Level 2 - Implement a Streaming Top-K Monitor​

Level 3 - Build a Dependency Graph Resolver​

Quick Reference​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - defaultdict: Eliminating the KeyError Pattern

What defaultdict Does Internally

The Three Core Patterns

Part 2 - defaultdict vs setdefault vs get

Side-by-Side Comparison

When to Use Each

The get() Bug That Bites Beginners

Part 3 - defaultdict for Groupby-Style Aggregation

Building an Adjacency List for Graph Algorithms

Multi-Level Grouping

Part 4 - When NOT to Use defaultdict

The Silent Bug Problem

The Rule: defaultdict for Accumulation, dict for Lookup

Part 5 - collections.Counter: Frequency Counting as a First-Class Operation

Counter is a dict Subclass

most_common(n) Uses heapq Internally

Real-World: Word Frequency Analysis

Part 6 - Counter Arithmetic: Operating on Frequency Data

The Four Operators

subtract() vs the - Operator

When subtract() Matters

Part 7 - Counter as a Multiset

Part 8 - Production Examples

Example 1: Vote Counting System

Example 2: Log Aggregation Pipeline

Example 3: Inventory Management with Counter Arithmetic

Interview Questions

Q1: What is `defaultdict` and how does it differ from a plain `dict`?

Q2: Explain `Counter.most_common(n)` - what does it return, and what algorithm does it use internally?

Q3: What is the difference between `Counter.subtract()` and the `-` operator?

Q4: When should you NOT use `defaultdict`?

Q5: How would you use `Counter` and `defaultdict` together to implement word co-occurrence counting?

Q6: What is the time and space complexity of building a Counter from a list of n elements with k unique values?

Practice Challenges

Level 1 - Predict the Output

Level 2 - Implement a Streaming Top-K Monitor

Level 3 - Build a Dependency Graph Resolver

Quick Reference

Key Takeaways