Python Counter and Defaultdict Practice Problems & Exercises

Practice: Counter and Defaultdict

11 problems4 Easy4 Medium3 Hard⏱ 40-55 min

Easy

#1Predict Counter Arithmetic OutputEasy

counterarithmeticpredict-output

Predict the output of each Counter arithmetic operation before running the code. Think carefully about what happens to keys with zero or negative results.

Python

from collections import Counter

a = Counter("aabbcc")
b = Counter("bccdd")

print(a + b)
print(a - b)
print(a & b)
print(a | b)

Solution

Counter({'c': 4, 'b': 3, 'a': 2, 'd': 2})
Counter({'a': 2, 'b': 1})
Counter({'c': 2, 'b': 1})
Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2})

Breakdown:

a = Counter({'a': 2, 'b': 2, 'c': 2}) and b = Counter({'b': 1, 'c': 2, 'd': 2})
a + b: Add all counts element-wise. a=2, b=3, c=4, d=2.
a - b: Subtract and drop zero/negative. a: 2-0=2 (kept), b: 2-1=1 (kept), c: 2-2=0 (dropped), d: 0-2=-2 (dropped).
a & b: Minimum of each key, only keys in both. b: min(2,1)=1, c: min(2,2)=2. Key a is not in b, key d is not in a -- both excluded.
a | b: Maximum of each key, all keys. a: max(2,0)=2, b: max(2,1)=2, c: max(2,2)=2, d: max(0,2)=2.

Expected Output

Counter({'c': 4, 'b': 3, 'a': 2, 'd': 2})
Counter({'a': 2, 'b': 1})
Counter({'c': 2, 'b': 1})
Counter({'a': 2, 'b': 2, 'c': 2, 'd': 2})

Hints

Hint 1: Counter("aabbcc") counts each character: a=2, b=2, c=2. Counter("bccdd") gives b=1, c=2, d=2.

Hint 2: The - operator drops results that are zero or negative. The & operator takes the minimum count per key (only keys in both). The | operator takes the maximum count per key (all keys).

#2Basic Word Frequency with CounterEasy

countermost_commonfrequency

Write a function that takes a string and an integer top_n, and returns the top_n most frequent words (case-insensitive) as a list of (word, count) tuples.

Python

from collections import Counter

def word_frequency(text: str, top_n: int) -> list:
    words = text.lower().split()
    return Counter(words).most_common(top_n)

# Test cases
print(word_frequency("the cat sat on the mat the cat", 2))
print(word_frequency("Python is great and python code is readable python code python", 2))

Solution

from collections import Counter

def word_frequency(text: str, top_n: int) -> list:
    words = text.lower().split()
    return Counter(words).most_common(top_n)

Key points:

text.lower().split() normalizes case and tokenizes in one chain.
Counter(words) builds the frequency table in O(n) time.
.most_common(top_n) uses heapq.nlargest internally -- O(k + n log k) where k is unique words. For small top_n relative to vocabulary size, this is faster than sorting the entire counter.

Starter Code

from collections import Counter

def word_frequency(text: str, top_n: int) -> list:
  """
  Return the top_n most frequent words (lowercased) as a list
  of (word, count) tuples sorted by count descending.

  Example:
      word_frequency("the cat sat on the mat the cat", 2)
      -> [('the', 3), ('cat', 2)]
  """
  # TODO: implement this
  pass

Expected Output

[('the', 3), ('cat', 2)]
[('python', 4), ('code', 3)]

Hints

Hint 1: Split the text into words with .split() and convert to lowercase with .lower() before counting.

Hint 2: Counter has a .most_common(n) method that returns exactly what you need.

#3Group Items with defaultdict(list)Easy

defaultdictlistgrouping

Write a function that groups a list of words by their length using defaultdict(list). Return a plain dict where keys are lengths and values are sorted lists of words with that length.

Python

from collections import defaultdict

def group_by_length(words: list) -> dict:
    groups = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return {k: sorted(v) for k, v in groups.items()}

# Test cases
print(group_by_length(["hi", "the", "at", "dog", "I"]))
print(group_by_length(["model", "data", "train", "batch", "code", "deploy"]))

Solution

from collections import defaultdict

def group_by_length(words: list) -> dict:
    groups = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return {k: sorted(v) for k, v in groups.items()}

Key points:

defaultdict(list) eliminates the need for if key not in groups: groups[key] = [] before appending.
Converting back to a plain dict with dict() or a dict comprehension is good practice when returning from a function -- callers may not expect a defaultdict.
Sorting within each group costs O(m log m) per group where m is the group size, but the total work across all groups is O(n log n) in the worst case.

Starter Code

from collections import defaultdict

def group_by_length(words: list) -> dict:
  """
  Group words by their length.
  Return a plain dict mapping length -> sorted list of words.

  Example:
      group_by_length(["hi", "the", "at", "dog", "I"])
      -> {2: ['at', 'hi'], 3: ['dog', 'the'], 1: ['I']}
  """
  # TODO: implement this
  pass

Expected Output

{2: ['at', 'hi'], 3: ['dog', 'the'], 1: ['I']}
{4: ['code', 'data'], 5: ['batch', 'model', 'train'], 6: ['deploy']}

Hints

Hint 1: Use defaultdict(list) so you can append without checking if the key exists first.

Hint 2: Sort the list for each length group before returning. Convert the defaultdict to a plain dict with dict().

#4Predict defaultdict BehaviorEasy

defaultdictmissing-keypredict-output

Predict the output of this code. Pay attention to which operations create keys and which do not.

Python

from collections import Counter, defaultdict

# Part 1: Counter missing-key behavior
c = Counter(["a", "a", "b"])
print(c["z"])         # What does this print?
print("z" in c)       # Is "z" now in the Counter?
print(len(c))         # How many keys?

# Part 2: defaultdict missing-key behavior
dd = defaultdict(list)
dd["a"].append(1)
dd["a"].append(2)
dd["b"].append(3)
_ = dd["c"]           # Access without appending
print(dd["b"])
print("c" in dd)      # Is "c" in the defaultdict?
print(dict(dd))

Solution

0
False
3
[3]
True
{'a': [1, 2], 'b': [3], 'c': []}

Key distinction:

Counter: Accessing a missing key returns 0 but does not store the key. c["z"] returns 0, and "z" in c is still False. The Counter has 3 keys: a, b (from the input), and nothing else.
defaultdict: Accessing a missing key creates and stores the default value. dd["c"] triggers __missing__, which calls list(), stores [] under key "c", and returns it. Now "c" in dd is True, and the dict shows 'c': [].

This is exactly why defaultdict can hide bugs -- merely reading a key is enough to create it.

Expected Output

0
False
3
['x']
True
{'a': [1, 2], 'b': [3], 'c': []}

Hints

Hint 1: Counter returns 0 for missing keys but does NOT store the key. defaultdict(list) DOES store the key when accessed.

Hint 2: When you access dd["c"] on a defaultdict(list), it creates an empty list and stores it -- even if you never append anything to it.

Medium

#5Character Frequency HistogramMedium

countermost_commonstring-processing

Write a function that takes a string and returns a character frequency histogram as a multiline string. Only count letters (case-insensitive). Sort by frequency descending, then alphabetically for ties.

Python

from collections import Counter

def char_histogram(text: str) -> str:
    counts = Counter(c for c in text.lower() if c.isalpha())
    sorted_chars = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    lines = []
    for char, count in sorted_chars:
        lines.append(char + " | " + "#" * count)
    return "\n".join(lines)

# Test
print(char_histogram("Hello World!"))

Solution

from collections import Counter

def char_histogram(text: str) -> str:
    counts = Counter(c for c in text.lower() if c.isalpha())
    sorted_chars = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    lines = []
    for char, count in sorted_chars:
        lines.append(char + " | " + "#" * count)
    return "\n".join(lines)

Key points:

Generator expression (c for c in text.lower() if c.isalpha()) filters and normalizes in one pass -- no intermediate list needed.
sorted(items, key=lambda x: (-x[1], x[0])) sorts by count descending (negative trick), then by character ascending for stable tie-breaking.
We could use most_common() but it does not guarantee alphabetical order for ties. Explicit sorting gives us full control.

Starter Code

from collections import Counter

def char_histogram(text: str) -> str:
  """
  Build a horizontal histogram of character frequencies (letters only,
  case-insensitive). Each line: 'c | ###' where # count equals frequency.
  Sorted by frequency descending, then alphabetically for ties.

  Example:
      char_histogram("abba")
      ->
      a | ##
      b | ##
  """
  # TODO: implement this
  pass

Expected Output

e | ####
h | ##
l | ##
o | ##
d | #
r | #
w | #

Hints

Hint 1: Filter out non-alpha characters with a generator expression or str.isalpha(). Then use Counter on the lowercase version.

Hint 2: For sorting by frequency descending then alphabetically, use sorted() with a key that returns (-count, char).

#6Inventory Deficit Tracker with subtract()Medium

countersubtractarithmetic

Write a function that processes an order against inventory stock. Use Counter.subtract() to track deficits (negative counts mean backorders).

Python

from collections import Counter

def process_order(stock: dict, order: dict) -> dict:
    stock_counter = Counter(stock)
    order_counter = Counter(order)

    # Use subtract to preserve negative values (backorders)
    result = Counter(stock)
    result.subtract(order_counter)

    fulfilled = {}
    backorders = {}
    remaining = {}

    all_items = set(stock) | set(order)
    for item in all_items:
        ordered = order_counter[item]
        available = stock_counter[item]
        after = result[item]

        if ordered > 0:
            fulfilled[item] = min(ordered, available)
        if after < 0:
            backorders[item] = -after
        elif after > 0:
            remaining[item] = after

    return {
        'fulfilled': fulfilled,
        'backorders': backorders,
        'remaining': remaining,
    }

# Test
stock = {'widget': 15, 'bolt': 5, 'gear': 7}
order = {'widget': 10, 'bolt': 8, 'nut': 4}
print(process_order(stock, order))

Solution

from collections import Counter

def process_order(stock: dict, order: dict) -> dict:
    stock_counter = Counter(stock)
    order_counter = Counter(order)

    result = Counter(stock)
    result.subtract(order_counter)

    fulfilled = {}
    backorders = {}
    remaining = {}

    all_items = set(stock) | set(order)
    for item in all_items:
        ordered = order_counter[item]
        available = stock_counter[item]
        after = result[item]

        if ordered > 0:
            fulfilled[item] = min(ordered, available)
        if after < 0:
            backorders[item] = -after
        elif after > 0:
            remaining[item] = after

    return {
        'fulfilled': fulfilled,
        'backorders': backorders,
        'remaining': remaining,
    }

Why subtract() and not the - operator:

Counter({'bolt': 5}) - Counter({'bolt': 8}) returns Counter() -- the -3 is silently dropped. You lose the backorder information entirely.
Counter({'bolt': 5}).subtract(Counter({'bolt': 8})) keeps Counter({'bolt': -3}) -- the negative value tells you exactly how many units are backordered.
This is the critical difference: - is for clean multiset math; subtract() is for real-world tracking where deficits matter.

Starter Code

from collections import Counter

def process_order(stock: dict, order: dict) -> dict:
  """
  Process an order against current stock.
  Return a dict with:
    'fulfilled': items where stock >= order quantity (with qty fulfilled)
    'backorders': items where stock < order quantity (with deficit)
    'remaining': remaining stock (zero or positive only)

  Example:
      process_order({'a': 10, 'b': 3}, {'a': 5, 'b': 7, 'c': 2})
      -> {
          'fulfilled': {'a': 5, 'b': 3},
          'backorders': {'b': 4, 'c': 2},
          'remaining': {'a': 5}
      }
  """
  # TODO: implement this
  pass

Expected Output

{'fulfilled': {'widget': 10, 'bolt': 5}, 'backorders': {'bolt': 3, 'nut': 4}, 'remaining': {'widget': 5, 'gear': 7}}

Hints

Hint 1: Use Counter.subtract() (not the - operator) to keep negative values, which represent backorder deficits.

Hint 2: After subtracting, iterate the result: positive values are remaining stock, negative values are backorders. For fulfilled quantities, use min(original_stock, order_quantity) for each item.

#7Unique Tags per Author with defaultdict(set)Medium

defaultdictsetdeduplication

Given a list of (author, tag) tuples (with possible duplicates), return a dict mapping each author to a sorted list of their unique tags.

Python

from collections import defaultdict

def unique_tags_per_author(posts: list) -> dict:
    author_tags = defaultdict(set)
    for author, tag in posts:
        author_tags[author].add(tag)
    return {a: sorted(tags) for a, tags in sorted(author_tags.items())}

# Test
posts = [
    ("Alice", "python"), ("Bob", "python"), ("Alice", "ml"),
    ("Alice", "python"), ("Carol", "devops"), ("Bob", "go"),
    ("Alice", "data"), ("Carol", "python"), ("Bob", "python"),
]
print(unique_tags_per_author(posts))

Solution

from collections import defaultdict

def unique_tags_per_author(posts: list) -> dict:
    author_tags = defaultdict(set)
    for author, tag in posts:
        author_tags[author].add(tag)
    return {a: sorted(tags) for a, tags in sorted(author_tags.items())}

Why defaultdict(set) and not defaultdict(list):

defaultdict(list) with .append() would store duplicates: Alice -> ['python', 'ml', 'python', 'data']. You would need to deduplicate after the fact.
defaultdict(set) with .add() handles deduplication automatically at insertion time. Each .add() is O(1) average, and duplicates are silently ignored.
Choose the factory type based on whether you need order/duplicates (list) or uniqueness (set).

Starter Code

from collections import defaultdict

def unique_tags_per_author(posts: list) -> dict:
  """
  Given a list of (author, tag) tuples, return a dict mapping
  each author to a sorted list of their unique tags.

  Example:
      unique_tags_per_author([("A", "py"), ("A", "py"), ("A", "ml"), ("B", "py")])
      -> {'A': ['ml', 'py'], 'B': ['py']}
  """
  # TODO: implement this
  pass

Expected Output

{'Alice': ['data', 'ml', 'python'], 'Bob': ['go', 'python'], 'Carol': ['devops', 'python']}

Hints

Hint 1: Use defaultdict(set) so duplicates are automatically handled by set.add().

Hint 2: Convert each set to a sorted list before returning, and convert the defaultdict to a plain dict.

#8Anagram GrouperMedium

defaultdictlistsortinganagrams

Group words that are anagrams of each other using defaultdict(list). Two words are anagrams if they contain the exact same characters in any order.

Python

from collections import defaultdict

def group_anagrams(words: list) -> list:
    groups = defaultdict(list)
    for word in words:
        key = tuple(sorted(word))
        groups[key].append(word)
    result = [sorted(group) for group in groups.values()]
    result.sort(key=lambda g: g[0])
    return result

# Test cases
print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
print(group_anagrams(["stream", "alerts", "parsing", "alters", "sparing", "aelrst", "rasping"]))

Solution

from collections import defaultdict

def group_anagrams(words: list) -> list:
    groups = defaultdict(list)
    for word in words:
        key = tuple(sorted(word))
        groups[key].append(word)
    result = [sorted(group) for group in groups.values()]
    result.sort(key=lambda g: g[0])
    return result

Key points:

Grouping key: tuple(sorted(word)) converts "eat" to ('a', 'e', 't'). All anagrams produce the same tuple. We use tuple because lists are not hashable and cannot be dict keys.
defaultdict(list) eliminates the need to check if the key exists before appending.
Time complexity: O(n * k log k) where n is the number of words and k is the maximum word length. The sorting of each word is O(k log k), and we do it n times.
This is a classic interview problem (LeetCode #49) and defaultdict(list) is the idiomatic Python solution.

Starter Code

from collections import defaultdict

def group_anagrams(words: list) -> list:
  """
  Group words that are anagrams of each other.
  Return a list of groups (each group is a sorted list of words).
  The groups should be sorted by the first word in each group.

  Example:
      group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"])
      -> [['ate', 'eat', 'tea'], ['bat'], ['nat', 'tan']]
  """
  # TODO: implement this
  pass

Expected Output

[['ate', 'eat', 'tea'], ['bat'], ['nat', 'tan']]
[['aelrst', 'alerts', 'alters'], ['parsing', 'rasping', 'sparing'], ['stream']]

Hints

Hint 1: Two words are anagrams if they have the same sorted characters. Use tuple(sorted(word)) as the grouping key.

Hint 2: defaultdict(list) lets you append each word to the group for its sorted-character key without initialization checks.

Hard

#9Nested Grouping with defaultdict of defaultdictHard

defaultdictnestedaggregationlambda

Build a two-level sales summary using nested defaultdict. Group sales records by region and category, then compute totals, counts, and averages.

Python

from collections import defaultdict

def sales_summary(records: list) -> dict:
    nested = defaultdict(lambda: defaultdict(list))
    for region, category, amount in records:
        nested[region][category].append(amount)

    result = {}
    for region in sorted(nested):
        result[region] = {}
        for category in sorted(nested[region]):
            amounts = nested[region][category]
            total = sum(amounts)
            count = len(amounts)
            result[region][category] = {
                'total': total,
                'count': count,
                'avg': total / count,
            }
    return result

# Test
sales = [
    ("West", "Electronics", 1500),
    ("East", "Clothing", 800),
    ("West", "Clothing", 600),
    ("East", "Electronics", 2100),
    ("West", "Electronics", 900),
    ("East", "Clothing", 400),
]

summary = sales_summary(sales)
for region in sorted(summary):
    print(region + ":")
    for cat in sorted(summary[region]):
        s = summary[region][cat]
        print("  " + cat + ": total=" + str(s['total']) + ", count=" + str(s['count']) + ", avg=" + format(s['avg'], '.2f'))

Solution

from collections import defaultdict

def sales_summary(records: list) -> dict:
    nested = defaultdict(lambda: defaultdict(list))
    for region, category, amount in records:
        nested[region][category].append(amount)

    result = {}
    for region in sorted(nested):
        result[region] = {}
        for category in sorted(nested[region]):
            amounts = nested[region][category]
            total = sum(amounts)
            count = len(amounts)
            result[region][category] = {
                'total': total,
                'count': count,
                'avg': total / count,
            }
    return result

Key points:

defaultdict(lambda: defaultdict(list)) creates a two-level auto-initializing structure. The outer factory is a lambda that returns a new defaultdict(list) for each missing region. The inner factory is list for each missing category.
Why lambda? You cannot write defaultdict(defaultdict(list)) -- that would call defaultdict(list) immediately and pass the resulting object (not a callable) as the factory. The lambda defers construction.
The final conversion to a plain sorted dict is important for deterministic output and for callers that expect a regular dict.
Time complexity: O(n) for accumulation + O(k log k) for sorting where k is the number of unique keys. Space is O(n) for storing all amounts.

Starter Code

from collections import defaultdict

def sales_summary(records: list) -> dict:
  """
  Given a list of (region, category, amount) tuples, return a nested dict:
  {region: {category: {'total': sum, 'count': n, 'avg': sum/n}}}

  Outer keys sorted alphabetically. Inner keys sorted alphabetically.

  Example:
      sales_summary([("West", "Tech", 100), ("West", "Tech", 200), ("East", "Food", 50)])
      -> {
          'East': {'Food': {'total': 50, 'count': 1, 'avg': 50.0}},
          'West': {'Tech': {'total': 300, 'count': 2, 'avg': 150.0}}
      }
  """
  # TODO: implement this
  pass

Expected Output

East:
  Clothing: total=1200, count=2, avg=600.00
  Electronics: total=2100, count=1, avg=2100.00
West:
  Clothing: total=600, count=1, avg=600.00
  Electronics: total=2400, count=2, avg=1200.00

Hints

Hint 1: Use defaultdict(lambda: defaultdict(list)) to build a nested structure where you can append amounts per region per category.

Hint 2: After collecting all amounts, compute the summary stats (total, count, avg) in a second pass over the nested structure.

#10Can Construct from Magazine (Counter Subtraction)Hard

countersubtractmultisetinterview-classic

Classic interview problem: determine if a ransom note can be built from a magazine's letters. Then extend it to report exactly which letters are missing and how many.

Python

from collections import Counter

def can_construct(ransom: str, magazine: str) -> bool:
    ransom_counts = Counter(ransom)
    magazine_counts = Counter(magazine)
    # If subtracting magazine from ransom leaves nothing, all letters are available
    return not (ransom_counts - magazine_counts)

def min_extra_letters(ransom: str, magazine: str) -> dict:
    deficit = Counter(ransom) - Counter(magazine)
    return dict(deficit)

# Test cases
print(can_construct("hello", "hheelllloo world"))
print(can_construct("aab", "ab"))
print(min_extra_letters("hello", "hheelllloo world"))
print(min_extra_letters("aab", "ab"))
print(can_construct("secret", "cer te"))
print(min_extra_letters("secret", "cer te"))

Solution

from collections import Counter

def can_construct(ransom: str, magazine: str) -> bool:
    ransom_counts = Counter(ransom)
    magazine_counts = Counter(magazine)
    return not (ransom_counts - magazine_counts)

def min_extra_letters(ransom: str, magazine: str) -> dict:
    deficit = Counter(ransom) - Counter(magazine)
    return dict(deficit)

Why this works:

ransom_counts - magazine_counts produces a Counter with only the letters where ransom needs more than magazine provides. If this is empty (falsy), construction is possible.
The - operator automatically drops zero and negative results, so the remaining elements are exactly the deficit.
Time complexity: O(n + m) where n is ransom length and m is magazine length. Counter construction is O(n) and O(m); subtraction is O(k) where k is unique characters.
This is LeetCode #383 (Ransom Note). The Counter approach is the cleanest Python solution.

Starter Code

from collections import Counter

def can_construct(ransom: str, magazine: str) -> bool:
  """
  Return True if ransom note can be constructed from letters in magazine.
  Each letter in magazine can only be used once.
  Case-sensitive.

  Example:
      can_construct("hello", "hheelllloo world") -> True
      can_construct("aab", "ab") -> False
  """
  # TODO: implement this
  pass

def min_extra_letters(ransom: str, magazine: str) -> dict:
  """
  Return a dict of letters that are missing (and how many of each)
  to construct the ransom note from the magazine.
  Return empty dict if construction is possible.

  Example:
      min_extra_letters("aab", "ab") -> {'a': 1}
  """
  # TODO: implement this
  pass

Expected Output

True
False
{}
{'a': 1}
False
{'s': 1, 'e': 1}

Hints

Hint 1: Counter subtraction is the core operation here. If ransom_counter - magazine_counter is empty, all letters are available.

Hint 2: For min_extra_letters, use Counter.subtract() to keep the negative/zero values, then filter for positive values in the ransom counter after subtraction. Or simply use the - operator on (ransom - magazine) to get only the deficit.

#11Build a Co-occurrence Matrix with defaultdict(Counter)Hard

defaultdictcounternestednlpco-occurrence

Build a word co-occurrence matrix using defaultdict(Counter). For each word in each sentence, count how many times every other word appears within a sliding window.

Python

from collections import defaultdict, Counter

def cooccurrence(sentences: list, window: int = 2) -> dict:
    cooccur = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if i != j:
                    cooccur[word][words[j]] += 1

    result = {}
    for word in sorted(cooccur):
        result[word] = cooccur[word].most_common(3)
    return result

# Test
sentences = ["the cat sat", "the cat ran"]
result = cooccurrence(sentences, window=1)
for word, neighbors in result.items():
    print(word + ": " + str(neighbors))

Solution

from collections import defaultdict, Counter

def cooccurrence(sentences: list, window: int = 2) -> dict:
    cooccur = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if i != j:
                    cooccur[word][words[j]] += 1

    result = {}
    for word in sorted(cooccur):
        result[word] = cooccur[word].most_common(3)
    return result

Key points:

defaultdict(Counter) is the perfect structure here: the outer dict maps each word to a Counter that tracks neighbor frequencies. No initialization needed at either level.
The sliding window [max(0, i-window), min(len, i+window+1)) handles sentence boundaries gracefully.
most_common(3) per word uses heapq.nlargest internally -- efficient when vocabulary is large but you only need top neighbors.
Time complexity: O(n * w) where n is total words across all sentences and w is the window size. Each word examines at most 2*w neighbors.
Space complexity: O(V^2) in the worst case where V is vocabulary size, since any pair of words could co-occur. In practice, the window limits this significantly.
This pattern is foundational in NLP -- word2vec's skip-gram model is essentially trained on a co-occurrence matrix like this.

Starter Code

from collections import defaultdict, Counter

def cooccurrence(sentences: list, window: int = 2) -> dict:
  """
  Build a word co-occurrence matrix from a list of sentences.
  For each word, count how many times every other word appears
  within 'window' positions of it.

  Return: dict mapping word -> list of (neighbor_word, count)
          sorted by count descending, limited to top 3 per word.
  Only include words that appear as a center word at least once.
  Sort the outer keys alphabetically.

  Example:
      cooccurrence(["the cat sat", "the cat ran"], window=1)
      -> {
          'cat': [('the', 2), ('sat', 1), ('ran', 1)],
          'ran': [('cat', 1)],
          'sat': [('cat', 1)],
          'the': [('cat', 2)]
      }
  """
  # TODO: implement this
  pass

Expected Output

cat: [('the', 2), ('sat', 1), ('ran', 1)]
ran: [('cat', 1)]
sat: [('cat', 1)]
the: [('cat', 2)]

Hints

Hint 1: Use defaultdict(Counter) -- each word maps to a Counter of its neighbors. For each word at position i, iterate over positions max(0, i-window) to min(len, i+window+1), skipping i itself.

Hint 2: After building the co-occurrence data, use counter.most_common(3) to get the top 3 neighbors per word.

Practice: Counter and Defaultdict

Easy​

Medium​

Hard​

Easy

Medium

Hard