Hash Maps and Sets
Hash table internals, collision handling, frequency counting, grouping patterns, two-sum variations, and set operations - with ML-relevant practice problems and full Python solutions.
Reading time: ~35 min | Interview relevance: Very High | Roles: All AI/ML roles
The Real Interview Moment
You are in an on-site interview at Google. The interviewer, a senior MLE on the recommendations team, writes a problem on the whiteboard: "Given a list of feature names used across multiple models in our system, find all features that appear in exactly K models. The list contains (feature_name, model_id) pairs, and there can be millions of entries."
Your first instinct might be to sort the list by feature name, then scan for groups of size K. That works - O(n log n) time. But the interviewer is watching your face, waiting to see if you reach for the more natural tool.
You say: "I will use a hash map. First, I will build a mapping from each feature name to a set of model IDs. Then I will iterate through the map and collect features where the set size equals K. Total time O(n), space O(n)."
The interviewer nods. "Good. Can you code that up?"
You write it in four minutes. Clean, correct, handles edge cases. The interviewer moves to the follow-up: "What if the data does not fit in memory?" Now you are in system design territory, discussing distributed hash maps and MapReduce. But you earned that conversation by nailing the fundamentals first.
Hash maps are the single most versatile data structure in coding interviews. This chapter makes them second nature.
Hash Table Internals
Understanding how hash tables work under the hood helps you reason about performance and answer follow-up questions confidently.
How a Hash Table Works
Step by step:
- Hash the key - Convert the key into an integer using a hash function
- Compute the index - Take the hash modulo the table size:
index = hash(key) % table_size - Store in bucket - Place the (key, value) pair at that index
- Handle collisions - When two keys map to the same index
Collision Handling Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Chaining | Each bucket holds a linked list of entries | Simple, never "full" | Cache-unfriendly, extra memory for pointers |
| Open Addressing (linear probing) | If bucket occupied, try next bucket | Cache-friendly, no extra pointers | Clustering, degrades as load factor rises |
| Open Addressing (quadratic probing) | Try bucket + 1, + 4, + 9, ... | Less clustering than linear | Can fail to find empty bucket |
| Robin Hood Hashing | Steal from "rich" (close to home) entries | Low variance in probe lengths | Complex implementation |
Python's dict uses open addressing with a custom probing sequence. CPython uses a perturbation-based scheme that mixes in higher bits of the hash.
When an interviewer asks "What is the worst-case time complexity of a hash table lookup?", the answer is O(n) - if all keys hash to the same bucket, it degenerates to a linear scan. But emphasize that this is astronomically unlikely with a good hash function, and the expected (average-case) complexity is O(1).
Load Factor and Resizing
# Conceptual implementation of hash table resizing
class SimpleHashMap:
def __init__(self, initial_size=8):
self.size = initial_size
self.count = 0
self.buckets = [[] for _ in range(self.size)]
self.load_factor_threshold = 0.75
def _hash(self, key):
return hash(key) % self.size
def put(self, key, value):
idx = self._hash(key)
# Check if key exists and update
for i, (k, v) in enumerate(self.buckets[idx]):
if k == key:
self.buckets[idx][i] = (key, value)
return
self.buckets[idx].append((key, value))
self.count += 1
# Resize if load factor exceeds threshold
if self.count / self.size > self.load_factor_threshold:
self._resize()
def get(self, key):
idx = self._hash(key)
for k, v in self.buckets[idx]:
if k == key:
return v
raise KeyError(key)
def _resize(self):
"""Double the table size and rehash all entries."""
old_buckets = self.buckets
self.size *= 2
self.buckets = [[] for _ in range(self.size)]
self.count = 0
for bucket in old_buckets:
for key, value in bucket:
self.put(key, value)
A common interview mistake is claiming hash table operations are always O(1). They are O(1) on average (amortized). The worst case is O(n) due to collisions, and resizing is O(n) when it happens. Always qualify with "average case" or "amortized."
What Can Be a Hash Key in Python?
This is a surprisingly common interview stumbling point:
| Type | Hashable? | Why / Why Not |
|---|---|---|
int | Yes | Immutable |
float | Yes | Immutable (but beware: hash(1) == hash(1.0)) |
str | Yes | Immutable |
tuple | Yes (if contents hashable) | Immutable container |
frozenset | Yes | Immutable set |
list | No | Mutable - hash would change on modification |
dict | No | Mutable |
set | No | Mutable - use frozenset instead |
| Custom object | Yes (default) | Uses id() by default; override __hash__ and __eq__ |
# Converting unhashable types to hashable equivalents
list_key = [1, 2, 3]
# dict_map[list_key] = "value" # TypeError!
dict_map = {}
dict_map[tuple(list_key)] = "value" # Works - tuple is hashable
set_key = {1, 2, 3}
# dict_map[set_key] = "value" # TypeError!
dict_map[frozenset(set_key)] = "value" # Works - frozenset is hashable
Pattern 1: Frequency Counting
The most common hash map pattern in ML interviews. Count occurrences of elements.
from collections import Counter
# Basic frequency counting
def word_frequencies(text):
"""Count word frequencies in text.
ML context: Building vocabulary for NLP models,
computing term frequencies for TF-IDF.
Time: O(n), Space: O(unique words)
"""
words = text.lower().split()
return Counter(words)
# Top-K frequent elements
def top_k_frequent(nums, k):
"""Return the K most frequent elements.
ML context: Finding the most common features, labels, or tokens.
Time: O(n) using bucket sort approach
Space: O(n)
"""
counts = Counter(nums)
# Bucket sort: index = frequency, value = list of elements with that frequency
max_freq = max(counts.values())
buckets = [[] for _ in range(max_freq + 1)]
for num, freq in counts.items():
buckets[freq].append(num)
# Collect top-K from highest frequency buckets
result = []
for freq in range(max_freq, 0, -1):
for num in buckets[freq]:
result.append(num)
if len(result) == k:
return result
return result
# Note: Counter.most_common(k) uses a heap internally - O(n + k log n)
# The bucket sort approach above is O(n) when k is small relative to n
At Meta, "top-K frequent" is one of the most commonly asked problems across all roles, including ML. Expect it as a warm-up question. At Google, the follow-up is usually "What if the data is streaming?" - which requires a Count-Min Sketch or similar probabilistic data structure.
Frequency-Based Grouping
from collections import defaultdict
def group_anagrams(words):
"""Group words that are anagrams of each other.
ML context: Finding near-duplicate text entries in a dataset,
grouping similar feature names.
Key insight: Two words are anagrams if their sorted characters match.
Use sorted characters as the hash key.
Time: O(n * k log k) where k = max word length
Space: O(n * k)
"""
groups = defaultdict(list)
for word in words:
key = tuple(sorted(word)) # Anagrams share the same sorted form
groups[key].append(word)
return list(groups.values())
# Optimization: use character count tuple as key instead of sorting
def group_anagrams_optimal(words):
"""O(n * k) using character counts instead of sorting.
Time: O(n * k) where k = max word length
Space: O(n * k)
"""
groups = defaultdict(list)
for word in words:
# Create a tuple of character counts (26 letters)
count = [0] * 26
for char in word:
count[ord(char) - ord('a')] += 1
groups[tuple(count)].append(word)
return list(groups.values())
Pattern 2: Two-Sum and Its Variants
The "two-sum" pattern is the most iconic hash map problem. Understanding its variants prepares you for a wide range of questions.
# Classic Two Sum
def two_sum(nums, target):
"""Find indices of two numbers that sum to target.
Time: O(n), Space: O(n)
"""
seen = {} # value -> index
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return []
# Three Sum (reduces to Two Sum)
def three_sum(nums):
"""Find all unique triplets that sum to zero.
Time: O(n^2), Space: O(1) extra (excluding output)
"""
nums.sort()
result = []
for i in range(len(nums) - 2):
# Skip duplicates for first element
if i > 0 and nums[i] == nums[i - 1]:
continue
# Two-pointer approach for remaining two elements
left, right = i + 1, len(nums) - 1
while left < right:
total = nums[i] + nums[left] + nums[right]
if total < 0:
left += 1
elif total > 0:
right -= 1
else:
result.append([nums[i], nums[left], nums[right]])
# Skip duplicates
while left < right and nums[left] == nums[left + 1]:
left += 1
while left < right and nums[right] == nums[right - 1]:
right -= 1
left += 1
right -= 1
return result
# Four Sum - generalization
def four_sum(nums, target):
"""Find all unique quadruplets that sum to target.
Time: O(n^3), Space: O(1) extra
"""
nums.sort()
result = []
n = len(nums)
for i in range(n - 3):
if i > 0 and nums[i] == nums[i - 1]:
continue
for j in range(i + 1, n - 2):
if j > i + 1 and nums[j] == nums[j - 1]:
continue
left, right = j + 1, n - 1
while left < right:
total = nums[i] + nums[j] + nums[left] + nums[right]
if total < target:
left += 1
elif total > target:
right -= 1
else:
result.append([nums[i], nums[j], nums[left], nums[right]])
while left < right and nums[left] == nums[left + 1]:
left += 1
while left < right and nums[right] == nums[right - 1]:
right -= 1
left += 1
right -= 1
return result
Two-Sum Variants Comparison
| Variant | Approach | Time | Space |
|---|---|---|---|
| Two Sum (unsorted) | Hash map | O(n) | O(n) |
| Two Sum (sorted) | Two pointers | O(n) | O(1) |
| Two Sum (BST) | Inorder + two pointers | O(n) | O(n) for inorder |
| Three Sum | Sort + two pointers | O(n^2) | O(1) |
| Four Sum | Sort + nested + two pointers | O(n^3) | O(1) |
| K Sum (general) | Recursion + two pointers | O(n^(k-1)) | O(k) |
Pattern 3: Hash Map for Lookup Optimization
The core value of hash maps: turning O(n) lookups into O(1).
# Pattern: Check if complement/partner exists
def find_pairs_with_difference(arr, k):
"""Find all pairs with difference exactly k.
ML context: Finding feature pairs whose correlation
differs by exactly a threshold.
Time: O(n), Space: O(n)
"""
if k < 0:
return []
num_set = set(arr)
pairs = []
for num in arr:
if num + k in num_set:
pairs.append((num, num + k))
# Remove duplicates if needed
return list(set(pairs))
# Pattern: Hash map as cache / memoization
def is_isomorphic(s, t):
"""Check if two strings are isomorphic (same structure).
ML context: Checking if two feature encoding schemes
have the same structural mapping.
Two strings are isomorphic if characters in s can be replaced
to get t, with no two characters mapping to the same character.
Time: O(n), Space: O(n)
"""
if len(s) != len(t):
return False
s_to_t = {}
t_to_s = {}
for cs, ct in zip(s, t):
if cs in s_to_t:
if s_to_t[cs] != ct:
return False
else:
if ct in t_to_s:
return False
s_to_t[cs] = ct
t_to_s[ct] = cs
return True
# Example: is_isomorphic("egg", "add") -> True
# is_isomorphic("foo", "bar") -> False
Pattern 4: Set Operations
Sets provide O(1) membership testing and support mathematical set operations.
# Set operations comparison
def set_operations_demo():
"""Demonstrate Python set operations and their complexity."""
features_model_a = {'age', 'income', 'education', 'city', 'gender'}
features_model_b = {'age', 'income', 'credit_score', 'employment', 'city'}
# Intersection: features used by both models - O(min(n, m))
shared = features_model_a & features_model_b
# {'age', 'income', 'city'}
# Union: all features across both models - O(n + m)
all_features = features_model_a | features_model_b
# {'age', 'income', 'education', 'city', 'gender', 'credit_score', 'employment'}
# Difference: features unique to model A - O(n)
unique_to_a = features_model_a - features_model_b
# {'education', 'gender'}
# Symmetric difference: features in exactly one model - O(n + m)
exclusive = features_model_a ^ features_model_b
# {'education', 'gender', 'credit_score', 'employment'}
# Subset check - O(n)
is_subset = features_model_a <= all_features # True
return shared, all_features, unique_to_a, exclusive
Set for Deduplication
def deduplicate_features(feature_lists):
"""Remove duplicate features across multiple feature lists,
preserving the order of first occurrence.
ML context: Merging feature sets from different data sources
while maintaining a consistent ordering.
Time: O(total features), Space: O(unique features)
"""
seen = set()
result = []
for feature_list in feature_lists:
for feature in feature_list:
if feature not in seen:
seen.add(feature)
result.append(feature)
return result
# Example:
# lists = [['age', 'income', 'city'],
# ['income', 'education', 'age'],
# ['city', 'gender', 'income']]
# Result: ['age', 'income', 'city', 'education', 'gender']
Pattern 5: Hash Map for Caching (LRU Cache)
LRU (Least Recently Used) Cache is a classic interview problem that combines a hash map with a doubly-linked list.
from collections import OrderedDict
class LRUCache:
"""Least Recently Used Cache.
ML context: Caching model predictions for repeated inputs,
caching feature computations, embedding lookups.
All operations O(1).
"""
def __init__(self, capacity):
self.capacity = capacity
self.cache = OrderedDict()
def get(self, key):
"""Get value and mark as recently used."""
if key not in self.cache:
return -1
self.cache.move_to_end(key) # Mark as recently used
return self.cache[key]
def put(self, key, value):
"""Add/update entry. Evict LRU if over capacity."""
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
self.cache.popitem(last=False) # Remove least recently used
# From-scratch implementation with doubly-linked list
class DNode:
def __init__(self, key=0, val=0):
self.key = key
self.val = val
self.prev = None
self.next = None
class LRUCacheScratch:
"""LRU Cache without OrderedDict - pure hash map + doubly linked list."""
def __init__(self, capacity):
self.capacity = capacity
self.cache = {} # key -> DNode
# Sentinel nodes to avoid null checks
self.head = DNode() # Dummy head (most recently used end)
self.tail = DNode() # Dummy tail (least recently used end)
self.head.next = self.tail
self.tail.prev = self.head
def _remove(self, node):
"""Remove node from doubly linked list."""
node.prev.next = node.next
node.next.prev = node.prev
def _add_to_front(self, node):
"""Add node right after head (most recently used position)."""
node.next = self.head.next
node.prev = self.head
self.head.next.prev = node
self.head.next = node
def get(self, key):
if key not in self.cache:
return -1
node = self.cache[key]
self._remove(node)
self._add_to_front(node)
return node.val
def put(self, key, value):
if key in self.cache:
self._remove(self.cache[key])
node = DNode(key, value)
self.cache[key] = node
self._add_to_front(node)
if len(self.cache) > self.capacity:
lru = self.tail.prev
self._remove(lru)
del self.cache[lru.key]
If you implement LRU Cache with a list and linear search (O(n) get/put), the interviewer will reject it. The entire point is O(1) for both operations. You must combine a hash map (O(1) lookup) with a doubly-linked list (O(1) insert/remove). Using Python's OrderedDict is acceptable but be prepared to explain how it works internally.
Pattern 6: Hash Map for Subarray Problems
def longest_subarray_with_k_distinct(arr, k):
"""Find the length of the longest subarray with at most k distinct elements.
ML context: Finding the longest contiguous sequence in a dataset
where at most K different categories appear.
Uses sliding window + hash map to track element frequencies.
Time: O(n), Space: O(k)
"""
from collections import defaultdict
char_count = defaultdict(int)
left = 0
max_length = 0
for right in range(len(arr)):
char_count[arr[right]] += 1
# Shrink window until we have at most k distinct elements
while len(char_count) > k:
char_count[arr[left]] -= 1
if char_count[arr[left]] == 0:
del char_count[arr[left]]
left += 1
max_length = max(max_length, right - left + 1)
return max_length
def subarrays_with_exactly_k_distinct(arr, k):
"""Count subarrays with exactly k distinct elements.
Trick: exactly(k) = at_most(k) - at_most(k-1)
Time: O(n), Space: O(k)
"""
def at_most_k(arr, k):
from collections import defaultdict
count = defaultdict(int)
left = 0
result = 0
for right in range(len(arr)):
count[arr[right]] += 1
while len(count) > k:
count[arr[left]] -= 1
if count[arr[left]] == 0:
del count[arr[left]]
left += 1
result += right - left + 1
return result
return at_most_k(arr, k) - at_most_k(arr, k - 1)
ML-Specific Hash Map Applications
Building a Vocabulary
def build_vocabulary(documents, min_freq=2, max_vocab_size=10000):
"""Build a vocabulary from a list of documents.
ML context: Creating token-to-index mapping for NLP models.
Time: O(total tokens), Space: O(unique tokens)
"""
from collections import Counter
# Count all tokens
token_counts = Counter()
for doc in documents:
tokens = doc.lower().split()
token_counts.update(tokens)
# Filter by minimum frequency
filtered = {token: count for token, count in token_counts.items()
if count >= min_freq}
# Sort by frequency (descending) and take top max_vocab_size
sorted_tokens = sorted(filtered.keys(),
key=lambda t: filtered[t],
reverse=True)[:max_vocab_size]
# Create token-to-index mapping
vocab = {'<PAD>': 0, '<UNK>': 1} # Special tokens
for i, token in enumerate(sorted_tokens):
vocab[token] = i + 2 # Offset by special tokens
return vocab
def encode_text(text, vocab):
"""Encode text using vocabulary mapping.
Time: O(n), Space: O(n) where n = number of tokens
"""
tokens = text.lower().split()
return [vocab.get(token, vocab['<UNK>']) for token in tokens]
Feature Hashing (Hashing Trick)
def feature_hash(features, n_buckets=1024):
"""Hash features to a fixed-size vector.
ML context: The hashing trick - map high-dimensional categorical
features to a fixed-size vector without maintaining a vocabulary.
Used in Vowpal Wabbit, scikit-learn's HashingVectorizer.
Time: O(number of features), Space: O(n_buckets)
"""
vector = [0.0] * n_buckets
for feature_name, value in features.items():
# Hash feature name to get bucket index
bucket = hash(feature_name) % n_buckets
# Use sign hash to reduce collisions
sign = 1 if hash(feature_name + '_sign') % 2 == 0 else -1
vector[bucket] += sign * value
return vector
# Example:
# features = {'user_age': 25, 'user_city_NYC': 1, 'user_premium': 1}
# vector = feature_hash(features, n_buckets=8)
# Result: [0, 25, 0, -1, 1, 0, 0, 0] (example - actual values depend on hash function)
Embedding Lookup Table
class EmbeddingTable:
"""Simple embedding lookup using a hash map.
ML context: This is conceptually how nn.Embedding works -
a mapping from integer indices to dense vectors.
Time: O(1) lookup, O(n * d) initialization
Space: O(n * d) where n = vocab size, d = embedding dim
"""
def __init__(self, vocab_size, embedding_dim, seed=42):
import random
random.seed(seed)
self.embeddings = {}
for i in range(vocab_size):
self.embeddings[i] = [
random.gauss(0, 0.01) for _ in range(embedding_dim)
]
def lookup(self, indices):
"""Look up embeddings for a list of indices."""
return [self.embeddings.get(idx, self.embeddings[0]) for idx in indices]
def lookup_batch(self, batch_of_indices):
"""Look up embeddings for a batch of sequences."""
return [[self.embeddings.get(idx, self.embeddings[0])
for idx in seq] for seq in batch_of_indices]
Practice Problems
Problem 1: Feature Deduplication Across Models
Given a list of (feature_name, model_name) tuples, find features that are used in all models.
# Input: pairs = [("age", "modelA"), ("income", "modelA"), ("age", "modelB"),
# ("city", "modelB"), ("age", "modelC"), ("income", "modelC")]
# total_models = 3
# Output: ["age"]
# Explanation: "age" appears in all 3 models
Hint 1
Hint 2
Hint 3
Solution
from collections import defaultdict
def features_in_all_models(pairs, total_models):
"""Find features present in every model.
Time: O(n), Space: O(n)
"""
feature_to_models = defaultdict(set)
for feature, model in pairs:
feature_to_models[feature].add(model)
return [feature for feature, models in feature_to_models.items()
if len(models) == total_models]
Problem 2: Minimum Window Substring
Given a string S and a string T, find the minimum window in S that contains all characters of T.
# Input: s = "ADOBECODEBANC", t = "ABC"
# Output: "BANC"
Hint 1
Hint 2
Hint 3
Solution
from collections import Counter
def min_window_substring(s, t):
"""Find minimum window in s containing all characters of t.
Time: O(|S| + |T|), Space: O(|S| + |T|)
"""
if not s or not t or len(s) < len(t):
return ""
required = Counter(t)
needed = len(required) # Number of unique chars needed
formed = 0 # Number of unique chars satisfied
window_counts = {}
min_len = float('inf')
min_start = 0
left = 0
for right in range(len(s)):
char = s[right]
window_counts[char] = window_counts.get(char, 0) + 1
# Check if current character's frequency is satisfied
if char in required and window_counts[char] == required[char]:
formed += 1
# Try to shrink window
while formed == needed:
# Update minimum window
if right - left + 1 < min_len:
min_len = right - left + 1
min_start = left
# Remove left character
left_char = s[left]
window_counts[left_char] -= 1
if left_char in required and window_counts[left_char] < required[left_char]:
formed -= 1
left += 1
return s[min_start:min_start + min_len] if min_len != float('inf') else ""
Problem 3: Subarray Sum Equals K
Count the number of contiguous subarrays whose sum equals K.
# Input: nums = [1, 1, 1], k = 2
# Output: 2 (subarrays [1,1] at index 0-1 and 1-2)
Hint 1
Hint 2
Hint 3
Solution
def subarray_sum(nums, k):
"""Count subarrays with sum equal to k using prefix sum + hash map.
Time: O(n), Space: O(n)
"""
count = 0
prefix = 0
prefix_counts = {0: 1}
for num in nums:
prefix += num
count += prefix_counts.get(prefix - k, 0)
prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1
return count
Problem 4: Valid Sudoku
Determine if a 9x9 Sudoku board is valid (only check filled cells).
# ML relevance: Constraint satisfaction - similar to checking that
# a feature matrix satisfies uniqueness constraints per category.
Hint 1
Hint 2
Hint 3
Solution
def is_valid_sudoku(board):
"""Check if a Sudoku board is valid.
Time: O(81) = O(1) since board size is fixed
Space: O(81) = O(1)
"""
rows = [set() for _ in range(9)]
cols = [set() for _ in range(9)]
boxes = [set() for _ in range(9)]
for r in range(9):
for c in range(9):
val = board[r][c]
if val == '.':
continue
box_idx = (r // 3) * 3 + (c // 3)
if val in rows[r] or val in cols[c] or val in boxes[box_idx]:
return False
rows[r].add(val)
cols[c].add(val)
boxes[box_idx].add(val)
return True
Problem 5: Longest Consecutive Sequence
Given an unsorted array of integers, find the length of the longest consecutive elements sequence in O(n) time.
# Input: [100, 4, 200, 1, 3, 2]
# Output: 4 (sequence [1, 2, 3, 4])
Hint 1
Hint 2
Hint 3
Solution
def longest_consecutive(nums):
"""Find longest consecutive sequence using a set.
Time: O(n), Space: O(n)
"""
if not nums:
return 0
num_set = set(nums)
max_length = 0
for num in num_set:
# Only start counting from the beginning of a sequence
if num - 1 not in num_set:
current = num
length = 1
while current + 1 in num_set:
current += 1
length += 1
max_length = max(max_length, length)
return max_length
Problem 6: Design a Time-Based Key-Value Store
Create a data structure that stores (key, value, timestamp) and supports:
set(key, value, timestamp)- store the value for the given key at the given timestampget(key, timestamp)- return the value associated with the largest timestamp <= given timestamp
# ML context: A feature store that tracks feature values over time.
# "What was the user's credit score at inference time T?"
Hint 1
Hint 2
Hint 3
Solution
from collections import defaultdict
import bisect
class TimeMap:
"""Time-based key-value store.
set: O(1) amortized
get: O(log n) where n = number of timestamps for that key
"""
def __init__(self):
self.store = defaultdict(list) # key -> [(timestamp, value), ...]
def set(self, key, value, timestamp):
self.store[key].append((timestamp, value))
def get(self, key, timestamp):
if key not in self.store:
return ""
entries = self.store[key]
# Binary search for rightmost timestamp <= query timestamp
idx = bisect.bisect_right(entries, (timestamp, chr(127))) - 1
if idx < 0:
return ""
return entries[idx][1]
Interview Cheat Sheet
| Pattern | When to Use | Key Detail |
|---|---|---|
| Frequency counting | "Most common", "frequency of", "count" | Use Counter, bucket sort for top-K in O(n) |
| Two-Sum variants | "Two numbers that sum to", "pair with property" | Hash map for O(1) complement lookup |
| Grouping | "Group by property", "anagrams", "categorize" | defaultdict(list), choose the right key |
| Set membership | "Contains", "exists in", "unique" | set for O(1) in check |
| Set operations | "Common elements", "difference", "overlap" | & (intersection), | (union), - (difference) |
| Sliding window + hash | "Window with K distinct", "contains all chars" | Hash map tracks window contents |
| Prefix sum + hash | "Subarray sum equals K" | Map stores prefix sum frequencies |
| LRU Cache | "Eviction policy", "bounded cache" | OrderedDict or hash map + doubly linked list |
Spaced Repetition Checkpoints
| Day | Review Activity | Time |
|---|---|---|
| Day 0 | Read all patterns. Solve Two Sum, Group Anagrams, and Top K Frequent without looking. | 35 min |
| Day 3 | Implement LRU Cache from scratch (not using OrderedDict). | 20 min |
| Day 7 | Solve Minimum Window Substring and Subarray Sum Equals K from scratch. | 25 min |
| Day 14 | Explain hash table internals to someone: how hashing works, collision handling, load factor, and resizing. | 15 min |
| Day 21 | Time yourself on 3 hash map problems from LeetCode (Medium). Target: under 20 minutes each. | 60 min |
Next Steps
Hash maps give you O(1) lookups. Next, learn Trees and Graphs to handle hierarchical and network-structured data - from decision trees to ML pipeline DAGs.
