Hash Maps and Sets

Hash table internals, collision handling, frequency counting, grouping patterns, two-sum variations, and set operations - with ML-relevant practice problems and full Python solutions.

Reading time: ~35 min | Interview relevance: Very High | Roles: All AI/ML roles

The Real Interview Moment

You are in an on-site interview at Google. The interviewer, a senior MLE on the recommendations team, writes a problem on the whiteboard: "Given a list of feature names used across multiple models in our system, find all features that appear in exactly K models. The list contains (feature_name, model_id) pairs, and there can be millions of entries."

Your first instinct might be to sort the list by feature name, then scan for groups of size K. That works - O(n log n) time. But the interviewer is watching your face, waiting to see if you reach for the more natural tool.

You say: "I will use a hash map. First, I will build a mapping from each feature name to a set of model IDs. Then I will iterate through the map and collect features where the set size equals K. Total time O(n), space O(n)."

The interviewer nods. "Good. Can you code that up?"

You write it in four minutes. Clean, correct, handles edge cases. The interviewer moves to the follow-up: "What if the data does not fit in memory?" Now you are in system design territory, discussing distributed hash maps and MapReduce. But you earned that conversation by nailing the fundamentals first.

Hash maps are the single most versatile data structure in coding interviews. This chapter makes them second nature.

Hash Table Internals

Understanding how hash tables work under the hood helps you reason about performance and answer follow-up questions confidently.

How a Hash Table Works

Step by step:

Hash the key - Convert the key into an integer using a hash function
Compute the index - Take the hash modulo the table size: index = hash(key) % table_size
Store in bucket - Place the (key, value) pair at that index
Handle collisions - When two keys map to the same index

Collision Handling Strategies

Strategy	How It Works	Pros	Cons
Chaining	Each bucket holds a linked list of entries	Simple, never "full"	Cache-unfriendly, extra memory for pointers
Open Addressing (linear probing)	If bucket occupied, try next bucket	Cache-friendly, no extra pointers	Clustering, degrades as load factor rises
Open Addressing (quadratic probing)	Try bucket + 1, + 4, + 9, ...	Less clustering than linear	Can fail to find empty bucket
Robin Hood Hashing	Steal from "rich" (close to home) entries	Low variance in probe lengths	Complex implementation

Python's dict uses open addressing with a custom probing sequence. CPython uses a perturbation-based scheme that mixes in higher bits of the hash.

60-Second Answer

When an interviewer asks "What is the worst-case time complexity of a hash table lookup?", the answer is O(n) - if all keys hash to the same bucket, it degenerates to a linear scan. But emphasize that this is astronomically unlikely with a good hash function, and the expected (average-case) complexity is O(1).

Load Factor and Resizing

# Conceptual implementation of hash table resizing
class SimpleHashMap:
    def __init__(self, initial_size=8):
        self.size = initial_size
        self.count = 0
        self.buckets = [[] for _ in range(self.size)]
        self.load_factor_threshold = 0.75

    def _hash(self, key):
        return hash(key) % self.size

    def put(self, key, value):
        idx = self._hash(key)
        # Check if key exists and update
        for i, (k, v) in enumerate(self.buckets[idx]):
            if k == key:
                self.buckets[idx][i] = (key, value)
                return

        self.buckets[idx].append((key, value))
        self.count += 1

        # Resize if load factor exceeds threshold
        if self.count / self.size > self.load_factor_threshold:
            self._resize()

    def get(self, key):
        idx = self._hash(key)
        for k, v in self.buckets[idx]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        """Double the table size and rehash all entries."""
        old_buckets = self.buckets
        self.size *= 2
        self.buckets = [[] for _ in range(self.size)]
        self.count = 0
        for bucket in old_buckets:
            for key, value in bucket:
                self.put(key, value)

Common Trap

A common interview mistake is claiming hash table operations are always O(1). They are O(1) on average (amortized). The worst case is O(n) due to collisions, and resizing is O(n) when it happens. Always qualify with "average case" or "amortized."

What Can Be a Hash Key in Python?

This is a surprisingly common interview stumbling point:

Type	Hashable?	Why / Why Not
`int`	Yes	Immutable
`float`	Yes	Immutable (but beware: `hash(1) == hash(1.0)`)
`str`	Yes	Immutable
`tuple`	Yes (if contents hashable)	Immutable container
`frozenset`	Yes	Immutable set
`list`	No	Mutable - hash would change on modification
`dict`	No	Mutable
`set`	No	Mutable - use `frozenset` instead
Custom object	Yes (default)	Uses `id()` by default; override `__hash__` and `__eq__`

# Converting unhashable types to hashable equivalents
list_key = [1, 2, 3]
# dict_map[list_key] = "value"  # TypeError!
dict_map = {}
dict_map[tuple(list_key)] = "value"  # Works - tuple is hashable

set_key = {1, 2, 3}
# dict_map[set_key] = "value"  # TypeError!
dict_map[frozenset(set_key)] = "value"  # Works - frozenset is hashable

Pattern 1: Frequency Counting

The most common hash map pattern in ML interviews. Count occurrences of elements.

from collections import Counter

# Basic frequency counting
def word_frequencies(text):
    """Count word frequencies in text.

    ML context: Building vocabulary for NLP models,
    computing term frequencies for TF-IDF.

    Time: O(n), Space: O(unique words)
    """
    words = text.lower().split()
    return Counter(words)

# Top-K frequent elements
def top_k_frequent(nums, k):
    """Return the K most frequent elements.

    ML context: Finding the most common features, labels, or tokens.

    Time: O(n) using bucket sort approach
    Space: O(n)
    """
    counts = Counter(nums)

    # Bucket sort: index = frequency, value = list of elements with that frequency
    max_freq = max(counts.values())
    buckets = [[] for _ in range(max_freq + 1)]
    for num, freq in counts.items():
        buckets[freq].append(num)

    # Collect top-K from highest frequency buckets
    result = []
    for freq in range(max_freq, 0, -1):
        for num in buckets[freq]:
            result.append(num)
            if len(result) == k:
                return result
    return result

# Note: Counter.most_common(k) uses a heap internally - O(n + k log n)
# The bucket sort approach above is O(n) when k is small relative to n

Company Variation

At Meta, "top-K frequent" is one of the most commonly asked problems across all roles, including ML. Expect it as a warm-up question. At Google, the follow-up is usually "What if the data is streaming?" - which requires a Count-Min Sketch or similar probabilistic data structure.

Frequency-Based Grouping

from collections import defaultdict

def group_anagrams(words):
    """Group words that are anagrams of each other.

    ML context: Finding near-duplicate text entries in a dataset,
    grouping similar feature names.

    Key insight: Two words are anagrams if their sorted characters match.
    Use sorted characters as the hash key.

    Time: O(n * k log k) where k = max word length
    Space: O(n * k)
    """
    groups = defaultdict(list)
    for word in words:
        key = tuple(sorted(word))  # Anagrams share the same sorted form
        groups[key].append(word)
    return list(groups.values())

# Optimization: use character count tuple as key instead of sorting
def group_anagrams_optimal(words):
    """O(n * k) using character counts instead of sorting.

    Time: O(n * k) where k = max word length
    Space: O(n * k)
    """
    groups = defaultdict(list)
    for word in words:
        # Create a tuple of character counts (26 letters)
        count = [0] * 26
        for char in word:
            count[ord(char) - ord('a')] += 1
        groups[tuple(count)].append(word)
    return list(groups.values())

Pattern 2: Two-Sum and Its Variants

The "two-sum" pattern is the most iconic hash map problem. Understanding its variants prepares you for a wide range of questions.

# Classic Two Sum
def two_sum(nums, target):
    """Find indices of two numbers that sum to target.

    Time: O(n), Space: O(n)
    """
    seen = {}  # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []

# Three Sum (reduces to Two Sum)
def three_sum(nums):
    """Find all unique triplets that sum to zero.

    Time: O(n^2), Space: O(1) extra (excluding output)
    """
    nums.sort()
    result = []
    for i in range(len(nums) - 2):
        # Skip duplicates for first element
        if i > 0 and nums[i] == nums[i - 1]:
            continue
        # Two-pointer approach for remaining two elements
        left, right = i + 1, len(nums) - 1
        while left < right:
            total = nums[i] + nums[left] + nums[right]
            if total < 0:
                left += 1
            elif total > 0:
                right -= 1
            else:
                result.append([nums[i], nums[left], nums[right]])
                # Skip duplicates
                while left < right and nums[left] == nums[left + 1]:
                    left += 1
                while left < right and nums[right] == nums[right - 1]:
                    right -= 1
                left += 1
                right -= 1
    return result

# Four Sum - generalization
def four_sum(nums, target):
    """Find all unique quadruplets that sum to target.

    Time: O(n^3), Space: O(1) extra
    """
    nums.sort()
    result = []
    n = len(nums)
    for i in range(n - 3):
        if i > 0 and nums[i] == nums[i - 1]:
            continue
        for j in range(i + 1, n - 2):
            if j > i + 1 and nums[j] == nums[j - 1]:
                continue
            left, right = j + 1, n - 1
            while left < right:
                total = nums[i] + nums[j] + nums[left] + nums[right]
                if total < target:
                    left += 1
                elif total > target:
                    right -= 1
                else:
                    result.append([nums[i], nums[j], nums[left], nums[right]])
                    while left < right and nums[left] == nums[left + 1]:
                        left += 1
                    while left < right and nums[right] == nums[right - 1]:
                        right -= 1
                    left += 1
                    right -= 1
    return result

Two-Sum Variants Comparison

Variant	Approach	Time	Space
Two Sum (unsorted)	Hash map	O(n)	O(n)
Two Sum (sorted)	Two pointers	O(n)	O(1)
Two Sum (BST)	Inorder + two pointers	O(n)	O(n) for inorder
Three Sum	Sort + two pointers	O(n^2)	O(1)
Four Sum	Sort + nested + two pointers	O(n^3)	O(1)
K Sum (general)	Recursion + two pointers	O(n^(k-1))	O(k)

Pattern 3: Hash Map for Lookup Optimization

The core value of hash maps: turning O(n) lookups into O(1).

# Pattern: Check if complement/partner exists
def find_pairs_with_difference(arr, k):
    """Find all pairs with difference exactly k.

    ML context: Finding feature pairs whose correlation
    differs by exactly a threshold.

    Time: O(n), Space: O(n)
    """
    if k < 0:
        return []
    num_set = set(arr)
    pairs = []
    for num in arr:
        if num + k in num_set:
            pairs.append((num, num + k))
    # Remove duplicates if needed
    return list(set(pairs))

# Pattern: Hash map as cache / memoization
def is_isomorphic(s, t):
    """Check if two strings are isomorphic (same structure).

    ML context: Checking if two feature encoding schemes
    have the same structural mapping.

    Two strings are isomorphic if characters in s can be replaced
    to get t, with no two characters mapping to the same character.

    Time: O(n), Space: O(n)
    """
    if len(s) != len(t):
        return False
    s_to_t = {}
    t_to_s = {}
    for cs, ct in zip(s, t):
        if cs in s_to_t:
            if s_to_t[cs] != ct:
                return False
        else:
            if ct in t_to_s:
                return False
            s_to_t[cs] = ct
            t_to_s[ct] = cs
    return True

# Example: is_isomorphic("egg", "add") -> True
#          is_isomorphic("foo", "bar") -> False

Pattern 4: Set Operations

Sets provide O(1) membership testing and support mathematical set operations.

# Set operations comparison
def set_operations_demo():
    """Demonstrate Python set operations and their complexity."""
    features_model_a = {'age', 'income', 'education', 'city', 'gender'}
    features_model_b = {'age', 'income', 'credit_score', 'employment', 'city'}

    # Intersection: features used by both models - O(min(n, m))
    shared = features_model_a & features_model_b
    # {'age', 'income', 'city'}

    # Union: all features across both models - O(n + m)
    all_features = features_model_a | features_model_b
    # {'age', 'income', 'education', 'city', 'gender', 'credit_score', 'employment'}

    # Difference: features unique to model A - O(n)
    unique_to_a = features_model_a - features_model_b
    # {'education', 'gender'}

    # Symmetric difference: features in exactly one model - O(n + m)
    exclusive = features_model_a ^ features_model_b
    # {'education', 'gender', 'credit_score', 'employment'}

    # Subset check - O(n)
    is_subset = features_model_a <= all_features  # True

    return shared, all_features, unique_to_a, exclusive

Set for Deduplication

def deduplicate_features(feature_lists):
    """Remove duplicate features across multiple feature lists,
    preserving the order of first occurrence.

    ML context: Merging feature sets from different data sources
    while maintaining a consistent ordering.

    Time: O(total features), Space: O(unique features)
    """
    seen = set()
    result = []
    for feature_list in feature_lists:
        for feature in feature_list:
            if feature not in seen:
                seen.add(feature)
                result.append(feature)
    return result

# Example:
# lists = [['age', 'income', 'city'],
#          ['income', 'education', 'age'],
#          ['city', 'gender', 'income']]
# Result: ['age', 'income', 'city', 'education', 'gender']

Pattern 5: Hash Map for Caching (LRU Cache)

LRU (Least Recently Used) Cache is a classic interview problem that combines a hash map with a doubly-linked list.

from collections import OrderedDict

class LRUCache:
    """Least Recently Used Cache.

    ML context: Caching model predictions for repeated inputs,
    caching feature computations, embedding lookups.

    All operations O(1).
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        """Get value and mark as recently used."""
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)  # Mark as recently used
        return self.cache[key]

    def put(self, key, value):
        """Add/update entry. Evict LRU if over capacity."""
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # Remove least recently used

# From-scratch implementation with doubly-linked list
class DNode:
    def __init__(self, key=0, val=0):
        self.key = key
        self.val = val
        self.prev = None
        self.next = None

class LRUCacheScratch:
    """LRU Cache without OrderedDict - pure hash map + doubly linked list."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # key -> DNode
        # Sentinel nodes to avoid null checks
        self.head = DNode()  # Dummy head (most recently used end)
        self.tail = DNode()  # Dummy tail (least recently used end)
        self.head.next = self.tail
        self.tail.prev = self.head

    def _remove(self, node):
        """Remove node from doubly linked list."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add_to_front(self, node):
        """Add node right after head (most recently used position)."""
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._add_to_front(node)
        return node.val

    def put(self, key, value):
        if key in self.cache:
            self._remove(self.cache[key])
        node = DNode(key, value)
        self.cache[key] = node
        self._add_to_front(node)
        if len(self.cache) > self.capacity:
            lru = self.tail.prev
            self._remove(lru)
            del self.cache[lru.key]

Instant Rejection

If you implement LRU Cache with a list and linear search (O(n) get/put), the interviewer will reject it. The entire point is O(1) for both operations. You must combine a hash map (O(1) lookup) with a doubly-linked list (O(1) insert/remove). Using Python's OrderedDict is acceptable but be prepared to explain how it works internally.

Pattern 6: Hash Map for Subarray Problems

def longest_subarray_with_k_distinct(arr, k):
    """Find the length of the longest subarray with at most k distinct elements.

    ML context: Finding the longest contiguous sequence in a dataset
    where at most K different categories appear.

    Uses sliding window + hash map to track element frequencies.

    Time: O(n), Space: O(k)
    """
    from collections import defaultdict

    char_count = defaultdict(int)
    left = 0
    max_length = 0

    for right in range(len(arr)):
        char_count[arr[right]] += 1

        # Shrink window until we have at most k distinct elements
        while len(char_count) > k:
            char_count[arr[left]] -= 1
            if char_count[arr[left]] == 0:
                del char_count[arr[left]]
            left += 1

        max_length = max(max_length, right - left + 1)

    return max_length

def subarrays_with_exactly_k_distinct(arr, k):
    """Count subarrays with exactly k distinct elements.

    Trick: exactly(k) = at_most(k) - at_most(k-1)

    Time: O(n), Space: O(k)
    """
    def at_most_k(arr, k):
        from collections import defaultdict
        count = defaultdict(int)
        left = 0
        result = 0
        for right in range(len(arr)):
            count[arr[right]] += 1
            while len(count) > k:
                count[arr[left]] -= 1
                if count[arr[left]] == 0:
                    del count[arr[left]]
                left += 1
            result += right - left + 1
        return result

    return at_most_k(arr, k) - at_most_k(arr, k - 1)

ML-Specific Hash Map Applications

Building a Vocabulary

def build_vocabulary(documents, min_freq=2, max_vocab_size=10000):
    """Build a vocabulary from a list of documents.

    ML context: Creating token-to-index mapping for NLP models.

    Time: O(total tokens), Space: O(unique tokens)
    """
    from collections import Counter

    # Count all tokens
    token_counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        token_counts.update(tokens)

    # Filter by minimum frequency
    filtered = {token: count for token, count in token_counts.items()
                if count >= min_freq}

    # Sort by frequency (descending) and take top max_vocab_size
    sorted_tokens = sorted(filtered.keys(),
                          key=lambda t: filtered[t],
                          reverse=True)[:max_vocab_size]

    # Create token-to-index mapping
    vocab = {'<PAD>': 0, '<UNK>': 1}  # Special tokens
    for i, token in enumerate(sorted_tokens):
        vocab[token] = i + 2  # Offset by special tokens

    return vocab

def encode_text(text, vocab):
    """Encode text using vocabulary mapping.

    Time: O(n), Space: O(n) where n = number of tokens
    """
    tokens = text.lower().split()
    return [vocab.get(token, vocab['<UNK>']) for token in tokens]

Feature Hashing (Hashing Trick)

def feature_hash(features, n_buckets=1024):
    """Hash features to a fixed-size vector.

    ML context: The hashing trick - map high-dimensional categorical
    features to a fixed-size vector without maintaining a vocabulary.
    Used in Vowpal Wabbit, scikit-learn's HashingVectorizer.

    Time: O(number of features), Space: O(n_buckets)
    """
    vector = [0.0] * n_buckets
    for feature_name, value in features.items():
        # Hash feature name to get bucket index
        bucket = hash(feature_name) % n_buckets
        # Use sign hash to reduce collisions
        sign = 1 if hash(feature_name + '_sign') % 2 == 0 else -1
        vector[bucket] += sign * value
    return vector

# Example:
# features = {'user_age': 25, 'user_city_NYC': 1, 'user_premium': 1}
# vector = feature_hash(features, n_buckets=8)
# Result: [0, 25, 0, -1, 1, 0, 0, 0] (example - actual values depend on hash function)

Embedding Lookup Table

class EmbeddingTable:
    """Simple embedding lookup using a hash map.

    ML context: This is conceptually how nn.Embedding works -
    a mapping from integer indices to dense vectors.

    Time: O(1) lookup, O(n * d) initialization
    Space: O(n * d) where n = vocab size, d = embedding dim
    """
    def __init__(self, vocab_size, embedding_dim, seed=42):
        import random
        random.seed(seed)
        self.embeddings = {}
        for i in range(vocab_size):
            self.embeddings[i] = [
                random.gauss(0, 0.01) for _ in range(embedding_dim)
            ]

    def lookup(self, indices):
        """Look up embeddings for a list of indices."""
        return [self.embeddings.get(idx, self.embeddings[0]) for idx in indices]

    def lookup_batch(self, batch_of_indices):
        """Look up embeddings for a batch of sequences."""
        return [[self.embeddings.get(idx, self.embeddings[0])
                 for idx in seq] for seq in batch_of_indices]

Practice Problems

Problem 1: Feature Deduplication Across Models

Given a list of (feature_name, model_name) tuples, find features that are used in all models.

# Input:  pairs = [("age", "modelA"), ("income", "modelA"), ("age", "modelB"),
#                  ("city", "modelB"), ("age", "modelC"), ("income", "modelC")]
#         total_models = 3
# Output: ["age"]
# Explanation: "age" appears in all 3 models

Hint 1

Build a mapping from feature_name to the set of models it appears in.

Hint 2

Use a `defaultdict(set)` to collect model names per feature.

Hint 3

Return features where `len(models) == total_models`.

Solution

from collections import defaultdict

def features_in_all_models(pairs, total_models):
    """Find features present in every model.

    Time: O(n), Space: O(n)
    """
    feature_to_models = defaultdict(set)
    for feature, model in pairs:
        feature_to_models[feature].add(model)

    return [feature for feature, models in feature_to_models.items()
            if len(models) == total_models]

Problem 2: Minimum Window Substring

Given a string S and a string T, find the minimum window in S that contains all characters of T.

# Input:  s = "ADOBECODEBANC", t = "ABC"
# Output: "BANC"

Hint 1

Use a sliding window with two hash maps: one for the required character counts (from T), one for the current window's character counts.

Hint 2

Track how many characters in T are "satisfied" (current count >= required count). When all are satisfied, try to shrink the window from the left.

Hint 3

The time complexity is O(|S| + |T|) because each character is visited at most twice (once by right pointer, once by left pointer).

Solution

from collections import Counter

def min_window_substring(s, t):
    """Find minimum window in s containing all characters of t.

    Time: O(|S| + |T|), Space: O(|S| + |T|)
    """
    if not s or not t or len(s) < len(t):
        return ""

    required = Counter(t)
    needed = len(required)  # Number of unique chars needed
    formed = 0              # Number of unique chars satisfied
    window_counts = {}

    min_len = float('inf')
    min_start = 0
    left = 0

    for right in range(len(s)):
        char = s[right]
        window_counts[char] = window_counts.get(char, 0) + 1

        # Check if current character's frequency is satisfied
        if char in required and window_counts[char] == required[char]:
            formed += 1

        # Try to shrink window
        while formed == needed:
            # Update minimum window
            if right - left + 1 < min_len:
                min_len = right - left + 1
                min_start = left

            # Remove left character
            left_char = s[left]
            window_counts[left_char] -= 1
            if left_char in required and window_counts[left_char] < required[left_char]:
                formed -= 1
            left += 1

    return s[min_start:min_start + min_len] if min_len != float('inf') else ""

Problem 3: Subarray Sum Equals K

Count the number of contiguous subarrays whose sum equals K.

# Input:  nums = [1, 1, 1], k = 2
# Output: 2 (subarrays [1,1] at index 0-1 and 1-2)

Hint 1

Use prefix sums. If prefix[j] - prefix[i] == k, then the subarray from i to j-1 sums to k.

Hint 2

Store prefix sum frequencies in a hash map. At each position, check how many times (current_prefix - k) has appeared.

Hint 3

Initialize the hash map with `{0: 1}` to handle subarrays starting from index 0.

Solution

def subarray_sum(nums, k):
    """Count subarrays with sum equal to k using prefix sum + hash map.

    Time: O(n), Space: O(n)
    """
    count = 0
    prefix = 0
    prefix_counts = {0: 1}

    for num in nums:
        prefix += num
        count += prefix_counts.get(prefix - k, 0)
        prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1

    return count

Problem 4: Valid Sudoku

Determine if a 9x9 Sudoku board is valid (only check filled cells).

# ML relevance: Constraint satisfaction - similar to checking that
# a feature matrix satisfies uniqueness constraints per category.

Hint 1

Use three sets: one for each row, one for each column, one for each 3x3 box.

Hint 2

The box index for cell (r, c) is `(r // 3, c // 3)`.

Hint 3

For each filled cell, check if the value already exists in its row set, column set, or box set.

Solution

def is_valid_sudoku(board):
    """Check if a Sudoku board is valid.

    Time: O(81) = O(1) since board size is fixed
    Space: O(81) = O(1)
    """
    rows = [set() for _ in range(9)]
    cols = [set() for _ in range(9)]
    boxes = [set() for _ in range(9)]

    for r in range(9):
        for c in range(9):
            val = board[r][c]
            if val == '.':
                continue

            box_idx = (r // 3) * 3 + (c // 3)

            if val in rows[r] or val in cols[c] or val in boxes[box_idx]:
                return False

            rows[r].add(val)
            cols[c].add(val)
            boxes[box_idx].add(val)

    return True

Problem 5: Longest Consecutive Sequence

Given an unsorted array of integers, find the length of the longest consecutive elements sequence in O(n) time.

# Input:  [100, 4, 200, 1, 3, 2]
# Output: 4 (sequence [1, 2, 3, 4])

Hint 1

Put all elements in a set for O(1) lookup. Then for each number, check if it is the start of a sequence (num - 1 not in set).

Hint 2

Only count from sequence starts. This ensures each element is visited at most twice total.

Hint 3

From each start, keep incrementing and checking `num + 1 in set`.

Solution

def longest_consecutive(nums):
    """Find longest consecutive sequence using a set.

    Time: O(n), Space: O(n)
    """
    if not nums:
        return 0

    num_set = set(nums)
    max_length = 0

    for num in num_set:
        # Only start counting from the beginning of a sequence
        if num - 1 not in num_set:
            current = num
            length = 1
            while current + 1 in num_set:
                current += 1
                length += 1
            max_length = max(max_length, length)

    return max_length

Problem 6: Design a Time-Based Key-Value Store

Create a data structure that stores (key, value, timestamp) and supports:

set(key, value, timestamp) - store the value for the given key at the given timestamp
get(key, timestamp) - return the value associated with the largest timestamp <= given timestamp

# ML context: A feature store that tracks feature values over time.
# "What was the user's credit score at inference time T?"

Hint 1

Use a hash map where each key maps to a list of (timestamp, value) pairs. Since timestamps are always increasing, the list stays sorted.

Hint 2

For `get`, use binary search on the timestamp list to find the largest timestamp <= the query timestamp.

Hint 3

Python's `bisect` module provides `bisect_right` which finds the insertion point - the answer is one position before it.

Solution

from collections import defaultdict
import bisect

class TimeMap:
    """Time-based key-value store.

    set: O(1) amortized
    get: O(log n) where n = number of timestamps for that key
    """
    def __init__(self):
        self.store = defaultdict(list)  # key -> [(timestamp, value), ...]

    def set(self, key, value, timestamp):
        self.store[key].append((timestamp, value))

    def get(self, key, timestamp):
        if key not in self.store:
            return ""
        entries = self.store[key]
        # Binary search for rightmost timestamp <= query timestamp
        idx = bisect.bisect_right(entries, (timestamp, chr(127))) - 1
        if idx < 0:
            return ""
        return entries[idx][1]

Interview Cheat Sheet

Pattern	When to Use	Key Detail
Frequency counting	"Most common", "frequency of", "count"	Use `Counter`, bucket sort for top-K in O(n)
Two-Sum variants	"Two numbers that sum to", "pair with property"	Hash map for O(1) complement lookup
Grouping	"Group by property", "anagrams", "categorize"	`defaultdict(list)`, choose the right key
Set membership	"Contains", "exists in", "unique"	`set` for O(1) `in` check
Set operations	"Common elements", "difference", "overlap"	`&` (intersection), `\|` (union), `-` (difference)
Sliding window + hash	"Window with K distinct", "contains all chars"	Hash map tracks window contents
Prefix sum + hash	"Subarray sum equals K"	Map stores prefix sum frequencies
LRU Cache	"Eviction policy", "bounded cache"	`OrderedDict` or hash map + doubly linked list

Spaced Repetition Checkpoints

Day	Review Activity	Time
Day 0	Read all patterns. Solve Two Sum, Group Anagrams, and Top K Frequent without looking.	35 min
Day 3	Implement LRU Cache from scratch (not using OrderedDict).	20 min
Day 7	Solve Minimum Window Substring and Subarray Sum Equals K from scratch.	25 min
Day 14	Explain hash table internals to someone: how hashing works, collision handling, load factor, and resizing.	15 min
Day 21	Time yourself on 3 hash map problems from LeetCode (Medium). Target: under 20 minutes each.	60 min

Next Steps

Hash maps give you O(1) lookups. Next, learn Trees and Graphs to handle hierarchical and network-structured data - from decision trees to ML pipeline DAGs.

The Real Interview Moment​

Hash Table Internals​

How a Hash Table Works​

Collision Handling Strategies​

Load Factor and Resizing​

What Can Be a Hash Key in Python?​

Pattern 1: Frequency Counting​

Frequency-Based Grouping​

Pattern 2: Two-Sum and Its Variants​

Two-Sum Variants Comparison​

Pattern 3: Hash Map for Lookup Optimization​

Pattern 4: Set Operations​

Set for Deduplication​

Pattern 5: Hash Map for Caching (LRU Cache)​

Pattern 6: Hash Map for Subarray Problems​

ML-Specific Hash Map Applications​

Building a Vocabulary​

Feature Hashing (Hashing Trick)​

Embedding Lookup Table​

Practice Problems​

Problem 1: Feature Deduplication Across Models​

Problem 2: Minimum Window Substring​

Problem 3: Subarray Sum Equals K​

Problem 4: Valid Sudoku​

Problem 5: Longest Consecutive Sequence​

Problem 6: Design a Time-Based Key-Value Store​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Next Steps​

The Real Interview Moment

Hash Table Internals

How a Hash Table Works

Collision Handling Strategies

Load Factor and Resizing

What Can Be a Hash Key in Python?

Pattern 1: Frequency Counting

Frequency-Based Grouping

Pattern 2: Two-Sum and Its Variants

Two-Sum Variants Comparison

Pattern 3: Hash Map for Lookup Optimization

Pattern 4: Set Operations

Set for Deduplication

Pattern 5: Hash Map for Caching (LRU Cache)

Pattern 6: Hash Map for Subarray Problems

ML-Specific Hash Map Applications

Building a Vocabulary

Feature Hashing (Hashing Trick)

Embedding Lookup Table

Practice Problems

Problem 1: Feature Deduplication Across Models

Problem 2: Minimum Window Substring

Problem 3: Subarray Sum Equals K

Problem 4: Valid Sudoku

Problem 5: Longest Consecutive Sequence

Problem 6: Design a Time-Based Key-Value Store

Interview Cheat Sheet

Spaced Repetition Checkpoints

Next Steps