
Randomized Algorithms and Sketching

Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, Data Engineer, Research Engineer


The 3-Petabyte Feature Store Problem

A major streaming platform's ML team was building a real-time recommendation system. Their feature pipeline needed to answer questions like: "How many unique users have watched this show in the last 24 hours?" and "What is the approximate watch frequency for this content ID across all streaming sessions?" The event log was 3 petabytes and growing at 200GB per hour.

The naive approach - store every event, group by content ID, count distinct users - would require scanning terabytes per query. Even with distributed systems, the latency was 30-40 seconds per feature query. That was fatal for a real-time recommendation system that needed sub-10ms responses.

The solution was two sketching data structures: HyperLogLog for "count distinct users" (using 12KB per counter regardless of how many users, accurate to within 2%) and Count-Min Sketch for "approximate watch frequency" (using a few MB total for all content IDs). The entire feature store shrank from petabytes to gigabytes. Query latency dropped to under 1 millisecond. The accuracy tradeoff was acceptable: recommendations based on "approximately 2.3M unique viewers" are just as good as those based on "exactly 2,301,847 unique viewers."

This is the core philosophy behind randomized algorithms: for many ML problems, you do not need exact answers. You need answers that are close enough to make good decisions, and you need them fast. Randomized algorithms provide probabilistic guarantees - "the error is at most $\epsilon$ with probability at least $1 - \delta$" - that are often more than sufficient for ML workloads.

The range of applications is broad. Reservoir sampling makes streaming ML data pipelines possible by maintaining a uniform random sample of a stream without knowing its length. Johnson-Lindenstrauss projections let you reduce 10,000-dimensional embeddings to 500 dimensions while preserving pairwise distances, enabling efficient similarity search. Randomized SVD approximates matrix factorization in hours instead of days on large embedding matrices. Locality-sensitive hashing enables approximate nearest neighbor search in milliseconds rather than seconds.

Every production ML system at scale uses some of these techniques, often invisibly. NumPy's random sampling, scikit-learn's TruncatedSVD, Faiss's LSH indexing, and the Count-Min Sketch inside Apache Flink's feature aggregation - they are all applications of the ideas in this lesson. Understanding where these approximations come from and what guarantees they provide is the difference between using them confidently and using them blindly.


Why This Exists

The traditional computer science requirement is exactness: a sort must sort perfectly, a search must find the exact item, a count must be precisely correct. This made sense when datasets were small enough to fit in memory and exact computation was feasible.

ML broke this assumption. Training sets with billions of examples. Embedding spaces with millions of tokens. Feature stores tracking trillions of events. Vocabularies with hundreds of thousands of unique terms. At this scale, exact algorithms either fail (OOM), are too slow (hours instead of milliseconds), or require infrastructure costs that make the system unviable.

The key insight behind randomized algorithms is that randomness can be used constructively - not just as a source of uncertainty, but as a tool for achieving computational efficiency with provable approximation guarantees. This is different from simply ignoring part of the data. A properly constructed random sample of 10,000 items from a stream of 10 billion items is statistically representative of the whole stream. A random projection that maps 10,000 dimensions down to 1,000 dimensions provably preserves pairwise distances up to a controlled distortion factor.

The theoretical foundation is largely probabilistic analysis - concentration inequalities like Markov's, Chebyshev's, and Chernoff bounds that let us say "the probability of large error is small." This is exactly the same mathematical toolbox used to analyze neural network generalization. Understanding randomized algorithms deepens your intuition for why ML models generalize at all.


Historical Context

The history of randomized algorithms in CS is surprisingly recent. Most were developed between the late 1970s and the 2000s, driven by the emergence of large-scale data processing.

Morris (1978) developed probabilistic counting - the first approximate counting algorithm. He showed that you can count up to $n$ using only $O(\log \log n)$ bits by storing the logarithm of the count rather than the count itself, incrementing it probabilistically.

Vitter (1985) formalized reservoir sampling, proving that a simple "replace with probability $k/i$" rule maintains a uniform random sample of size $k$ from a stream of length $i$ without knowing the stream length in advance.

Johnson and Lindenstrauss (1984) proved the foundational lemma for random projections: any $n$ points in high-dimensional space can be projected to $O(\log n / \epsilon^2)$ dimensions while preserving all pairwise distances within a factor of $(1 \pm \epsilon)$. The original proof projected onto a random subspace; later constructions achieve the same guarantee with random matrices of i.i.d. Gaussian or Rademacher entries.

Cormode and Muthukrishnan (2003) introduced the Count-Min Sketch, building on earlier work on sketching for frequency estimation. The key innovation was using multiple independent hash functions to reduce the probability of collision errors.

Flajolet and Martin (1985) developed the first HyperLogLog precursor. The full HyperLogLog algorithm (Flajolet, Fusy, Gandouet, Meunier, 2007) achieved the near-optimal space bound for cardinality estimation: $O(\log \log n)$ bits per register to estimate cardinalities up to billions with ~2% error.


Reservoir Sampling

The Streaming Data Problem

You have a stream of training examples arriving one at a time. You do not know how many examples there will be. You want to maintain a uniform random sample of exactly $k$ examples from everything you have seen so far, at all times. How?

The naive approach - buffer everything, then sample at the end - requires $O(n)$ memory and waits until the stream ends. This is unworkable for real-time ML pipelines that process events continuously.

Reservoir sampling solves this with $O(k)$ memory and a single pass.

Algorithm R (Vitter, 1985):

  1. Fill the reservoir with the first $k$ items
  2. For each subsequent item $i$ (starting at $i = k+1$): include it in the sample with probability $k/i$. If included, replace a uniformly random existing item in the reservoir.

The invariant: after processing $i$ items, each of the $i$ items is in the reservoir with probability $k/i$. You can verify this by induction.

import random
from typing import Iterator, TypeVar

T = TypeVar('T')

def reservoir_sample(stream: Iterator[T], k: int) -> list:
    """
    Vitter's Algorithm R: uniform random sample of size k from a stream.
    O(k) memory, single pass, no knowledge of stream length needed.

    ML use cases:
    - Sample training examples from a data firehose
    - Maintain balanced class samples for online learning
    - Sample from a shuffle buffer for data loading
    - Build calibration datasets from production inference logs
    """
    reservoir = []

    for i, item in enumerate(stream):
        if i < k:
            # Fill reservoir with first k items
            reservoir.append(item)
        else:
            # Each subsequent item replaces a reservoir item
            # with probability k / (i+1)
            j = random.randint(0, i)  # uniform in [0, i]
            if j < k:
                reservoir[j] = item  # replace reservoir[j]

    return reservoir


def streaming_class_balanced_sample(
    stream: Iterator[tuple],  # (features, label) pairs
    k_per_class: int,
    n_classes: int
) -> dict:
    """
    Stratified reservoir sampling: maintain k_per_class samples per class.
    Used for online learning with class imbalance - ensures minority classes
    are not lost in the stream sample.
    """
    reservoirs = {c: [] for c in range(n_classes)}
    counts = {c: 0 for c in range(n_classes)}

    for features, label in stream:
        counts[label] += 1
        i = counts[label]

        if i <= k_per_class:
            reservoirs[label].append((features, label))
        else:
            j = random.randint(0, i - 1)
            if j < k_per_class:
                reservoirs[label][j] = (features, label)

    return reservoirs


# Verify uniformity: each of N items should appear ~k/N fraction of the time
import numpy as np

def verify_reservoir_uniformity(N: int = 10000, k: int = 100,
                                trials: int = 1000) -> None:
    """Check that each item appears with probability k/N across trials."""
    counts = np.zeros(N)
    for _ in range(trials):
        sample = reservoir_sample(iter(range(N)), k)
        for item in sample:
            counts[item] += 1

    expected = k / N
    actual = counts / trials
    print(f"Expected probability: {expected:.4f}")
    print(f"Mean actual probability: {actual.mean():.4f}")
    print(f"Std dev: {actual.std():.4f}")
    print(f"Max deviation from expected: {np.abs(actual - expected).max():.4f}")
    # Should all be close to k/N = 0.01, std ~0.001

verify_reservoir_uniformity()

Random Projections and Johnson-Lindenstrauss

The Curse of Dimensionality and a Remedy

High-dimensional vectors are expensive. An embedding model might produce 4096-dimensional vectors. Storing 100 million such vectors requires $4096 \times 10^8 \times 4$ bytes $\approx$ 1.6 TB. Searching nearest neighbors in 4096 dimensions is effectively linear-scan expensive for any index.

The Johnson-Lindenstrauss lemma says we can project these to a much lower dimension while preserving pairwise distances:

Lemma (Johnson-Lindenstrauss, 1984): For any set of $n$ points in $\mathbb{R}^d$ and any $\epsilon \in (0, 1)$, there exists a mapping $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\log n / \epsilon^2)$ such that for all pairs $x, y$:

$$(1 - \epsilon)\|x - y\|^2 \leq \|f(x) - f(y)\|^2 \leq (1 + \epsilon)\|x - y\|^2$$

The key insight: this mapping is a random linear projection - just multiply by a random matrix with i.i.d. Gaussian entries (scaled by $1/\sqrt{k}$). No optimization required. The mathematics of Gaussian random matrices guarantees the distance preservation in expectation, and concentration inequalities ensure the probability of large deviation is exponentially small.

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

def johnson_lindenstrauss_projection(
    X: np.ndarray,
    target_dim: int,
    method: str = 'gaussian'
) -> tuple:
    """
    JL random projection: reduce dimension while preserving distances.

    X: [n_samples, d_original] - high-dimensional input
    target_dim: k - target dimension (should be >= 8 * log(n) / eps^2)
    method: 'gaussian' (exact JL), 'sparse' (Achlioptas 2003, faster)

    Returns: (X_projected, projection_matrix)
    """
    n, d = X.shape

    if method == 'gaussian':
        # Standard JL: i.i.d. N(0, 1/k) entries, so squared distances
        # are preserved in expectation
        R = np.random.randn(d, target_dim) / np.sqrt(target_dim)
    elif method == 'sparse':
        # Sparse JL (Achlioptas 2003): +/- 1 with prob 1/6, 0 with prob 2/3
        # 3x faster, same asymptotic guarantee
        R = np.zeros((d, target_dim))
        probs = np.random.rand(d, target_dim)
        R[probs < 1/6] = np.sqrt(3.0 / target_dim)
        R[probs > 5/6] = -np.sqrt(3.0 / target_dim)
    else:
        raise ValueError(f"Unknown method: {method}")

    X_proj = X @ R  # [n, target_dim]
    return X_proj, R


def measure_distance_distortion(X: np.ndarray, X_proj: np.ndarray,
                                n_pairs: int = 1000) -> dict:
    """
    Measure how well random projection preserves pairwise distances.
    Sample random pairs and compare original vs projected distances.
    """
    n = X.shape[0]
    idx1 = np.random.randint(0, n, n_pairs)
    idx2 = np.random.randint(0, n, n_pairs)
    idx2 = np.where(idx1 == idx2, (idx2 + 1) % n, idx2)

    orig_dists = np.linalg.norm(X[idx1] - X[idx2], axis=1)
    proj_dists = np.linalg.norm(X_proj[idx1] - X_proj[idx2], axis=1)

    ratios = proj_dists / (orig_dists + 1e-10)

    return {
        'mean_ratio': ratios.mean(),
        'std_ratio': ratios.std(),
        'min_ratio': ratios.min(),
        'max_ratio': ratios.max(),
        'within_10pct': ((ratios >= 0.9) & (ratios <= 1.1)).mean(),
    }


# Practical example: reducing BERT-like embeddings for fast ANN search
np.random.seed(42)
n_samples, d_original = 10000, 768 # BERT embedding dimension
d_target = 128 # target dimension

X = np.random.randn(n_samples, d_original) # simulated embeddings
X_proj, R = johnson_lindenstrauss_projection(X, target_dim=d_target)

distortion = measure_distance_distortion(X, X_proj)
print(f"Original dim: {d_original}, Target dim: {d_target}")
print(f"Compression: {d_original/d_target:.1f}x")
print(f"Distance ratio mean: {distortion['mean_ratio']:.4f} (ideal: 1.0)")
print(f"Distance ratio std: {distortion['std_ratio']:.4f}")
print(f"Pairs within 10% distortion: {distortion['within_10pct']:.1%}")


# Sklearn's GaussianRandomProjection does this automatically
# with JL minimum dimension calculation
from sklearn.random_projection import GaussianRandomProjection

projector = GaussianRandomProjection(n_components=128, random_state=42)
X_sklearn = projector.fit_transform(X)
print(f"\nSklearn projected shape: {X_sklearn.shape}")

JL Lemma in Practice

The minimum dimension for distortion $\epsilon$ with probability $1 - \delta$ over $n$ points:

$$k \geq \frac{4 \log(n/\delta)}{\epsilon^2/2 - \epsilon^3/3}$$

For $n = 10{,}000$, $\epsilon = 0.1$, $\delta = 0.01$: $k \geq \frac{4 \ln(10^6)}{0.005 - 0.000333} \approx 11{,}800$. This is the theoretical minimum; in practice, $k = 128$ or $256$ works well for many ML tasks because the embeddings are not worst-case distributed.
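
To get a feel for how the bound scales, you can evaluate it directly. The helper below is a small illustrative function added here (the name jl_min_dim is ours, not a library API; sklearn ships a similar utility, johnson_lindenstrauss_min_dim, that omits the $\delta$ term):

import numpy as np

def jl_min_dim(n: int, eps: float, delta: float) -> int:
    """Evaluate the JL bound above. Conservative relative to practice."""
    return int(np.ceil(4 * np.log(n / delta) / (eps**2 / 2 - eps**3 / 3)))

for eps in (0.05, 0.1, 0.2, 0.5):
    print(f"eps={eps:.2f}: k >= {jl_min_dim(10_000, eps, 0.01):,}")
# Required dimension scales as 1/eps^2: loosening the distortion tolerance
# shrinks k rapidly, which is part of why small k works in practice.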


Count-Min Sketch

Frequency Estimation for Streaming Data

The Count-Min Sketch answers: "approximately how many times has element $x$ appeared in this data stream?" with $O(1/\epsilon \cdot \log(1/\delta))$ space, compared to $O(n)$ for a hash map with all elements.

Structure: A $d \times w$ array of counters, plus $d$ independent hash functions $h_1, \ldots, h_d: U \to [w]$.

Update: For element $x$, increment $\text{sketch}[i][h_i(x)]$ for each row $i$.

Query: For element $x$, return $\min_i \text{sketch}[i][h_i(x)]$.

The minimum gives an upper bound on the true frequency. Hash collisions only add to counts (never subtract), so the minimum over $d$ hash functions gives the estimate closest to the true frequency.

Error guarantee: with probability $1 - \delta$, the estimate is at most the true count plus $\epsilon \cdot \|a\|_1$ (the total stream length), where $w = e/\epsilon$ and $d = \ln(1/\delta)$.

import numpy as np
import hashlib
from typing import Any

class CountMinSketch:
    """
    Count-Min Sketch for approximate frequency estimation.
    Cormode & Muthukrishnan, 2003.

    Space: O(epsilon^{-1} * log(delta^{-1})) - independent of stream length
    Update time: O(d) = O(log(1/delta))
    Query time: O(d)

    ML applications:
    - Token frequency estimation in LLM training data
    - Feature frequency for feature selection at scale
    - Approximate join frequency for data deduplication
    - Bot/anomaly detection: count user action frequencies in real-time
    """

    def __init__(self, epsilon: float = 0.01, delta: float = 0.001):
        """
        epsilon: maximum additive error as fraction of total stream mass
        delta: probability of exceeding epsilon error
        """
        # Width: ceil(e / epsilon), Depth: ceil(ln(1/delta))
        self.width = int(np.ceil(np.e / epsilon))
        self.depth = int(np.ceil(np.log(1.0 / delta)))
        self.sketch = np.zeros((self.depth, self.width), dtype=np.int64)

        # Generate depth independent hash seeds
        self.seeds = np.random.randint(0, 2**31, size=self.depth)

        print(f"CMS size: {self.depth} x {self.width} = "
              f"{self.depth * self.width * 8 / 1024:.1f} KB")

    def _hash(self, item: Any, seed: int) -> int:
        """Deterministic hash using SHA-256 with seed mixing."""
        key = f"{seed}:{item}".encode()
        return int(hashlib.sha256(key).hexdigest(), 16) % self.width

    def update(self, item: Any, count: int = 1) -> None:
        """Add 'count' occurrences of item to the sketch."""
        for i, seed in enumerate(self.seeds):
            col = self._hash(item, seed)
            self.sketch[i, col] += count

    def query(self, item: Any) -> int:
        """Return approximate frequency of item. Always >= true frequency."""
        estimates = []
        for i, seed in enumerate(self.seeds):
            col = self._hash(item, seed)
            estimates.append(self.sketch[i, col])
        return min(estimates)

    def merge(self, other: 'CountMinSketch') -> 'CountMinSketch':
        """Merge two sketches (for distributed stream processing)."""
        assert self.width == other.width and self.depth == other.depth
        result = CountMinSketch.__new__(CountMinSketch)
        result.width = self.width
        result.depth = self.depth
        result.seeds = self.seeds
        result.sketch = self.sketch + other.sketch
        return result


# Simulate token frequency estimation for LLM training data
def simulate_token_frequency_cms():
    np.random.seed(42)
    vocab_size = 50000          # GPT-2 vocab size
    stream_length = 10_000_000  # 10M tokens

    # True frequencies: Zipf-distributed (power law, as in real text)
    token_probs = np.arange(1, vocab_size + 1, dtype=float) ** (-1.2)
    token_probs /= token_probs.sum()
    tokens = np.random.choice(vocab_size, size=stream_length, p=token_probs)

    # Exact counts (requires O(vocab_size) memory)
    exact_counts = np.bincount(tokens, minlength=vocab_size)

    # Count-Min Sketch (requires O(1/eps * log(1/delta)) memory)
    cms = CountMinSketch(epsilon=0.001, delta=0.01)
    for token in tokens:
        cms.update(token)

    # Compare on top-100 most frequent tokens
    top_tokens = np.argsort(exact_counts)[::-1][:100]
    errors = []
    for t in top_tokens:
        exact = exact_counts[t]
        approx = cms.query(t)
        errors.append(abs(approx - exact) / exact)

    print(f"Stream length: {stream_length:,}")
    print(f"Vocab size: {vocab_size:,}")
    print(f"CMS memory: ~{cms.depth * cms.width * 8 / 1024:.0f} KB "
          f"(vs {vocab_size * 8 // 1024} KB for exact)")
    print(f"Mean relative error on top-100 tokens: {np.mean(errors):.4f}")
    print(f"Max relative error on top-100 tokens: {np.max(errors):.4f}")

simulate_token_frequency_cms()

HyperLogLog

Counting Unique Elements

HyperLogLog answers: "approximately how many distinct elements have appeared in this stream?" using registers of only $O(\log \log n)$ bits each - about 12 KB in total for cardinalities up to billions.

Intuition: Hash each element to a binary string. The probability that a hash starts with $k$ leading zeros is $2^{-k}$. If you observe a hash with 20 leading zeros, the stream probably contained at least $2^{20} \approx 1\text{M}$ distinct elements. Average multiple such observations to reduce variance. HyperLogLog formalizes this with a careful use of stochastic averaging and harmonic mean estimation.

import hashlib
import math

class HyperLogLog:
    """
    HyperLogLog cardinality estimator.
    Flajolet, Fusy, Gandouet, Meunier 2007.

    Space: registers of O(log log n) bits; ~12 KB packed for 10^9 cardinality
    Error: 1.04 / sqrt(m) where m is number of registers

    ML applications:
    - Count unique users/sessions for feature computation
    - Estimate vocabulary size during LLM pretraining data analysis
    - Approximate set cardinality for dataset deduplication
    - Count distinct IPs for bot detection at stream scale
    """

    def __init__(self, precision: int = 14):
        """
        precision (b): number of bits for register index
        m = 2^b registers
        Standard error: 1.04 / sqrt(m)
        b=14 -> m=16384 registers -> error ~0.81% -> ~12 KB at 6 bits/register
        b=12 -> m=4096 registers  -> error ~1.6%  -> ~3 KB at 6 bits/register
        (This toy implementation stores one Python int per register, so its
        actual footprint is larger than a packed production layout.)
        """
        self.b = precision
        self.m = 1 << precision  # 2^b registers
        self.registers = [0] * self.m

        # Bias correction constant
        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item) -> int:
        """32-bit hash of item."""
        key = str(item).encode()
        return int(hashlib.md5(key).hexdigest()[:8], 16)

    def add(self, item) -> None:
        """Add item to the HyperLogLog."""
        h = self._hash(item)

        # Use first b bits as register index
        register_idx = h >> (32 - self.b)

        # Remaining 32-b bits: find position of leftmost 1-bit in w
        w = h & ((1 << (32 - self.b)) - 1)  # lower 32-b bits

        # rho(w): position of leftmost 1-bit (1-indexed)
        # = leading zeros + 1
        remaining_bits = 32 - self.b
        if w == 0:
            rho = remaining_bits + 1
        else:
            rho = remaining_bits - int(math.log2(w))

        # Update register to maximum rho seen
        self.registers[register_idx] = max(self.registers[register_idx], rho)

    def count(self) -> int:
        """Estimate cardinality using harmonic mean of 2^registers."""
        # Raw estimate: harmonic mean formula
        Z = sum(2.0 ** (-r) for r in self.registers)
        E = self.alpha * (self.m ** 2) / Z

        # Small range correction (linear counting for small cardinalities)
        if E <= 2.5 * self.m:
            V = self.registers.count(0)
            if V > 0:
                E = self.m * math.log(self.m / V)

        # Large range correction (not needed for 32-bit hash up to ~2^30)
        return int(E)

    def merge(self, other: 'HyperLogLog') -> 'HyperLogLog':
        """Merge two HyperLogLogs (for distributed computation)."""
        assert self.b == other.b
        result = HyperLogLog(self.b)
        result.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return result


# Test: count unique tokens in a simulated text corpus
def test_hyperloglog():
    hll = HyperLogLog(precision=12)  # 4096 registers, ~1.6% expected error

    # Simulate 1 million tokens with 50000 unique values
    n_tokens = 1_000_000
    vocab_size = 50_000
    tokens = np.random.randint(0, vocab_size, size=n_tokens)

    for token in tokens:
        hll.add(token)

    estimated = hll.count()
    actual = vocab_size  # we know the true cardinality

    error = abs(estimated - actual) / actual
    print(f"Actual unique tokens: {actual:,}")
    print(f"HLL estimate: {estimated:,}")
    print(f"Relative error: {error:.2%}")
    print(f"Memory used: ~{hll.m * 1 // 1024} KB at one byte per register "
          f"(vs {vocab_size * 8 // 1024} KB for an exact set of 64-bit IDs)")

test_hyperloglog()

Randomized SVD

Why Full SVD Is Too Expensive

Singular Value Decomposition $A = U \Sigma V^T$ is the foundation of PCA, matrix factorization, and LSA (latent semantic analysis). For a matrix of size $m \times n$, full SVD costs $O(\min(mn^2, m^2 n))$ time. For a TF-IDF matrix with 1 million documents and 100,000 terms, that is approximately $10^{16}$ operations - completely infeasible.

Randomized SVD (Halko, Martinsson, Tropp, 2011) computes an approximate rank-$k$ factorization in $O(mnk)$ time. The algorithm:

  1. Draw a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times k}$
  2. Form $Y = A\Omega$ - project $A$ onto a random subspace
  3. Optionally apply power iteration: $Y = (AA^T)^q Y$ for better approximation
  4. Compute a QR factorization $Q, R = \text{QR}(Y)$ to get an orthonormal basis for the range of $A$
  5. Project: $B = Q^T A$, then compute the exact SVD of the small matrix $B$
  6. Recover: $U = Q \cdot U_B$

import numpy as np
from scipy.linalg import svd as scipy_svd
import time

def randomized_svd(A: np.ndarray, k: int,
                   n_power_iter: int = 4,
                   n_oversampling: int = 10) -> tuple:
    """
    Randomized SVD: approximate top-k singular values/vectors.
    Halko, Martinsson, Tropp (2011).

    A: [m, n] input matrix
    k: number of singular components to approximate
    n_power_iter: power iteration steps (improves approximation quality)
    n_oversampling: extra dimensions for numerical stability

    Returns: (U, S, Vt) where A ~= U @ diag(S) @ Vt

    ML applications:
    - PCA on large feature matrices (millions of samples)
    - LSA on large TF-IDF matrices
    - Matrix completion for recommendation systems
    - Compressing embedding weight matrices in neural networks
    """
    m, n = A.shape
    l = k + n_oversampling  # slightly oversampled dimension

    # Step 1: Random projection
    # Omega: [n, l] random Gaussian matrix
    Omega = np.random.randn(n, l)

    # Step 2: Form Y = A @ Omega, then power iteration
    # Power iteration improves accuracy for matrices with slow singular value decay
    Y = A @ Omega  # [m, l]

    for _ in range(n_power_iter):
        # Y = (A A^T)^q Y improves gap between kth and (k+1)th singular values
        Q, _ = np.linalg.qr(Y)
        Z = A.T @ Q
        Q2, _ = np.linalg.qr(Z)
        Y = A @ Q2

    # Step 3: QR factorization to get orthonormal basis Q
    Q, _ = np.linalg.qr(Y)  # Q: [m, l]

    # Step 4: Project A to low-dimensional space
    B = Q.T @ A  # B: [l, n]

    # Step 5: Exact SVD of small matrix B
    U_hat, S, Vt = scipy_svd(B, full_matrices=False)

    # Step 6: Recover U in original space
    U = Q @ U_hat  # U: [m, l]

    return U[:, :k], S[:k], Vt[:k, :]


def compare_svd_methods():
    """Compare full SVD vs randomized SVD on a realistic matrix."""
    np.random.seed(42)

    # Simulate a document-term matrix: 5000 docs, 2000 terms, rank ~100
    m, n, true_rank = 5000, 2000, 100
    U_true = np.random.randn(m, true_rank)
    S_true = np.linspace(100, 1, true_rank)
    V_true = np.random.randn(n, true_rank)
    noise = 0.01 * np.random.randn(m, n)
    A = (U_true * S_true) @ V_true.T + noise

    k = 50  # approximate with top-50 components

    # Full SVD
    t0 = time.time()
    U_full, S_full, Vt_full = scipy_svd(A, full_matrices=False)
    t_full = time.time() - t0

    # Randomized SVD
    t0 = time.time()
    U_rand, S_rand, Vt_rand = randomized_svd(A, k=k, n_power_iter=4)
    t_rand = time.time() - t0

    # Reconstruction error
    A_full = U_full[:, :k] @ np.diag(S_full[:k]) @ Vt_full[:k, :]
    A_rand = U_rand @ np.diag(S_rand) @ Vt_rand

    err_full = np.linalg.norm(A - A_full, 'fro') / np.linalg.norm(A, 'fro')
    err_rand = np.linalg.norm(A - A_rand, 'fro') / np.linalg.norm(A, 'fro')

    print(f"Matrix shape: {A.shape}")
    print(f"Full SVD: time={t_full:.3f}s, reconstruction error={err_full:.6f}")
    print(f"Rand SVD: time={t_rand:.3f}s, reconstruction error={err_rand:.6f}")
    print(f"Speedup: {t_full / t_rand:.1f}x")

compare_svd_methods()

Locality-Sensitive Hashing (LSH)

Finding the exact nearest neighbor of a query vector in a dataset of $n$ vectors requires $O(nd)$ time - you must compare the query to every vector. For $n = 10^8$ and $d = 768$ (BERT embedding size), that is about $7.7 \times 10^{10}$ operations per query. At 10 GFLOPS, that is roughly 7.7 seconds per query. Completely unacceptable for real-time retrieval.

LSH provides approximate nearest neighbor (ANN) search: find a vector that is close to (not necessarily the exact closest to) the query, in sub-linear time.

For cosine similarity (random projection LSH):

  • Hash $x$ using $k$ random hyperplanes: $h_i(x) = \text{sign}(r_i \cdot x)$ where $r_i \sim \mathcal{N}(0, I)$
  • Two vectors $x, y$ collide in hash $i$ with probability $1 - \theta(x,y)/\pi$, where $\theta$ is the angle between them
  • Close vectors (small angle) collide with high probability; distant vectors (large angle) rarely collide
  • Use $L$ independent hash tables with $k$ bits each for high recall
import numpy as np
from typing import List, Dict, Set

class RandomProjectionLSH:
    """
    LSH for cosine similarity using random hyperplane projections.
    Charikar (2002): P[h(x) = h(y)] = 1 - angle(x,y)/pi

    Used in: Faiss's IndexLSH, approximate kNN for recommendation,
    semantic deduplication of training data, similar document retrieval.
    """

    def __init__(self, d: int, n_bits: int = 16, n_tables: int = 8):
        """
        d: input dimension
        n_bits: bits per hash (controls recall-precision tradeoff)
        n_tables: number of hash tables (higher = more recall, more memory)
        """
        self.d = d
        self.n_bits = n_bits
        self.n_tables = n_tables

        # Random hyperplanes: [n_tables, n_bits, d]
        self.hyperplanes = np.random.randn(n_tables, n_bits, d)

        # Hash tables: list of {hash_code: [item_ids]}
        self.tables: List[Dict[int, List[int]]] = [{} for _ in range(n_tables)]
        self.vectors = []  # stored vectors for exact comparison

    def _hash(self, v: np.ndarray, table_idx: int) -> int:
        """Compute n_bits hash for vector v using table table_idx."""
        projections = self.hyperplanes[table_idx] @ v  # [n_bits]
        bits = (projections > 0).astype(int)
        return int(''.join(map(str, bits)), 2)

    def index(self, vectors: np.ndarray) -> None:
        """Add vectors to LSH index."""
        # Normalize for cosine similarity
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = vectors / (norms + 1e-10)

        for i, v in enumerate(self.vectors):
            for t in range(self.n_tables):
                code = self._hash(v, t)
                if code not in self.tables[t]:
                    self.tables[t][code] = []
                self.tables[t][code].append(i)

    def query(self, q: np.ndarray, top_k: int = 10) -> List[int]:
        """Find approximate nearest neighbors to query q."""
        q_norm = q / (np.linalg.norm(q) + 1e-10)

        # Collect candidates from all hash tables
        candidates: Set[int] = set()
        for t in range(self.n_tables):
            code = self._hash(q_norm, t)
            candidates.update(self.tables[t].get(code, []))

        if not candidates:
            return []

        # Re-rank candidates by exact cosine similarity
        candidate_list = list(candidates)
        candidate_vecs = self.vectors[candidate_list]
        scores = candidate_vecs @ q_norm
        top_indices = np.argsort(scores)[::-1][:top_k]

        return [candidate_list[i] for i in top_indices]


def benchmark_lsh_vs_brute_force():
    """Compare LSH recall and speed against brute-force kNN."""
    np.random.seed(42)
    n, d = 100_000, 256

    # Random unit vectors (simulated embeddings)
    X = np.random.randn(n, d)
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # Build LSH index
    lsh = RandomProjectionLSH(d=d, n_bits=16, n_tables=10)
    lsh.index(X)

    # Test queries
    n_queries = 100
    queries = np.random.randn(n_queries, d)
    queries /= np.linalg.norm(queries, axis=1, keepdims=True)

    # Brute-force ground truth
    bf_results = (queries @ X.T).argsort(axis=1)[:, ::-1][:, :10]

    # LSH results
    lsh_results = [set(lsh.query(q, top_k=10)) for q in queries]
    bf_sets = [set(bf_results[i]) for i in range(n_queries)]

    # Recall@10
    recalls = [len(lsh_results[i] & bf_sets[i]) / 10
               for i in range(n_queries)]
    print(f"n={n:,}, d={d}")
    print(f"LSH Recall@10: {np.mean(recalls):.2%}")
    print(f"Hash tables: 10, bits per hash: 16")

benchmark_lsh_vs_brute_force()

Random Feature Maps for Kernel Approximation

Random Fourier Features (Rahimi and Recht, 2007) approximate an RBF kernel $k(x, y) = e^{-\|x-y\|^2 / 2\sigma^2}$ using a random feature map:

$$k(x, y) \approx \phi(x)^T \phi(y)$$

where $\phi(x) = \sqrt{2/D}\,[\cos(\omega_1 \cdot x + b_1), \ldots, \cos(\omega_D \cdot x + b_D)]$ with $\omega_i \sim \mathcal{N}(0, \sigma^{-2} I)$ and $b_i \sim \text{Uniform}(0, 2\pi)$.

This allows training a kernel SVM with $O(nD)$ memory instead of $O(n^2)$ for the full kernel matrix.

class RandomFourierFeatures:
    """
    Random Fourier Features for RBF kernel approximation.
    Rahimi & Recht, NeurIPS 2007.

    Enables linear SVMs/logistic regression to approximate
    kernel SVMs at O(nD) cost vs O(n^2) for full kernel matrix.
    Used in: large-scale kernel methods, neural tangent kernel approximation,
    fast attention mechanisms (Performer, Random Feature Attention).
    """

    def __init__(self, input_dim: int, n_features: int = 1000,
                 gamma: float = 1.0):
        """
        input_dim: original feature dimension
        n_features: number of random features D (higher = better approximation)
        gamma: RBF kernel bandwidth (same as sklearn's gamma parameter)
        """
        self.D = n_features
        # Sample frequencies from the Fourier transform of the RBF kernel
        # For RBF with parameter gamma: p(omega) = N(0, 2*gamma * I)
        self.omega = np.random.randn(input_dim, n_features) * np.sqrt(2 * gamma)
        self.bias = np.random.uniform(0, 2 * np.pi, size=n_features)

    def transform(self, X: np.ndarray) -> np.ndarray:
        """Map X to random feature space."""
        # phi(x) = sqrt(2/D) * cos(x @ omega + bias)
        projection = X @ self.omega + self.bias  # [n, D]
        return np.sqrt(2.0 / self.D) * np.cos(projection)


def compare_kernel_approximation():
    """Show that RFF approximates the RBF kernel well."""
    from sklearn.metrics.pairwise import rbf_kernel

    np.random.seed(42)
    n, d = 500, 50
    X = np.random.randn(n, d)

    gamma = 0.1
    K_exact = rbf_kernel(X, gamma=gamma)

    rff = RandomFourierFeatures(d, n_features=2000, gamma=gamma)
    Phi = rff.transform(X)
    K_approx = Phi @ Phi.T

    error = np.abs(K_exact - K_approx).mean()
    print(f"Mean absolute kernel approximation error: {error:.6f}")
    print(f"Max absolute error: {np.abs(K_exact - K_approx).max():.6f}")
    print(f"Memory: exact kernel = {n*n*8//1024} KB, "
          f"RFF features = {n*2000*8//1024} KB for {n} samples "
          f"(the savings appear once n >> D)")

compare_kernel_approximation()

Monte Carlo Methods

Monte Carlo methods use random sampling to solve problems that are computationally intractable by exact methods. In ML, the most important applications are:

Monte Carlo Integration: Estimate an expectation $\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\,p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$ where $x_i \sim p(x)$. The error shrinks at $O(1/\sqrt{N})$ regardless of dimension, which is why Monte Carlo is the default tool for high-dimensional integration, where grid-based quadrature suffers from the curse of dimensionality.
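
As a minimal illustration (a synthetic example added here, not tied to any specific ML pipeline), the snippet below estimates $\mathbb{E}[\|x\|^2]$ for $x \sim \mathcal{N}(0, I_d)$ in $d = 100$ dimensions, where the true value is exactly $d$, and shows the error shrinking roughly as $1/\sqrt{N}$:

import numpy as np

rng = np.random.default_rng(0)
d = 100  # a 100-dimensional integral, hopeless for grid-based quadrature
for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal((n, d))
    estimate = (x ** 2).sum(axis=1).mean()  # Monte Carlo estimate of E[||x||^2]
    print(f"N={n:>9,}: estimate={estimate:8.3f} (true={d}), "
          f"relative error={abs(estimate - d) / d:.4f}")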

Dropout as Monte Carlo: At inference time, running $T$ forward passes with different dropout masks and averaging predictions is Monte Carlo integration over an approximate posterior distribution of network weights (Gal and Ghahramani, 2016). This gives uncertainty estimates from standard neural networks.

Monte Carlo Tree Search: Used in AlphaGo - sample possible game continuations to estimate position values without exhaustive search.

def monte_carlo_dropout_uncertainty(
    model: 'nn.Module',
    x: 'torch.Tensor',
    n_samples: int = 100
) -> tuple:
    """
    Monte Carlo Dropout for predictive uncertainty estimation.
    Gal & Ghahramani, ICML 2016.

    Run model in train mode (dropout active) n_samples times.
    Mean = prediction. Variance = epistemic uncertainty.

    Returns: (mean_prediction, uncertainty)
    """
    import torch
    model.train()  # keep dropout active for MC sampling
    predictions = []

    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(torch.softmax(pred, dim=-1))

    preds = torch.stack(predictions, dim=0)     # [n_samples, batch, n_classes]
    mean_pred = preds.mean(dim=0)               # [batch, n_classes]
    uncertainty = preds.var(dim=0).sum(dim=-1)  # total predictive variance [batch]

    return mean_pred, uncertainty



Production Engineering Notes

When to Use Sketching vs Exact Computation

The decision to use approximate methods should be data-driven. Document the accuracy tradeoff explicitly:

def choose_cardinality_method(n_unique_estimate: int,
                              memory_budget_mb: float,
                              error_tolerance: float) -> str:
    """
    Decision guide for cardinality estimation method selection.
    """
    exact_memory_mb = n_unique_estimate * 8 / (1024 * 1024)

    if exact_memory_mb <= memory_budget_mb:
        return f"Use exact set: {exact_memory_mb:.1f} MB fits in budget"
    elif error_tolerance >= 0.02:
        hll_memory_kb = 12  # standard HyperLogLog
        return f"Use HyperLogLog: ~{hll_memory_kb}KB, ~2% error"
    elif error_tolerance >= 0.005:
        hll_memory_kb = 48  # 4x the registers of the standard configuration
        return f"Use HyperLogLog (higher precision): ~{hll_memory_kb}KB, ~0.8% error"
    else:
        return "Exact computation required: error tolerance too tight for sketching"

Combining Sketches for Distributed ML

One of the most powerful properties of sketches is mergeability: you can compute sketches independently on different data shards and merge them. This enables distributed feature computation:

# Pattern: distributed Count-Min Sketch merge
# Worker 1 processes shard 1, Worker 2 processes shard 2
# Master merges sketches - no raw data transfer needed
def distributed_frequency_estimation(shards: list) -> 'CountMinSketch':
    """
    Each worker builds a CMS on its shard.
    Master merges by element-wise addition.
    Memory transfer: O(1/eps * log(1/delta)) per worker (tiny)
    vs O(n) raw data per worker (huge).
    """
    local_sketches = []
    for shard in shards:
        cms = CountMinSketch(epsilon=0.001, delta=0.01)
        for item in shard:
            cms.update(item)
        local_sketches.append(cms)

    # Merge: element-wise sum for CMS (element-wise max for HLL)
    merged = local_sketches[0]
    for sketch in local_sketches[1:]:
        merged = merged.merge(sketch)

    return merged

Warning: JL Projection Preserves Distances, Not Angles

The Johnson-Lindenstrauss lemma preserves Euclidean distances. If your similarity metric is cosine similarity (angle between vectors), you need to normalize vectors to unit length before projecting, or use random hyperplane projections (which directly approximate cosine similarity). Using JL projections on unnormalized vectors for cosine similarity will give wrong results.
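
A quick sanity check of this point, using made-up embeddings with widely varying norms (illustrative code added here, not part of the pipelines above): normalize to unit length first, and the projected cosine similarities track the originals closely.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 768)) * rng.uniform(0.1, 10.0, size=(1000, 1))
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize BEFORE projecting

R = rng.standard_normal((768, 128)) / np.sqrt(128)     # JL projection matrix
P = X_unit @ R

i, j = rng.integers(0, 1000, 200), rng.integers(0, 1000, 200)
cos_orig = (X_unit[i] * X_unit[j]).sum(axis=1)
cos_proj = (P[i] * P[j]).sum(axis=1) / (
    np.linalg.norm(P[i], axis=1) * np.linalg.norm(P[j], axis=1))
print(f"mean |cosine drift| after projection: {np.abs(cos_orig - cos_proj).mean():.4f}")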

Danger: Count-Min Sketch Always Over-Counts, Never Under-Counts

Hash collisions in CMS only add to counts - they never subtract. This means CMS estimates are always >= the true frequency. For rare items with true frequency near zero, the estimate can be dominated by collision noise from high-frequency items. If you need accurate frequency estimates for rare events (tail of the distribution), CMS is the wrong tool. Use exact counting for the long tail and CMS only for aggregate frequency estimation.
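
To see the effect concretely, the sketch below (reusing the CountMinSketch class and numpy import from earlier, on assumed Zipf-distributed data) compares the relative over-count for heavy hitters against items that appear only a handful of times:

np.random.seed(0)
vocab, stream_len = 50_000, 2_000_000
probs = np.arange(1, vocab + 1, dtype=float) ** -1.2
probs /= probs.sum()
stream = np.random.choice(vocab, size=stream_len, p=probs)
exact = np.bincount(stream, minlength=vocab)

cms = CountMinSketch(epsilon=0.001, delta=0.01)
for item in stream:
    cms.update(item)

head = np.argsort(exact)[::-1][:20]                         # heavy hitters
tail = [t for t in range(vocab) if 0 < exact[t] <= 5][:20]  # rare items
head_err = np.mean([(cms.query(t) - exact[t]) / exact[t] for t in head])
tail_err = np.mean([(cms.query(t) - exact[t]) / exact[t] for t in tail])
print(f"Mean relative over-count - heavy hitters: {head_err:.3f}, rare items: {tail_err:.1f}")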


Interview Questions and Answers

Q1: Explain reservoir sampling. Why does the replacement step with probability k/i maintain uniformity?

Reservoir sampling maintains the invariant that after processing $i$ elements, each element is in the reservoir with probability $k/i$. The proof is by induction.

Base case: after processing $k$ elements, each is in the reservoir with probability $k/k = 1$. Correct.

Inductive step: assume after $i-1$ elements, each is in the reservoir with probability $k/(i-1)$. When element $i$ arrives, we include it with probability $k/i$. For any previously seen element $j$: it is in the reservoir before step $i$ with probability $k/(i-1)$. Given it is in the reservoir, it stays with probability $1 - \frac{k/i}{k} = 1 - 1/i = (i-1)/i$ (the probability that its slot is NOT chosen for replacement). So $P[j \text{ in reservoir after step } i] = \frac{k}{i-1} \cdot \frac{i-1}{i} = k/i$. The new element is included with probability $k/i$. All elements are equally likely.

Q2: What is the Johnson-Lindenstrauss lemma and why does it matter for ML?

JL lemma: for any $n$ points in $\mathbb{R}^d$ and error tolerance $\epsilon \in (0,1)$, a random linear projection to $k = O(\log n / \epsilon^2)$ dimensions preserves all pairwise distances within a factor $(1 \pm \epsilon)$ with high probability. The projection matrix has i.i.d. Gaussian entries scaled by $1/\sqrt{k}$.

Why it matters for ML: (1) dimensionality reduction for ANN search - reduces BERT's 768D embeddings to 128D with controlled distortion, enabling faster indexing; (2) theoretical justification for why random feature hashing in neural networks preserves similarity; (3) kernel approximation via random features - the RFF method (Rahimi & Recht) uses JL-style random projections to approximate kernel matrices; (4) privacy-preserving ML - random projections can hide original features while preserving structure. The remarkable fact is that you do not need to optimize the projection - pure randomness is sufficient.

Q3: How does the Count-Min Sketch work, and what are its error guarantees?

CMS uses $d$ hash tables, each with $w$ counters and an independent hash function. On update, increment one counter in each table. On query, return the minimum across tables.

Guarantees: with probability $1 - \delta$, the estimate exceeds the true count by at most $\epsilon \cdot N$ (where $N$ is total stream weight), using $w = \lceil e/\epsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$.

Why minimum works: each hash function maps the element to one counter, which accumulates the true count plus noise from hash collisions. Taking the minimum gives the estimate with the least collision noise. Since collisions only increase counts (never decrease), the estimate is always an upper bound.

For token frequency in LLM training: with $\epsilon = 0.001$ and $\delta = 0.01$: $w \approx 2719$, $d = 5$. Total memory: 13,595 counters x 8 bytes ≈ 108 KB. This can track frequency for an unlimited vocabulary versus exact counting that requires 50,000 x 8 = 400 KB just for the counter array (not counting the hash map overhead).
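
The arithmetic generalizes to a quick sizing helper (a hypothetical utility added for illustration, assuming 8-byte counters):

import math

def cms_size(epsilon: float, delta: float, bytes_per_counter: int = 8) -> dict:
    """Width, depth, and memory for a Count-Min Sketch using the bounds above."""
    w = math.ceil(math.e / epsilon)
    d = math.ceil(math.log(1.0 / delta))
    return {"width": w, "depth": d, "kb": round(w * d * bytes_per_counter / 1024, 1)}

print(cms_size(0.001, 0.01))    # roughly the 5 x 2719 configuration above
print(cms_size(0.0001, 0.001))  # 10x tighter error costs ~10x the width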

Q4: What is randomized SVD and when should you use it instead of full SVD?

Randomized SVD (Halko, Martinsson, Tropp 2011) approximates the top-$k$ singular value decomposition of a matrix in $O(mnk)$ time by: (1) projecting the matrix onto a random low-dimensional subspace via a Gaussian sketch, (2) refining via power iteration, (3) running exact SVD on the small projected matrix, (4) recovering singular vectors in the original space.

Use randomized SVD when: the matrix is too large for full SVD (cost $O(mn^2)$), you only need the top-$k$ components (which is usually the case for PCA and LSA), and an approximation error is acceptable. Sklearn's TruncatedSVD uses this algorithm.
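
For reference, the library call is a one-liner; the shapes below are stand-ins for illustration (TruncatedSVD's randomized algorithm is its default):

import numpy as np
from sklearn.decomposition import TruncatedSVD

A = np.random.randn(5000, 2000)  # stand-in for a large document-term matrix
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
A_reduced = svd.fit_transform(A)  # [5000, 50] low-rank document representation
print(A_reduced.shape, svd.explained_variance_ratio_.sum())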

Rule of thumb: if $k < \min(m,n)/4$ and the matrix has a good low-rank approximation (fast singular value decay), randomized SVD is 10-100x faster with near-identical reconstruction error (within 1-5% of exact). For full-rank matrices with slow singular value decay, power iteration helps but full SVD may be necessary.

Q5: Explain LSH for approximate nearest neighbor search. What is the tradeoff between recall and speed?

LSH exploits the locality-sensitive property of random hyperplane projections: two vectors $x, y$ collide in hash $h(x) = \text{sign}(r \cdot x)$ with probability $1 - \theta(x,y)/\pi$, where $\theta$ is their angle. Close vectors (small angle) collide with high probability; distant vectors rarely collide.

With $k$ bits per hash table: vectors must match on all $k$ bits to collide. The collision probability for similar vectors is $(1 - \theta/\pi)^k$ - this falls off quickly, so mostly only truly similar vectors share hash buckets. With $L$ independent tables: a query returns candidates that collide in ANY table. Recall $= 1 - (1 - (1 - \theta/\pi)^k)^L$.
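
A quick numerical reading of that recall expression (illustrative parameter choices added here, not tuned for any dataset) shows how close and distant vectors separate:

import numpy as np

def lsh_recall(theta: float, k: int, L: int) -> float:
    p = 1.0 - theta / np.pi            # single-hyperplane collision probability
    return 1.0 - (1.0 - p ** k) ** L   # collide in at least one of L tables

for theta_deg in (20, 60):
    for k, L in [(8, 4), (16, 8), (16, 32)]:
        r = lsh_recall(np.radians(theta_deg), k, L)
        print(f"angle={theta_deg:3d} deg  k={k:2d}  L={L:2d}  recall={r:.3f}")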

The tradeoff: more hash tables $L$ increases recall (more candidates found) at the cost of speed (more buckets to check). More bits $k$ increases precision (fewer false positives) at the cost of recall (fewer true positives pass the strict bit-match). In production: Faiss's billion-scale ANN indexes (e.g., IVF with product quantization) use learned coarse partitioning rather than classic LSH, but the tuning story is the same - partition and probe parameters are set empirically to hit a target recall-latency curve. A common operating point is 95% recall@10 at 2-5ms latency for 100M vectors.

Q6: SGD is described as a randomized algorithm. What does this mean, and how does it connect to sketching?

Standard gradient descent computes the exact gradient over the full dataset: $g = \nabla_\theta \frac{1}{n} \sum_{i=1}^n L(f_\theta(x_i), y_i)$. This is exact but costs $O(n)$ per step.

SGD approximates this with a mini-batch of size $b$: $\hat{g} = \nabla_\theta \frac{1}{b} \sum_{i \in \mathcal{B}} L(f_\theta(x_i), y_i)$. The mini-batch is a random sample - a "sketch" of the gradient. The expectation is correct ($\mathbb{E}[\hat{g}] = g$), but there is variance that decreases as $O(1/b)$.
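
A toy check of both properties on a synthetic least-squares problem (illustrative code added here, not from the pipelines above): the mini-batch gradient matches the full gradient on average, and its variance shrinks as the batch size grows.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
theta = rng.standard_normal(d)

full_grad = X.T @ (X @ theta - y) / n  # exact gradient of 0.5 * mean squared error
for b in (8, 64, 512):
    grads = []
    for _ in range(200):
        idx = rng.choice(n, size=b, replace=False)
        grads.append(X[idx].T @ (X[idx] @ theta - y[idx]) / b)
    grads = np.array(grads)
    print(f"b={b:4d}  |mean minibatch grad - full grad|="
          f"{np.linalg.norm(grads.mean(axis=0) - full_grad):.4f}  "
          f"total variance={grads.var(axis=0).sum():.4f}")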

The connection to sketching: just as Count-Min Sketch approximates exact counts with provable error bounds, SGD approximates exact gradients with provable convergence. For convex functions, SGD with step size $\eta$ converges at rate $O(1/\sqrt{T})$. For non-convex functions (neural networks), convergence to stationary points holds under mild assumptions on gradient noise.

The practical implication: the noise from random mini-batches is not just a computational convenience. It acts as regularization (helps escape sharp minima) and is why models trained with smaller batch sizes often generalize better - the noisy gradient landscape forces the optimizer to find wider, more robust minima.


Summary

Randomized algorithms are not approximations of last resort - they are optimal algorithms for many ML workloads. Reservoir sampling is the only correct way to maintain a uniform sample from an unbounded stream. HyperLogLog is the practical solution for any "count distinct" problem over large streams. Count-Min Sketch is the right tool for frequency estimation when memory is constrained. Randomized SVD is the standard for PCA on large matrices.

The unifying theme is the tradeoff between exactness and efficiency, formalized as probabilistic guarantees: error at most $\epsilon$ with probability at least $1 - \delta$. Every algorithm in this lesson has this form, and the parameters $\epsilon$ and $\delta$ are tunable to match your application's requirements.

Understanding these guarantees matters as much as knowing the algorithms. When you deploy a Count-Min Sketch in a production feature store, you need to know that it over-counts, never under-counts, and the error grows with the total stream volume - not just with the frequency of the specific item. When you use randomized SVD, you need to know that power iteration is critical for matrices without rapid singular value decay. These are the practical details that separate correct use from cargo-culting.

© 2026 EngineersOfAI. All rights reserved.