:::tip ๐ฎ Interactive Playground Visualize this concept: Try the ANN Algorithms demo on the EngineersOfAI Playground - no code required. :::
Approximate Nearest Neighbor Algorithms
The Scaling Problemโ
The team had been running FAISS IndexFlatL2 in production for six months, and it worked fine - until it did not. With 1 million vectors at 768 dimensions, search latency was a comfortable 45ms. At 10 million, it climbed to 450ms. At 100 million, the math was clear: flat search would require 4.5 seconds per query, and the index would not even fit in RAM on a single machine.
The engineering team escalated: they needed sub-100ms search at 100M+ vectors. Someone recommended "just switch to HNSW." The engineer pulled up the FAISS documentation, saw parameters like M=16, efConstruction=200, efSearch=50, and had no mental model for what any of these controlled or what values to set. They set M=32 (it sounded bigger, therefore better) and efSearch=10 (it sounded faster), and shipped it. Recall dropped from 1.0 to 0.67. Users started complaining that the search was getting worse, not better.
This is a solvable problem, but only if you understand what HNSW is actually doing, what M and efSearch control, and how to reason about the recall-vs-latency tradeoff before you set any parameters. This lesson gives you that understanding for every major ANN algorithm: HNSW, IVF, product quantization, IVFPQ, LSH, and DiskANN.
After this lesson, you will be able to look at a dataset with given size, dimensionality, and update frequency, reason through which algorithm class fits, and tune the key parameters with a principled approach rather than guessing.
Why Exact Search Fails at Scaleโ
The fundamental problem is that exact nearest neighbor search is per query, where is the number of vectors and is dimensionality. This scales linearly with dataset size.
For and : that is floating-point multiply-accumulate operations per query. Even at 10 TFLOPS, that is 7.68 seconds per query. No caching, batching, or SIMD vectorization changes the asymptotic scaling - you are paying full price for every additional document in the corpus.
ANN algorithms escape this trap by sacrificing exactness. Rather than guaranteeing the true nearest neighbor, they guarantee finding it with high probability (typically 95โ99%) while reducing the effective search to a small fraction of the full dataset. The savings are not marginal: well-tuned HNSW at 100M vectors achieves 1000+ queries per second with recall@10 above 0.95 on a single machine.
Algorithm 1: HNSW - Hierarchical Navigable Small Worldโ
HNSW (Malkov and Yashunin, 2018) is the dominant algorithm for in-memory ANN search. It is the default index in Qdrant, the primary index in Weaviate, and available in FAISS. Understanding HNSW is the most important thing you can do in this space.
The Intuition: A Navigational Highway Systemโ
Think of HNSW as a multi-layer road network. At the highest layer (layer 2), you have a few highways connecting distant nodes. At the middle layer (layer 1), you have more roads with shorter connections. At the base layer (layer 0), every vector is a node with many short-range connections to its actual neighbors.
To find the nearest neighbor of a query:
- Enter the graph at the top layer at a fixed entry node
- Greedily traverse to the closest node at that layer
- Descend to the next layer and repeat
- At the base layer, explore the local neighborhood thoroughly
The hierarchy allows the algorithm to navigate from "somewhere in the right general region" to "exactly the right local neighborhood" in hops instead of comparisons.
The Math Behind the Structureโ
At index construction, each new vector is inserted at a randomly chosen maximum layer , drawn from an exponential distribution:
where and is the target number of connections per node. This exponential distribution ensures most vectors are only in the base layer, with exponentially fewer vectors at each higher layer - exactly like a skip list.
At each layer where the node is inserted, it connects to its nearest neighbors (or for the base layer, often set to ).
The Two Key Parametersโ
M (default: 16) - the number of bidirectional links created per node at insertion. Higher M means:
- Better recall (more connections, richer graph)
- Larger memory footprint ( per layer for the graph structure)
- Slower index construction
- Rule of thumb: M=8โ16 for most use cases; M=32โ64 for maximum recall on high-dimensional data
efSearch (default: 50) - the size of the dynamic candidate list during search. Higher efSearch means:
- Better recall (explores more candidate nodes)
- Slower query time (more distance computations)
- This is the primary runtime tuning knob - adjust this without rebuilding the index
efConstruction (default: 200) - the ef parameter used during index building. Higher efConstruction:
- Better index quality (more careful insertion)
- Slower index building (not query time)
- Typically set 2โ4ร higher than your target efSearch
import faiss
import numpy as np
import time
def build_hnsw_index(
vectors: np.ndarray,
M: int = 16,
ef_construction: int = 200,
) -> faiss.IndexHNSWFlat:
"""
Build FAISS HNSW index. Note: FAISS HNSW uses L2 distance internally.
For cosine similarity, normalize vectors before indexing.
"""
d = vectors.shape[1]
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = ef_construction
# FAISS HNSW does not support incremental add well - add all at once
index.add(vectors.astype(np.float32))
return index
def search_hnsw(
index: faiss.IndexHNSWFlat,
query: np.ndarray,
k: int = 10,
ef_search: int = 50,
) -> tuple:
"""Search HNSW with configurable efSearch."""
index.hnsw.efSearch = ef_search
distances, indices = index.search(
query.reshape(1, -1).astype(np.float32), k
)
return indices[0], distances[0]
def tune_ef_search(
index: faiss.IndexHNSWFlat,
query_vectors: np.ndarray,
true_neighbors: list,
k: int = 10,
) -> None:
"""
Find the efSearch value that hits your recall target at minimum latency.
Run this during load testing before going live.
"""
ef_values = [10, 20, 40, 80, 120, 200, 400]
print(f"{'efSearch':>10} | {'Recall@10':>10} | {'P95 Latency (ms)':>18}")
print("-" * 45)
for ef in ef_values:
results = []
latencies = []
for q in query_vectors[:200]:
t0 = time.perf_counter()
idx, _ = search_hnsw(index, q, k=k, ef_search=ef)
latencies.append((time.perf_counter() - t0) * 1000)
results.append(list(idx))
# Compute recall
recalls = []
for true, pred in zip(true_neighbors[:200], results):
true_set = set(true[:k])
pred_set = set(pred[:k])
recalls.append(len(true_set & pred_set) / k)
p95 = float(np.percentile(latencies, 95))
recall = float(np.mean(recalls))
print(f"{ef:>10} | {recall:>10.4f} | {p95:>18.2f}")
Algorithm 2: IVF - Inverted File Indexโ
IVF (Johnson et al., FAISS paper, 2017) takes a different approach: partition the vector space into clusters, and only search the clusters nearest to the query.
How IVF Worksโ
- Training: Run k-means clustering on a representative sample of the dataset to create centroids (cluster centers)
- Indexing: Assign each vector to its nearest centroid; store an inverted list of vectors per centroid
- Search: Find the nearest centroids to the query; search only the vectors in those clusters
If you have 10,000 clusters (nlist=10000) and search 50 of them (nprobe=50), you only compute distances for roughly 0.5% of the dataset. That is a 200ร speedup with recall depending on how well-clustered the data is.
IVF Parametersโ
nlist - number of clusters. Rule of thumb: for small datasets, for better recall. For 10M vectors: nlist = 4000โ16000. Training requires at least 30โ50 vectors per centroid, so you need a representative training set.
nprobe - number of clusters to search at query time. This is the runtime tradeoff knob (like efSearch in HNSW):
- nprobe=1: search only the nearest cluster (fast, low recall)
- nprobe=nlist: search all clusters (exact search, slow)
- Typical target: nprobe where recall@10 >= 0.95
def build_ivf_index(
vectors: np.ndarray,
nlist: int = 4096,
metric: int = faiss.METRIC_L2,
) -> faiss.IndexIVFFlat:
"""
Build IVF flat index. Requires a training step on a representative sample.
"""
d = vectors.shape[1]
quantizer = faiss.IndexFlatL2(d) # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist, metric)
# Train on a representative sample
# Need at least 30 * nlist vectors for good centroid quality
n_train = min(len(vectors), max(50 * nlist, 100_000))
train_sample = vectors[:n_train].astype(np.float32)
print(f"Training IVF with {n_train} vectors, {nlist} clusters...")
index.train(train_sample)
# Add all vectors
index.add(vectors.astype(np.float32))
return index
def search_ivf(
index: faiss.IndexIVFFlat,
query: np.ndarray,
k: int = 10,
nprobe: int = 64,
) -> tuple:
index.nprobe = nprobe
distances, indices = index.search(
query.reshape(1, -1).astype(np.float32), k
)
return indices[0], distances[0]
Algorithm 3: Product Quantizationโ
Product Quantization (PQ) is a compression technique that reduces memory footprint by 8โ64ร at the cost of some recall. It is often combined with IVF to create IVFPQ, the workhorse for billion-scale search.
The Compression Ideaโ
A 768-dimensional float32 vector takes 768 ร 4 = 3072 bytes. For 100M vectors, that is 307 GB. PQ compresses vectors by:
- Splitting the vector into sub-vectors of dimension
- Independently quantizing each sub-vector to one of centroids (typically 256, fitting in 1 byte)
- Storing the centroid indices instead of the raw sub-vector
With sub-vectors and centroids per sub-vector: each vector is compressed to 96 bytes (from 3072 bytes - a 32ร compression). The codebooks for distance lookup are pre-computed, so approximate distances can be computed with table lookups rather than floating-point multiply-accumulate operations.
where is the centroid assigned to the -th sub-vector of , and is the -th sub-vector of the query.
def build_pq_index(
vectors: np.ndarray,
m: int = 96, # number of sub-vectors (must divide d)
n_bits: int = 8, # bits per sub-vector (256 centroids)
) -> faiss.IndexPQ:
"""
Build PQ index. Provides ~32x compression with some recall loss.
Good for memory-constrained environments.
"""
d = vectors.shape[1]
assert d % m == 0, f"d={d} must be divisible by m={m}"
index = faiss.IndexPQ(d, m, n_bits)
index.train(vectors[:50_000].astype(np.float32))
index.add(vectors.astype(np.float32))
return index
Algorithm 4: IVFPQ - The Production Workhorseโ
IVFPQ combines IVF's coarse partitioning with PQ's compression. It is the standard approach for billion-scale vector search.
The search process:
- Find the nearest IVF centroids to the query
- For each vector in those clusters, use the pre-computed PQ distance tables to quickly estimate the true distance
- Return the top-k by approximate distance
IVFPQ at 1 billion vectors with nlist=65536, m=64, nprobe=64 requires roughly 64 GB RAM (for compressed vectors) rather than 3 TB for the raw vectors. Search QPS exceeds 1000 at recall@10 above 0.92 on modern hardware.
def build_ivfpq_index(
vectors: np.ndarray,
nlist: int = 4096,
m: int = 96,
n_bits: int = 8,
) -> faiss.IndexIVFPQ:
"""
IVFPQ: the workhorse for large-scale production search.
Combines IVF partitioning with PQ compression.
"""
d = vectors.shape[1]
assert d % m == 0, f"d={d} must be divisible by m={m}"
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, n_bits)
n_train = min(len(vectors), 50 * nlist)
index.train(vectors[:n_train].astype(np.float32))
index.add(vectors.astype(np.float32))
return index
Algorithm 5: LSH - Locality Sensitive Hashingโ
LSH (Indyk and Motwani, 1998) takes a fundamentally different approach. Instead of building a graph or clustering vectors, it uses a family of hash functions with the property that similar vectors collide (get the same hash) with higher probability than dissimilar vectors.
For cosine similarity, random hyperplane hashing works: for each hash function , draw a random vector , and set . With such hash functions, similar vectors produce identical hash signatures with probability where is the angle between them.
Why LSH has fallen out of favor for most applications: LSH requires a large number of hash tables to achieve good recall, which increases memory. It also struggles with non-uniform data distributions. HNSW at the same recall level is typically faster, uses less memory, and is simpler to configure. LSH remains relevant for streaming applications and cases where you cannot afford index construction time.
Algorithm 6: DiskANNโ
DiskANN (Subramanya et al., NeurIPS 2019, later commercialized by Microsoft/Azure) solves a specific problem: HNSW requires the entire graph to fit in RAM. At 1B vectors ร 768 dimensions ร 4 bytes = 3 TB, that does not fit in RAM on any reasonably-priced machine.
DiskANN stores most of the index on SSD and uses a smart prefetching strategy: given a node in the graph, prefetch the disk pages for its neighbors before navigating there. The graph is compressed with PQ in memory as a guide, with full precision vectors on disk for final scoring. This enables billion-scale search on machines with 64 GB RAM instead of 3 TB.
DiskANN-style storage is now available in Qdrant (on-disk HNSW) and Azure AI Search.
Algorithm Selection Frameworkโ
When update frequency mattersโ
| Update Pattern | Best Algorithm | Why |
|---|---|---|
| Mostly static | IVFPQ | Compression fine; rebuild on batch update |
| Frequent individual inserts | HNSW | Graph supports incremental insertion |
| High churn (delete + insert) | HNSW with periodic reindex | PQ index does not support efficient deletes |
| Streaming with strict SLA | LSH or ScaNN | Hash-based updates are per vector |
The Recall vs QPS Curveโ
Every ANN algorithm exposes a tradeoff between recall@K and queries per second (QPS). The primary knob is always "how much of the search space to explore":
- For HNSW: efSearch
- For IVF/IVFPQ: nprobe
The correct workflow for tuning is:
- Establish a recall target (e.g., recall@10 >= 0.95)
- Generate a test set of queries with ground truth from exact search
- Sweep the exploration parameter from low to high
- Find the minimum parameter value that hits your recall target
- Benchmark QPS at that parameter value under realistic concurrent load
Never tune parameters without measuring recall. A fast index with 0.67 recall is not a fast index - it is a broken one.
def recall_qps_sweep(index, query_vectors, true_neighbors, k=10):
"""
Generate recall vs QPS tradeoff curve.
Works for any FAISS index - set the exploration parameter externally.
"""
# For HNSW: ef_values are efSearch values
# For IVF: ef_values are nprobe values
ef_values = [10, 20, 40, 80, 160, 320]
results = []
for ef in ef_values:
# Caller must set index parameter before this function
# (e.g., index.hnsw.efSearch = ef or index.nprobe = ef)
t0 = time.perf_counter()
n_queries = min(500, len(query_vectors))
all_predicted = []
for q in query_vectors[:n_queries]:
distances, indices = index.search(
q.reshape(1, -1).astype(np.float32), k
)
all_predicted.append(list(indices[0]))
elapsed = time.perf_counter() - t0
qps = n_queries / elapsed
recalls = []
for true, pred in zip(true_neighbors[:n_queries], all_predicted):
true_set = set(true[:k])
pred_set = set(pred[:k])
recalls.append(len(true_set & pred_set) / k)
results.append({
"ef": ef,
"recall_at_k": float(np.mean(recalls)),
"qps": float(qps),
})
return results
Production Engineering Notesโ
Index building time is a production constraint. Building an HNSW index over 100M vectors with M=16, efConstruction=200 takes 2โ4 hours on a 32-core machine. Plan index rebuilds as offline jobs that build to a staging location, validate recall, and swap atomically. Never rebuild in-place on the live serving index.
Memory capacity planning for HNSW. Each vector stored in HNSW requires: raw vector storage ( bytes for float32) plus graph overhead ( bytes for the bidirectional links). For 10M vectors at d=768 and M=16: 10M ร 768 ร 4 = 30 GB for vectors + 10M ร 16 ร 2 ร 4 = 1.28 GB for graph. Total: ~31 GB plus OS overhead. Plan for 1.5โ2ร to have room for index construction.
HNSW and deletes. HNSW does not efficiently support deletes. Deleting a node from the middle of the graph requires relinking its neighbors. Most implementations mark vectors as deleted (soft delete) and filter them out during search, which wastes memory and degrades recall as more vectors accumulate as soft-deleted. Rebuild the index periodically to clean up.
Common Mistakesโ
:::danger Setting efSearch equal to efConstruction during index building efConstruction is a build-time quality parameter. Setting efSearch very high does not "use more of the efConstruction quality" - it just makes queries slower. The index quality is baked in at construction time. At search time, use the minimum efSearch that hits your recall target. :::
:::danger Using IVFPQ without adequate training data IVF k-means requires at least 30โ50 vectors per centroid for reliable cluster quality. With nlist=10000, you need at least 300,000 training vectors. Training on fewer vectors produces poor centroids, low recall, and non-deterministic behavior when vectors land in centroids far from their true neighbors. :::
:::warning Assuming algorithm A is always faster than algorithm B Performance depends heavily on hardware, dataset characteristics (dimensionality, distribution), and your specific recall target. HNSW beats IVFPQ at moderate scale; IVFPQ wins when RAM is the bottleneck. Always benchmark on your own dataset. ANN-Benchmarks results on SIFT or GIST may not reflect your embedding model's distribution. :::
:::tip Use FAISS's factory string for rapid prototyping FAISS supports an index factory string syntax for rapid index configuration:
# "IVF4096,Flat" = IVF with 4096 clusters, flat (no PQ)
# "IVF4096,PQ64" = IVFPQ with 4096 clusters and 64 sub-vectors
# "HNSW32,Flat" = HNSW with M=32
# "PCA64,IVF4096,PQ32" = PCA reduction to 64 dims, then IVFPQ
index = faiss.index_factory(d, "IVF4096,PQ96x8", faiss.METRIC_L2)
Use this for experiments. Specify parameters explicitly for production. :::
Interview Questionsโ
Q1: Explain how HNSW achieves sub-linear search complexity.
HNSW builds a multi-layer graph where higher layers have fewer nodes with long-range connections and lower layers have more nodes with short-range connections. Search starts at the top layer and greedily navigates toward the query, descending layer by layer. Each layer reduces the search space by a factor proportional to the average node degree . The result is hops to reach the right local region, then a thorough local search at the base layer. The key insight is that the exponential layer assignment ensures the graph has small-world properties - any two nodes can be connected in hops.
Q2: What are the M and efSearch parameters in HNSW, and how do you tune them?
M is the number of bidirectional links per node in the graph. Higher M means richer connectivity, better recall, more memory, and slower construction. Typical values: M=8 for speed, M=16 for balance, M=32โ64 for maximum recall. efSearch is the search beam width - the size of the candidate list maintained during graph traversal. Higher efSearch improves recall at the cost of more distance computations. Tune efSearch at query time: sweep from 10 to 500, measure recall@10 against exact search ground truth, and use the minimum value that hits your recall target (e.g., 0.95). Never set efSearch without measuring recall.
Q3: When would you choose IVFPQ over HNSW?
When RAM is the binding constraint. HNSW requires all vectors in memory: 100M vectors ร 768 dims ร 4 bytes = 307 GB, plus graph overhead. IVFPQ with 8-bit PQ compresses each vector to 96 bytes (from 3072), reducing the 100M vector footprint to ~9 GB. The tradeoff: IVFPQ has slightly lower recall at equivalent latency compared to HNSW due to quantization error. Choose IVFPQ when you cannot afford the RAM for full HNSW, or when going beyond 100M vectors on a single machine.
Q4: A dataset has high update frequency with frequent deletes. Which algorithm would you choose?
HNSW with periodic full rebuilds. HNSW supports incremental inserts (you can call index.add() at any time), which IVF/IVFPQ do not efficiently support. For deletes, use soft deletion (mark vectors as deleted in metadata, filter at query time) and schedule periodic index rebuilds to reclaim memory and restore recall degraded by tombstoned entries. The rebuild can be done in the background on a shadow copy, with atomic swap once recall validation passes.
Q5: Explain product quantization. What is the memory vs recall tradeoff?
PQ splits each d-dimensional vector into m sub-vectors of d/m dimensions each. Each sub-vector is quantized to one of 256 centroids (1 byte), learned by running separate k-means on each sub-vector dimension independently. The full vector is represented by m bytes instead of dร4 bytes - a 4d/mร compression ratio. For d=768, m=96: 32ร compression (3072 bytes โ 96 bytes). The recall loss comes from quantization error: the approximate distance computed from centroid lookups deviates from the true distance. In practice, recall@10 drops 3โ8% compared to exact IVF at the same nprobe value. The memory savings make it worthwhile for billion-scale search where exact storage is impossible.
Q6: You need to serve 1 billion 1536-dimensional vectors at recall@10 = 0.92 and p99 latency under 200ms. Design the solution.
1536-dimensional vectors at 1B items = 6 TB raw, which requires a disk-based approach. Use DiskANN or Qdrant's on-disk HNSW: build the HNSW graph structure, store full-precision vectors on NVMe SSD (random read latency 50โ100 ยตs), keep a PQ-compressed version in RAM as a coarse guide for navigation. For 200ms p99: NVMe I/O at 50 ยตs per random read ร ~20โ30 hops = ~1.5ms I/O latency, leaving ample room. Shard across 4โ8 machines for load distribution. Use 3 replicas per shard for fault tolerance. Index construction: run as an offline batch job on a machine with large RAM, then copy the index shards to serving nodes.
