What is approximate nearest neighbor?

Master the approximate nearest neighbor algorithms powering vector search - HNSW, IVF, IVF-PQ, ScaNN, and DiskANN with parameter tuning and recall-latency trade-offs.

How does HNSW work in practice?

Retrieval Algorithms and ANN covers approximate nearest neighbor, HNSW, IVF index from first principles with code examples. Free lesson at https://engineersofai.com/docs/llms/rag-systems/retrieval-algorithms-ann

What is the difference between approximate nearest neighbor and IVF index?

See the full breakdown at https://engineersofai.com/docs/llms/rag-systems/retrieval-algorithms-ann

Retrieval Algorithms and ANN

The 3-Second Query That Couldn't Ship

The ML infrastructure team at a mid-size fintech was building semantic search over 50 million financial documents - regulatory filings, analyst reports, internal research. The embedding pipeline worked beautifully. The vector store was populated. Then they ran the first production query.

3.2 seconds. For a single search.

The calculation was simple and devastating. Exact nearest neighbor search over 50M vectors at 1536 dimensions required computing 50M cosine similarities. Each computation is roughly $2d$ floating-point operations, so $50M \times 3072 = 153.6$ billion FLOPs per query. A modern CPU can do roughly 100 GFLOPS. That's 1.5 seconds theoretical minimum, and in practice with memory bandwidth constraints, more like 3-5 seconds.

At 100 queries per second (their target), they'd need 300-500 CPU cores dedicated to similarity computation. The infrastructure cost was prohibitive. The latency was unusable.

The solution was approximate nearest neighbor indexing - trading a small amount of retrieval accuracy (the approximate results differ from exact results by a few percent) for a 1000x reduction in query time. The first HNSW index they built brought that 3.2-second query to 8ms with 95% recall@10 versus exact search. They shipped.

This lesson explains how ANN algorithms work, how to configure them, and how to choose between them for your latency-recall-memory budget.

Why This Exists: The Curse of Dimensionality

Nearest neighbor search is easy in 2 or 3 dimensions - you build a spatial tree (KD-tree, ball tree) and search in $O(\log n)$ time. But these tree-based structures break down as dimensionality increases. The "curse of dimensionality" means that in high dimensions, the distance to the nearest neighbor and the distance to the farthest neighbor converge - all points look approximately equidistant. Tree partitioning stops being informative.

At 1536 dimensions (OpenAI embedding size), any spatial tree degrades to near-linear scan performance. You need fundamentally different approaches.

The ANN research community spent the 2000s and 2010s developing approaches that work in high dimensions. The major contributions:

LSH (Locality-Sensitive Hashing): hash similar vectors to the same bucket. Approximate, but hash collision rates are hard to control.
IVF (Inverted File Index, ~2003): cluster vectors, search only nearby clusters.
HNSW (2016, Malkov & Yashunin): navigable small-world graph, greedy graph search.
ScaNN (2020, Google): anisotropic quantization tuned for inner product search.
DiskANN (2019, Microsoft): disk-resident graph for billion-scale datasets.

Exact Nearest Neighbor: The Baseline

Before discussing approximations, understand what exact search looks like and why it fails.

Linear scan: For each query vector $q$ , compute $\text{sim}(q, v_i)$ for all $n$ vectors $v_i$ . Return the top $k$ .

Complexity: $O(n \cdot d)$ per query.

When exact search is viable:

Under 100K vectors: linear scan takes roughly 10-50ms depending on dimension
You can't tolerate approximation error (rare in practice - 99% recall is usually fine)
Using FAISS flat index with GPU acceleration (GPU does $\sim$ 100 TFLOPS, can handle 1M vectors in <10ms)

import faiss
import numpy as np

# Generate synthetic data
n_vectors = 100_000
dim = 1536
np.random.seed(42)
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
# Normalize for cosine similarity (converts to IP search)
faiss.normalize_L2(vectors)

# Exact search (IndexFlatIP = flat inner product, works with normalized vectors)
index_flat = faiss.IndexFlatIP(dim)
index_flat.add(vectors)

# Query
query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

import time
t0 = time.time()
scores, indices = index_flat.search(query, k=10)
print(f"Exact search on {n_vectors} vectors: {(time.time()-t0)*1000:.1f}ms")
# ~50-200ms depending on hardware

HNSW: The Dominant Algorithm

Hierarchical Navigable Small World (Malkov & Yashunin, 2016) is the algorithm behind most production vector databases. Understanding it is essential for tuning retrieval.

The Small-World Graph Intuition

Imagine you need to find someone in a social network of 100M people. You don't check all 100M - you start with people you know who have broad connections (e.g., well-connected professionals), ask them to refer you toward your target, and navigate through 6 degrees of separation in roughly $\log(n)$ hops.

HNSW builds exactly this kind of navigable graph. The "small-world" property means any two nodes can be reached via a short path. The "hierarchical" part means there's a multi-layer structure: top layers are sparse with long-range connections (like knowing international connectors), lower layers are dense with local connections (like knowing your immediate neighbors).

Construction

Building an HNSW index:

Start with an empty graph
For each new vector $v$ : a. Determine its insertion layer $\ell$ by sampling from an exponential distribution ( $P(\ell) \propto e^{-\ell}$ ) b. Starting from the entry point, greedily navigate from the top layer down to layer $\ell+1$ c. At layer $\ell$ down to layer 0, find the nearest neighbors of $v$ using beam search with beam width ef_construction d. Bidirectionally connect $v$ to its $M$ nearest neighbors at each layer (pruning to keep at most $M$ connections per node)

Key parameters:

M: Maximum connections per node per layer. Controls graph connectivity. Typical values: 8-64. Higher M = better recall, more memory, slower construction.
ef_construction: Beam width during construction. Controls how carefully we select neighbors. Typical values: 100-200. Higher = better graph quality, slower construction.

Search

import faiss
import numpy as np
import time

n_vectors = 1_000_000
dim = 1536

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# HNSW index parameters
M = 32              # connections per node (memory/quality trade-off)
ef_construction = 128  # build quality (higher = better graph, slower build)

index_hnsw = faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)
index_hnsw.hnsw.efConstruction = ef_construction

print(f"Building HNSW index on {n_vectors} vectors...")
t0 = time.time()
index_hnsw.add(vectors)
build_time = time.time() - t0
print(f"Build time: {build_time:.1f}s")

# Set ef (search beam width) - controls recall-latency trade-off at query time
index_hnsw.hnsw.efSearch = 64   # higher = more recall, slower queries

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

t0 = time.time()
scores, indices = index_hnsw.search(query, k=10)
latency_ms = (time.time() - t0) * 1000
print(f"HNSW search: {latency_ms:.2f}ms")

# Measure recall@10 vs exact search
def measure_recall(approx_index, exact_index, queries, k=10):
    _, exact_ids = exact_index.search(queries, k)
    _, approx_ids = approx_index.search(queries, k)
    recalls = []
    for exact, approx in zip(exact_ids, approx_ids):
        hits = len(set(exact) & set(approx))
        recalls.append(hits / k)
    return np.mean(recalls)

test_queries = np.random.randn(100, dim).astype(np.float32)
faiss.normalize_L2(test_queries)

index_flat = faiss.IndexFlatIP(dim)
index_flat.add(vectors)

recall = measure_recall(index_hnsw, index_flat, test_queries)
print(f"Recall@10 vs exact: {recall:.3f}")

The ef vs Recall/Latency Trade-Off

efSearch (or ef in some implementations) is the most important query-time parameter. It controls the beam width during graph search - how many candidates are tracked during navigation.

efSearch	Recall@10	Latency (1M vectors)
16	~0.80	~2ms
32	~0.90	~4ms
64	~0.95	~8ms
128	~0.98	~16ms
200	~0.99	~25ms

The right value depends on your SLA. For most RAG applications, Recall@10 of 0.95+ at under 10ms is achievable and sufficient.

IVF: Cluster-Based Retrieval

Inverted File Index partitions the vector space into $nlist$ clusters (using k-means). To search, it identifies the $nprobe$ closest clusters to the query and searches only within those clusters.

Intuition: If your query is about financial regulation, the most relevant documents are probably clustered near other financial documents. You don't need to check the cooking recipe cluster.

Strengths: Very memory-efficient. Training the cluster centroids is a one-time cost. Easy to understand and debug.

Limitations: Boundary artifacts - vectors near cluster boundaries may be in the wrong cluster. Low nprobe = fast but poor recall; high nprobe = good recall but approaches linear scan.

import faiss
import numpy as np

n_vectors = 1_000_000
dim = 1536
nlist = 1024     # number of clusters - rule of thumb: sqrt(n) to 4*sqrt(n)

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# IVF requires training on a sample to learn cluster centroids
quantizer = faiss.IndexFlatIP(dim)  # exact search for centroid assignment
index_ivf = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

# Train on a sample (usually 10-100x nlist samples)
train_size = min(100 * nlist, n_vectors)
index_ivf.train(vectors[:train_size])
index_ivf.add(vectors)

# nprobe: how many clusters to search - the recall/latency dial
index_ivf.nprobe = 32  # check 32/1024 = 3% of clusters

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)
scores, indices = index_ivf.search(query, k=10)

nlist selection:

Under 1M vectors: nlist = int(4 * sqrt(n_vectors)) - roughly 4000 for 1M vectors
Over 1M vectors: nlist = 65536 (256 IVF centers with sub-quantization)

nprobe selection: Start at nprobe = nlist / 32 (check ~3% of clusters). Measure recall. Increase until you hit recall target or latency budget.

IVF-PQ: Memory Compression with Product Quantization

At scale (100M+ vectors), storing full 1536-dimensional float32 vectors requires enormous memory. 100M × 1536 × 4 bytes = 614 GB. This doesn't fit in RAM.

Product Quantization (PQ) compresses vectors by dividing the $d$ -dimensional space into $m$ subspaces of $d/m$ dimensions each, and encoding each subspace as a code from a codebook of $k^*$ centroids (typically $k^* = 256$ ). Each subspace is stored as 1 byte (the centroid index). A 1536-dim vector is compressed from 6144 bytes to 96 bytes (at $m=96$ ) - a 64x compression ratio.

import faiss
import numpy as np

n_vectors = 10_000_000  # 10M vectors
dim = 1536

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# IVF-PQ: cluster-based search with compressed vector storage
nlist = 8192   # clusters
M = 96         # number of PQ subspaces (dim must be divisible by M)
# 1536 / 96 = 16 dims per subspace
# Each subspace uses 1 byte (256 centroids)
# Total per vector: 96 bytes (vs 6144 bytes for float32)

quantizer = faiss.IndexFlatIP(dim)
index_ivfpq = faiss.IndexIVFPQ(quantizer, dim, nlist, M, 8)
# 8 = bits per code (256 centroids per subspace)

# Training requires sufficient data to learn both IVF and PQ centroids
train_vectors = vectors[:500_000]
print("Training IVF-PQ index...")
index_ivfpq.train(train_vectors)
index_ivfpq.add(vectors)

print(f"Memory with full vectors: {n_vectors * dim * 4 / 1e9:.1f} GB")
# IVF-PQ memory: much smaller
# n_vectors * M bytes + nlist * dim * 4 bytes (centroids)
compressed_gb = (n_vectors * M + nlist * dim * 4) / 1e9
print(f"Memory with IVF-PQ: {compressed_gb:.1f} GB")

index_ivfpq.nprobe = 64
query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)
scores, indices = index_ivfpq.search(query, k=10)

The compression-recall trade-off: PQ introduces quantization error. At $m=96$ (96 bytes per 1536-dim vector), recall@10 typically drops from ~0.99 (exact) to ~0.85. At $m=192$ (192 bytes), recall is ~0.92. The right $m$ depends on your quality requirements.

ScaNN: Anisotropic Quantization

Scalable Nearest Neighbors (ScaNN) was developed at Google (2020) and is used in production at Google Search scale. The key insight: when you're looking for the top- $k$ neighbors, you care most about quantizing the vectors that have the highest dot product with the query - not the "average" vector. Standard PQ treats all vectors equally; ScaNN applies anisotropic quantization that prioritizes accuracy for high-similarity vectors.

ScaNN achieves 10-20% better recall at the same memory budget compared to standard IVF-PQ on most benchmarks. It's available as the scann Python package:

import scann
import numpy as np

n_vectors = 1_000_000
dim = 1536
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
# ScaNN uses MIPS (Maximum Inner Product Search) - normalize for cosine
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(vectors, 10, "dot_product")
    .tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # key ScaNN innovation
    .reorder(100)  # rerank top-100 with exact scores
    .build()
)

query = np.random.randn(1, dim).astype(np.float32)
query = query / np.linalg.norm(query)
neighbors, distances = searcher.search_batched(query)

ScaNN is the right choice when you need maximum performance at fixed memory budget and don't mind the Google-ecosystem dependency.

DiskANN: Beyond-RAM Billion-Scale Search

DiskANN (Microsoft, 2019) solves the problem that even IVF-PQ can't fully address: datasets that are too large to fit their index in RAM at all.

DiskANN builds an HNSW-like graph but stores vectors on SSD (NVMe). The graph structure is carefully designed to minimize random disk accesses during search - it batches disk reads for graph neighbors, exploiting NVMe's high random-read IOPS.

At 1 billion vectors at 1536 dimensions:

Full float32: 6 TB RAM - impossible
IVF-PQ (96 bytes): 96 GB RAM - expensive
DiskANN: ~100 GB SSD + ~10 GB RAM (for compressed graph in RAM) - feasible

DiskANN is available as diskann-py or through Microsoft's AzureSearch. It's the production solution at Microsoft-scale document retrieval.

The ann-benchmarks.com Reference

ann-benchmarks.com is the canonical reference for ANN algorithm comparisons. It benchmarks algorithms across multiple real-world datasets at various recall levels, reporting queries-per-second (QPS) vs recall@10.

How to read the benchmark:

Each point on the QPS vs recall curve represents a specific parameter setting (e.g., different efSearch values for HNSW)
You want algorithms that are furthest to the upper-right (high QPS at high recall)
HNSW consistently dominates: at recall@10 = 0.95, HNSW achieves 10-100x higher QPS than IVF-Flat on most datasets

The key insight from benchmarks: The right algorithm depends on your index size and latency requirements:

Scenario	Recommended Algorithm
Under 500K vectors	HNSW-Flat
500K - 50M, memory constrained	IVF-PQ
50M - 1B, memory not constrained	HNSW-PQ or ScaNN
1B+, won't fit in RAM	DiskANN
Maximum accuracy, GPU available	Flat + GPU

FAISS HNSW vs IVF Comparison

import faiss
import numpy as np
import time

def benchmark_index(index, vectors, queries, k=10, name=""):
    """Measure recall@k and QPS vs exact search."""
    # Build flat index for ground truth
    flat = faiss.IndexFlatIP(vectors.shape[1])
    flat.add(vectors)
    _, exact_ids = flat.search(queries, k)

    # Query the approximate index
    t0 = time.time()
    n_warmup = 10
    for q in queries[:n_warmup]:
        index.search(q.reshape(1, -1), k)
    warmup_time = time.time() - t0

    t0 = time.time()
    _, approx_ids = index.search(queries, k)
    total_time = time.time() - t0

    # Compute recall@k
    recalls = [
        len(set(e) & set(a)) / k
        for e, a in zip(exact_ids, approx_ids)
    ]
    mean_recall = np.mean(recalls)
    qps = len(queries) / total_time

    print(f"{name}: Recall@{k}={mean_recall:.3f}, QPS={qps:.0f}, "
          f"Latency={(1000/qps):.1f}ms")

n = 500_000
d = 128  # use lower dim for fast demo
rng = np.random.default_rng(42)
data = rng.standard_normal((n, d)).astype(np.float32)
faiss.normalize_L2(data)
queries = rng.standard_normal((100, d)).astype(np.float32)
faiss.normalize_L2(queries)

# HNSW
hnsw = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efConstruction = 128
hnsw.hnsw.efSearch = 64
hnsw.add(data)
benchmark_index(hnsw, data, queries, name="HNSW(M=32, ef=64)")

# IVF-Flat
nlist = 512
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(data)
ivf.add(data)
ivf.nprobe = 32
benchmark_index(ivf, data, queries, name="IVF-Flat(nlist=512, nprobe=32)")

# IVF-PQ
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)
ivfpq.train(data)
ivfpq.add(data)
ivfpq.nprobe = 32
benchmark_index(ivfpq, data, queries, name="IVF-PQ(nlist=512, m=16, nprobe=32)")

Parameter Tuning Guide

Production Engineering Notes

Index persistence: FAISS indexes don't auto-persist. Always save after building: faiss.write_index(index, "my_index.faiss"). For production, version your index files alongside your vector data.

Index rebuild vs incremental update: HNSW supports efficient dynamic insertion - adding new vectors doesn't require a full rebuild. However, FAISS's HNSW implementation requires a full rebuild for deletions (it doesn't support true deletes). IVF-based indexes also require retraining if your data distribution changes significantly. Plan a periodic rebuild schedule.

Multi-GPU scaling: FAISS supports GPU acceleration and multi-GPU with faiss.StandardGpuResources(). A single A100 GPU can search 1M 1536-dim vectors in under 5ms, versus 50-200ms on CPU. For latency-critical applications with millions of vectors, GPU FAISS is transformative.

Thread safety: FAISS indexes are not thread-safe for concurrent adds. For concurrent reads (queries), they're safe. Use read-write locking or index swapping (build new index offline, swap atomically) for production update patterns.

Common Mistakes

:::danger Building HNSW Without Normalization for Cosine Similarity FAISS HNSW with METRIC_INNER_PRODUCT correctly computes cosine similarity only on L2-normalized vectors. On unnormalized vectors, it computes raw dot product, which correlates with vector magnitude, not semantic similarity. Always normalize before adding to the index and normalize query vectors at search time. Use faiss.normalize_L2(vectors) in-place. :::

:::danger Using ef_construction = ef_Search Value ef_construction controls build quality - how carefully neighbors are selected during index construction. efSearch controls search quality - how many candidates are explored during query. Setting efSearch > ef_construction is wasteful: you can't recover graph quality at query time that wasn't built in during construction. Always build with high ef_construction (128-200). Tune efSearch downward at query time to hit your latency budget. :::

:::warning Not Measuring Recall on Your Actual Data Algorithm benchmarks from ann-benchmarks.com use specific public datasets. Your data distribution may differ significantly. The correct approach: sample 1000+ representative queries from your actual query log with known relevant documents. Measure recall@10 for your specific index parameters. A 5% recall difference you ignored in benchmarks may be a 5% user satisfaction gap in production. :::

Interview Questions and Answers

Q: Explain how HNSW works at a high level. Why does it beat tree-based approaches in high dimensions?

A: HNSW builds a hierarchical graph where each node (vector) has connections at multiple layers. Top layers have long-range connections to globally similar vectors; lower layers have dense local connections. Search starts at the top layer, greedily moves toward the query, drops to the next layer, and repeats. This exploits the "small-world" property - any two vectors can be reached in $O(\log n)$ hops. Tree-based approaches (KD-trees, ball trees) fail in high dimensions because axis-aligned splits become uninformative - in 1536 dimensions, orthogonal projections don't separate similar from dissimilar vectors. HNSW avoids this by learning connections directly in the original space rather than projecting. The key parameters are M (connections per node, controlling memory and recall) and efSearch (search beam width, controlling the recall-latency trade-off).

Q: What is product quantization and what does it trade for memory savings?

A: PQ divides the $d$ -dimensional vector space into $m$ subspaces of $d/m$ dimensions each. Each subspace is independently quantized using a codebook of $k^*$ centroids (typically 256). A vector is encoded as $m$ codebook indices - requiring $m$ bytes instead of $d \times 4$ bytes for float32. The compression ratio is $4d / m$ - for $d=1536$ , $m=96$ : 64x compression. The trade-off: quantization error. The reconstructed vector is not identical to the original; it's the concatenation of the nearest codebook centroids in each subspace. Similarity computed on quantized vectors has approximation error relative to exact similarity. At $m = d/16$ (e.g., 96 for 1536-dim), typical recall@10 drops by 5-10% compared to exact. At $m = d/8$ , the drop is 1-3%. PQ is always used with IVF (IVF-PQ), where the coarse cluster selection provides additional precision.

Q: What's the difference between nlist and nprobe in an IVF index?

A: nlist is set at index build time: the number of k-means clusters the vector space is partitioned into. A good starting point is nlist = int(4 * sqrt(n)) for $n$ vectors. nprobe is set at query time: the number of clusters to search. With nlist=1024 and nprobe=32, each query searches 32/1024 = 3% of the vector space. Higher nprobe = higher recall + higher latency. This is the key recall-latency knob for IVF indexes. If nprobe = nlist, IVF degrades to a linear scan with centroid lookup overhead. In practice: set nlist to achieve cluster sizes of 1000-10000 vectors, then tune nprobe until you hit your recall target within your latency budget.

Q: When would you choose IVF-PQ over HNSW?

A: IVF-PQ is preferable when: (1) memory is the binding constraint - IVF-PQ compresses vectors 10-100x (60x for 1536-dim at 96 bytes/vector vs 6144 bytes float32), while HNSW stores full vectors; (2) you have 50M+ vectors that don't fit in RAM as an HNSW index; (3) build time matters - HNSW construction is $O(n \log n)$ with significant constant factors; IVF training is a one-time k-means, and adding vectors is fast. Choose HNSW when memory is not constrained and you want maximum recall per millisecond - HNSW consistently outperforms IVF-PQ at the same recall level in terms of QPS on ann-benchmarks. The practical rule: HNSW for under 50M vectors with adequate RAM; IVF-PQ for larger scale.

Q: Your ANN index achieves 95% Recall@10. What does this mean and is it good enough for production RAG?

A: Recall@10 of 0.95 means that 95% of the time, all 10 of the true nearest neighbors appear in the returned top-10 results. The other 5% of the time, at least one of the true top-10 is replaced by a slightly worse candidate. For RAG, this is generally excellent. Consider: if your embedding model correctly orders the relevant document in the top 3 for 90% of queries, and your ANN has 95% recall, the combined miss rate is roughly 5-10% of queries where ANN error causes a relevant document to be missed. This is typically far smaller than other error sources (embedding model limitations, chunking artifacts, query-document mismatch). The rare cases where the exact vs approximate difference matters can be caught by adding a reranking step, which reorders candidates using a more expensive model. In practice, tuning for 95-99% Recall@10 is the standard production target for retrieval-critical applications.

Building a FAISS Index Manager for Production

A complete production wrapper around FAISS with persistence, versioning, and recall measurement:

import faiss
import numpy as np
import pickle
import os
from pathlib import Path
from typing import List, Tuple, Optional
import time

class FAISSIndexManager:
    """
    Production FAISS index manager with persistence, versioning,
    and recall measurement capabilities.
    """

    def __init__(
        self,
        dim: int = 1536,
        index_type: str = "hnsw",  # "hnsw", "ivf", "ivfpq", "flat"
        m: int = 32,                # HNSW: connections per layer
        ef_construction: int = 128, # HNSW: build quality
        nlist: int = 1024,          # IVF: number of clusters
        m_pq: int = 96,             # IVF-PQ: number of PQ subspaces
        storage_path: str = "./faiss_index",
    ):
        self.dim = dim
        self.index_type = index_type
        self.storage_path = Path(storage_path)
        self.storage_path.mkdir(exist_ok=True)
        self.metadata: List[dict] = []  # parallel array of metadata

        # Build the appropriate index
        if index_type == "flat":
            self.index = faiss.IndexFlatIP(dim)
        elif index_type == "hnsw":
            self.index = faiss.IndexHNSWFlat(dim, m, faiss.METRIC_INNER_PRODUCT)
            self.index.hnsw.efConstruction = ef_construction
        elif index_type == "ivf":
            quantizer = faiss.IndexFlatIP(dim)
            self.index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
        elif index_type == "ivfpq":
            quantizer = faiss.IndexFlatIP(dim)
            self.index = faiss.IndexIVFPQ(quantizer, dim, nlist, m_pq, 8)
        else:
            raise ValueError(f"Unknown index type: {index_type}")

        # Query-time parameters
        if index_type == "hnsw":
            self.index.hnsw.efSearch = 64
        elif index_type in ("ivf", "ivfpq"):
            self._nprobe = 32

    def train(self, vectors: np.ndarray):
        """Train IVF-based indexes. Not needed for HNSW or Flat."""
        if hasattr(self.index, 'train') and not self.index.is_trained:
            print(f"Training {self.index_type} index on {len(vectors)} vectors...")
            t0 = time.time()
            self.index.train(vectors)
            print(f"Training complete in {time.time()-t0:.1f}s")
            if self.index_type in ("ivf", "ivfpq"):
                self.index.nprobe = self._nprobe

    def add(self, vectors: np.ndarray, metadata: Optional[List[dict]] = None):
        """Add normalized vectors to the index."""
        assert vectors.shape[1] == self.dim, f"Expected dim {self.dim}, got {vectors.shape[1]}"
        # Ensure normalized
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        vectors = vectors / np.clip(norms, 1e-8, None)
        self.index.add(vectors)
        if metadata:
            self.metadata.extend(metadata)
        else:
            self.metadata.extend([{}] * len(vectors))

    def search(
        self,
        query: np.ndarray,
        k: int = 10,
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Search for k nearest neighbors. Returns (scores, indices)."""
        if query.ndim == 1:
            query = query.reshape(1, -1)
        # Normalize query
        norms = np.linalg.norm(query, axis=1, keepdims=True)
        query = query / np.clip(norms, 1e-8, None)
        return self.index.search(query, k)

    def set_ef_search(self, ef: int):
        """Adjust HNSW search quality at query time."""
        if self.index_type == "hnsw":
            self.index.hnsw.efSearch = ef

    def set_nprobe(self, nprobe: int):
        """Adjust IVF search breadth at query time."""
        if self.index_type in ("ivf", "ivfpq"):
            self.index.nprobe = nprobe

    def measure_recall(
        self,
        test_vectors: np.ndarray,
        k: int = 10,
        n_samples: int = 100,
    ) -> float:
        """Measure Recall@k vs exact search on a sample of vectors."""
        flat = faiss.IndexFlatIP(self.dim)
        flat.add(self.index.reconstruct_n(0, self.index.ntotal)
                 if hasattr(self.index, 'reconstruct_n') else test_vectors)

        sample_idx = np.random.choice(len(test_vectors), min(n_samples, len(test_vectors)), replace=False)
        samples = test_vectors[sample_idx]

        _, exact_ids = flat.search(samples, k)
        _, approx_ids = self.search(samples, k)

        recalls = [
            len(set(e) & set(a)) / k
            for e, a in zip(exact_ids, approx_ids)
        ]
        return float(np.mean(recalls))

    def save(self, name: str = "index"):
        """Persist index and metadata to disk."""
        faiss.write_index(self.index, str(self.storage_path / f"{name}.faiss"))
        with open(self.storage_path / f"{name}_metadata.pkl", "wb") as f:
            pickle.dump(self.metadata, f)
        print(f"Saved {self.index.ntotal} vectors to {self.storage_path}/{name}")

    def load(self, name: str = "index"):
        """Load index from disk."""
        self.index = faiss.read_index(str(self.storage_path / f"{name}.faiss"))
        with open(self.storage_path / f"{name}_metadata.pkl", "rb") as f:
            self.metadata = pickle.load(f)
        if self.index_type in ("ivf", "ivfpq"):
            self.index.nprobe = self._nprobe
        print(f"Loaded {self.index.ntotal} vectors from {self.storage_path}/{name}")

    @property
    def size(self) -> int:
        return self.index.ntotal

    def info(self):
        print(f"Index type: {self.index_type}")
        print(f"Dimension: {self.dim}")
        print(f"Vectors: {self.size}")
        memory_estimate_mb = self.size * self.dim * 4 / 1e6
        print(f"Memory estimate (float32): {memory_estimate_mb:.0f} MB")


# Example usage
manager = FAISSIndexManager(
    dim=1536,
    index_type="hnsw",
    m=32,
    ef_construction=128,
    storage_path="./my_index"
)

# Generate and add vectors
rng = np.random.default_rng(42)
vectors = rng.standard_normal((10000, 1536)).astype(np.float32)
manager.add(vectors, metadata=[{"id": i} for i in range(10000)])

# Search
query = rng.standard_normal(1536).astype(np.float32)
scores, indices = manager.search(query, k=10)

# Tune ef for different recall/latency trade-offs
for ef in [16, 32, 64, 128]:
    manager.set_ef_search(ef)
    t0 = time.time()
    scores, indices = manager.search(query, k=10)
    latency = (time.time() - t0) * 1000
    print(f"efSearch={ef}: {latency:.2f}ms")

manager.save("production_v1")
manager.info()

The Memory-Recall-Latency Cube

ANN index selection is a three-way trade-off. Visualizing it helps make decisions:

Configuration	Memory (1M × 1536-dim)	Recall@10	Latency
Flat (exact)	6 GB	1.00	200ms
HNSW M=8	6.5 GB	0.88	2ms
HNSW M=32	7.5 GB	0.97	5ms
HNSW M=64	9 GB	0.99	9ms
IVF-Flat nprobe=16	6 GB	0.82	8ms
IVF-Flat nprobe=64	6 GB	0.94	30ms
IVF-PQ m=96 nprobe=32	200 MB	0.83	5ms
IVF-PQ m=192 nprobe=64	350 MB	0.91	10ms

For most RAG applications: HNSW with M=32 and efSearch=64 is the starting point. It offers excellent recall (97%) with fast query times and manageable memory. Only deviate from this when your constraints specifically require it.

Index Warm-Up and Cold Start Mitigation

HNSW indexes must be loaded into RAM to serve queries. After a service restart, the first few queries may be slow as the OS page cache warms. Mitigate this:

import faiss
import numpy as np
import time

class WarmableIndexManager:
    """HNSW index manager with warm-up on load."""

    def __init__(self, index_path: str):
        self.index_path = index_path
        self.index = None

    def load_and_warm(self, n_warmup_queries: int = 100):
        """Load index and run warm-up queries to populate OS page cache."""
        print(f"Loading index from {self.index_path}...")
        t0 = time.time()
        self.index = faiss.read_index(self.index_path)
        load_time = time.time() - t0
        print(f"Loaded in {load_time:.1f}s ({self.index.ntotal} vectors)")

        if self.index.ntotal == 0:
            return

        # Run warm-up queries with random vectors
        print(f"Warming up with {n_warmup_queries} queries...")
        dim = self.index.d
        rng = np.random.default_rng(42)
        warmup_queries = rng.standard_normal((n_warmup_queries, dim)).astype(np.float32)
        faiss.normalize_L2(warmup_queries)

        t0 = time.time()
        self.index.search(warmup_queries, 10)
        warmup_time = time.time() - t0
        print(f"Warm-up complete in {warmup_time:.1f}s")
        print(f"Expected query latency: {warmup_time*1000/n_warmup_queries:.1f}ms")

    def search(self, query: np.ndarray, k: int = 10):
        if query.ndim == 1:
            query = query.reshape(1, -1)
        norm = np.linalg.norm(query)
        if norm > 0:
            query = query / norm
        return self.index.search(query, k)

ANN Index Monitoring in Production

Once deployed, monitor your ANN index health continuously:

import time
import statistics
from typing import List, Optional
from collections import deque
import numpy as np

class ANNIndexMonitor:
    """
    Monitor ANN index query latency and recall in production.
    Detects degradation and triggers alerts.
    """

    def __init__(
        self,
        index,
        exact_index,  # Flat index for recall computation
        window_size: int = 1000,
        latency_alert_ms: float = 50.0,
        recall_alert_threshold: float = 0.85,
    ):
        self.index = index
        self.exact_index = exact_index
        self.latency_window = deque(maxlen=window_size)
        self.recall_window = deque(maxlen=100)
        self.latency_alert_ms = latency_alert_ms
        self.recall_alert_threshold = recall_alert_threshold

    def search_with_monitoring(
        self,
        query: np.ndarray,
        k: int = 10,
        check_recall: bool = False,
    ) -> tuple:
        """Search while collecting latency metrics."""
        if query.ndim == 1:
            query = query.reshape(1, -1)

        t0 = time.perf_counter()
        scores, indices = self.index.search(query, k)
        latency_ms = (time.perf_counter() - t0) * 1000

        self.latency_window.append(latency_ms)

        # Periodically check recall (expensive - only 1% of queries)
        if check_recall and len(self.latency_window) % 100 == 0:
            _, exact_indices = self.exact_index.search(query, k)
            recall = len(set(indices[0]) & set(exact_indices[0])) / k
            self.recall_window.append(recall)

        return scores, indices

    def get_metrics(self) -> dict:
        """Compute current performance metrics."""
        if not self.latency_window:
            return {}

        latencies = list(self.latency_window)
        metrics = {
            "p50_latency_ms": statistics.median(latencies),
            "p95_latency_ms": np.percentile(latencies, 95),
            "p99_latency_ms": np.percentile(latencies, 99),
            "mean_latency_ms": statistics.mean(latencies),
            "query_count": len(latencies),
        }

        if self.recall_window:
            metrics["mean_recall_at_k"] = np.mean(list(self.recall_window))
            metrics["recall_samples"] = len(self.recall_window)

        # Alerts
        metrics["latency_alert"] = metrics["p95_latency_ms"] > self.latency_alert_ms
        if "mean_recall_at_k" in metrics:
            metrics["recall_alert"] = metrics["mean_recall_at_k"] < self.recall_alert_threshold

        return metrics

    def report(self):
        """Print a formatted performance report."""
        m = self.get_metrics()
        print(f"ANN Index Performance ({m.get('query_count', 0)} queries):")
        print(f"  P50 latency: {m.get('p50_latency_ms', 0):.1f}ms")
        print(f"  P95 latency: {m.get('p95_latency_ms', 0):.1f}ms")
        print(f"  P99 latency: {m.get('p99_latency_ms', 0):.1f}ms")
        if "mean_recall_at_k" in m:
            print(f"  Recall@10: {m['mean_recall_at_k']:.3f}")
        if m.get("latency_alert"):
            print("  ALERT: P95 latency exceeds threshold!")
        if m.get("recall_alert"):
            print("  ALERT: Recall falling below threshold!")

Summary: Choosing Your ANN Algorithm

The ANN algorithm decision tree for RAG systems is straightforward:

Start with HNSW. It is the best-performing algorithm for most RAG workloads under 50M vectors with adequate RAM. Use M=32 and efConstruction=128 for building; tune efSearch between 32 and 128 to hit your recall/latency target.
Switch to IVF-PQ only if memory is the binding constraint. At 100M+ vectors, HNSW may require more RAM than you have. IVF-PQ compresses 64x at modest recall cost.
Use DiskANN for billion-scale datasets. If your corpus is too large to fit even compressed vectors in RAM, DiskANN enables NVMe-resident indexes with acceptable latency.
Always measure recall@10 vs your exact-search baseline before deploying any ANN index to production. A 90% recall index means 10% of queries miss a relevant document that should have been retrieved.
Monitor in production. Query latency and recall can degrade as the index accumulates updates. Schedule periodic index rebuilds during low-traffic windows.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the RAG Pipeline demo on the EngineersOfAI Playground - no code required.

:::

The 3-Second Query That Couldn't Ship​

Why This Exists: The Curse of Dimensionality​

Exact Nearest Neighbor: The Baseline​

HNSW: The Dominant Algorithm​

The Small-World Graph Intuition​

Construction​

Search​

The ef vs Recall/Latency Trade-Off​

IVF: Cluster-Based Retrieval​

IVF-PQ: Memory Compression with Product Quantization​

ScaNN: Anisotropic Quantization​

DiskANN: Beyond-RAM Billion-Scale Search​

The ann-benchmarks.com Reference​

FAISS HNSW vs IVF Comparison​

Parameter Tuning Guide​

Production Engineering Notes​

Common Mistakes​

Interview Questions and Answers​

Building a FAISS Index Manager for Production​

The Memory-Recall-Latency Cube​

Index Warm-Up and Cold Start Mitigation​

ANN Index Monitoring in Production​

Summary: Choosing Your ANN Algorithm​