:::tip 🎮 Interactive Playground Visualize this concept: Try the Vector Search Explorer demo on the EngineersOfAI Playground - no code required. :::

Vector Search in Practice

Reading time: 45–55 minutes Interview relevance: Very High - vector search and ANN indexing are core topics in AI infrastructure system design interviews Target roles: AI Engineer, ML Engineer, Backend Engineer (AI), Platform Engineer, MLOps Engineer

From 50ms to 8 Seconds

The team's demo was flawless. Their RAG system answered questions about their legal document corpus with stunning accuracy. The retrieval was fast - under 50 milliseconds - and the system felt snappy and responsive. Investors were impressed. The team shipped to production.

Three weeks later, users started complaining. Query responses were taking 6, 7, sometimes 8 seconds. The system had not changed. The number of users had not dramatically increased. But the document corpus had grown - from the 10,000 documents in the demo to just over 1 million documents ingested over three weeks as the team backfilled their entire document archive.

The engineering team pulled up the profiler. The embedding step was 40ms. The Claude generation step was 1,200ms. The vector search step? 6,800ms. That was where the time went.

The root cause was simple and embarrassing in retrospect: they had been running exact nearest neighbor search - a brute-force dot product computation between the query embedding and every single document embedding in the database. At 10,000 documents, the computation took 4ms. At 1 million documents, each query required 1 million dot product operations across 1024-dimensional vectors - roughly 1 billion floating-point multiplications - and that takes over 6 seconds on a single CPU.

The fix was not to buy faster servers. The fix was to stop searching exactly and start searching approximately. Approximate Nearest Neighbor (ANN) indexing trades a small amount of recall (you might miss 2–5% of the truly nearest vectors) for a dramatic speed improvement - from O(n) brute force to O(log n) indexed search. The same query that took 6.8 seconds under brute force takes 8ms with an HNSW index.

The team rebuilt their vector index using HNSW, re-ran their retrieval quality benchmarks (the recall drop was under 2%), and deployed. Query latency dropped to 52ms. The lesson cost them three weeks of user-facing degradation and one uncomfortable all-hands explanation. This lesson will cost you only the next 45 minutes.

Vector Search Fundamentals

Exact vs. Approximate: The Core Trade-off

Exact nearest neighbor (ENN) search finds the provably correct top-k most similar vectors. It guarantees 100% recall. It achieves this by computing similarity against every vector in the database. The time complexity is $O(n \cdot d)$ where $n$ is the number of vectors and $d$ is the number of dimensions.

At small scale this is fine:

10,000 vectors, 1024 dims: ~40ms per query (manageable)
100,000 vectors, 1024 dims: ~400ms per query (borderline)
1,000,000 vectors, 1024 dims: ~4,000ms per query (unacceptable)
10,000,000 vectors, 1024 dims: ~40,000ms per query (catastrophic)

Approximate nearest neighbor (ANN) search builds an index data structure during ingestion that enables sub-linear query time. You trade a small amount of recall for dramatic speed gains:

1,000,000 vectors, HNSW index: 5–20ms per query at 95–99% recall
Same data, brute force: 4,000ms per query at 100% recall

For RAG applications, 95% recall is excellent - missing 5% of the truly nearest vectors has negligible impact on answer quality, because you are retrieving top-5 from millions of candidates, and any of the top 20–30 would likely produce a good answer anyway.

Distance Metrics: Which One and When

Cosine similarity: Measures the angle between vectors. Range: -1 to 1 (higher = more similar). Use when: your vectors may have different magnitudes (raw embeddings before normalization). Most natural for semantic similarity.

Dot product: Equivalent to cosine similarity for unit-normalized vectors. Computationally faster than cosine. Use when: your embedding model outputs normalized vectors (most modern models do). The recommended default for normalized embeddings.

Euclidean distance (L2): Measures the straight-line distance between two points in space. Lower = more similar. Avoid for high-dimensional embedding vectors - suffers from the curse of dimensionality. Use L2 only when working with structured feature vectors (not embeddings) where magnitude carries meaning.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle-based similarity. -1 to 1. Higher = more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """For normalized vectors: identical to cosine similarity, faster."""
    return float(np.dot(a, b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Magnitude-based distance. Lower = more similar. Avoid for embeddings."""
    return float(np.linalg.norm(a - b))

# For normalized embeddings, cosine == dot product
a = np.array([0.6, 0.8])                    # Unit vector
b = np.array([0.8, 0.6])                    # Another unit vector
print(f"Cosine:     {cosine_similarity(a, b):.4f}")
print(f"Dot product:{dot_product_similarity(a, b):.4f}")  # Same result

Top-k: What k Should You Use?

When you retrieve top-k chunks from vector search, you are deciding how much context to pass to Claude. The trade-offs:

k=3: Minimal context, fastest generation, lowest token cost, higher risk of missing the relevant chunk
k=5: Good default for focused domains with high-quality chunking
k=10: Better coverage for broad questions, more tokens, slightly slower generation
k=20: Pre-reranking retrieval - retrieve widely and rerank down to top-5 (covered in the next lesson)

For most RAG systems, retrieve k=10–20 from the vector database and then rerank down to k=5 for the generation step. This two-stage approach gives you breadth in retrieval and precision in generation.

ANN Index Types: A Deep Dive

HNSW: The Modern Default

Hierarchical Navigable Small World (HNSW) is the dominant ANN algorithm for production vector search. It was introduced by Malkov and Yashunin in 2018 and is now the default index type in Qdrant, Weaviate, and pgvector, and is available in Pinecone and FAISS.

How HNSW works (intuition): Imagine you are looking for a house in an unfamiliar city. You would not knock on every door - you would start by looking at a map of the city (high level), navigate to the right neighborhood, then the right street, then search nearby houses. HNSW does this with vectors: it builds a multi-layer graph where upper layers have fewer, more spread-out nodes connected to far-away neighbors, and lower layers have dense local connections. A query starts at the top (coarse navigation) and descends to the bottom (fine-grained search).

HNSW build parameters:

M - the number of bidirectional connections each node has to its neighbors. Higher M = more connections = better recall but more memory and slower build time. Typical range: 16–64. Default: 16–32.

ef_construction - how many neighbor candidates are explored during the index build phase. Higher = better index quality but slower build. Typical range: 64–400. Default: 100–200.

HNSW query parameters:

ef_search (also called ef) - how many candidates to explore during search. This is the primary recall vs. latency knob. Higher = better recall, slower query. Can be tuned at query time without rebuilding the index.

ef_search=16:   fast queries, ~85-90% recall
ef_search=64:   balanced, ~95-97% recall
ef_search=200:  slow queries, ~99%+ recall
ef_search=1000: very slow, ~99.9% recall (approaching exact)

HNSW characteristics:

Query time: O(log n) average
Memory: ~4-8 bytes per vector dimension plus graph structure (roughly 1.5-2x raw vector data)
Build time: slow for very large datasets (hours for 100M+ vectors)
Strength: best recall at low latency for medium datasets (up to ~100M vectors)
Weakness: high memory usage, slow incremental updates

IVF: Inverted File Index

Inverted File Index (IVF) takes a different approach. During build time, it clusters the vector space into nlist clusters using k-means. Each vector is assigned to its nearest cluster centroid. At query time, instead of searching all vectors, it searches only the vectors in the nprobe most relevant clusters.

How IVF works (intuition): Imagine the same city search, but instead of a layered map, someone gave you a city directory organized by neighborhood. You find the k nearest neighborhoods to your target and only search those neighborhoods.

IVF parameters:

nlist - number of clusters to partition the vector space into. Higher = more precise cluster boundaries but smaller clusters (less to search per cluster). Rule of thumb: nlist ≈ sqrt(n) for n vectors. For 1M vectors: nlist=1000.

nprobe - how many clusters to search at query time. This is the recall vs. latency knob:

nprobe=1:    fastest, lowest recall (~40-60%)
nprobe=10:   balanced for many datasets (~85-90% recall)
nprobe=nlist: exact search (defeats the purpose)

When IVF beats HNSW:

Very large datasets (100M+ vectors) where HNSW memory becomes prohibitive
When you need to fit the index on disk rather than RAM
When combined with Product Quantization (IVF+PQ) for extreme compression

Product Quantization (PQ): Compression for Scale

Product Quantization compresses vectors to use dramatically less memory at the cost of some accuracy. It works by:

Splitting the vector into m sub-vectors
Independently clustering each sub-vector into k centroids (codebook)
Storing only the centroid index (typically 1 byte) instead of the full sub-vector floats

A 1024-dim float32 vector uses 4096 bytes. With PQ (m=64 sub-vectors, 256 centroids), the same information takes 64 bytes - a 64x compression ratio. This allows you to fit 100M+ vectors into RAM that would otherwise require terabytes.

IVF-PQ: Combining IVF clustering with PQ compression is the standard approach for 100M+ vector datasets. You trade significant recall (typically 80-90% vs. 95-99% for HNSW) for 64x memory reduction and 100x cost reduction at scale.

Flat Index: When Exact Search Makes Sense

The "flat" index stores raw vectors and does exact brute-force search. Use it when:

Dataset is small (under 100K vectors)
You need 100% guaranteed recall for a compliance or safety requirement
You are in a development/testing environment and want a correctness baseline
The cost of approximate search (in rare missed retrievals) exceeds the cost of brute force

import numpy as np

class FlatVectorIndex:
    """
    Exact nearest neighbor search - the baseline you compare ANN against.
    Correct but O(n*d) per query. Fine for < 100K vectors.
    """

    def __init__(self, dimensions: int):
        self.dimensions = dimensions
        self.vectors: list[np.ndarray] = []
        self.ids: list[str] = []

    def add(self, vector_id: str, vector: np.ndarray) -> None:
        assert len(vector) == self.dimensions
        self.vectors.append(vector.astype(np.float32))
        self.ids.append(vector_id)

    def search(self, query: np.ndarray, k: int) -> list[tuple[str, float]]:
        """Brute-force exact search. Returns (id, score) pairs."""
        if not self.vectors:
            return []

        # Stack all vectors into a matrix for vectorized dot product
        matrix = np.stack(self.vectors, axis=0)         # (n, d)
        q = query.astype(np.float32)                     # (d,)
        scores = matrix @ q                              # (n,) dot products

        # Get top-k indices by score
        k = min(k, len(self.vectors))
        top_indices = np.argpartition(scores, -k)[-k:]
        top_indices = top_indices[np.argsort(scores[top_indices])[::-1]]

        return [(self.ids[i], float(scores[i])) for i in top_indices]

Vector Database Landscape

Choosing a vector database is a significant architectural decision. Here is how the major options compare on the dimensions that matter.

Pinecone

Architecture: Fully managed, serverless and pod-based tiers. You never manage infrastructure.

Strengths: Easiest to get started. Strong managed scaling. Well-documented SDKs. Good integration with LangChain, LlamaIndex.

Weaknesses: Vendor lock-in. Pricing at scale can be significant (managed premium). Limited control over index parameters. Cannot run locally.

When to choose Pinecone: Your team lacks infrastructure expertise and wants to ship fast. You have a moderate-scale dataset (up to ~100M vectors). Budget is not the primary constraint.

Pricing model: Serverless charges per query and per vector. Pod-based is a fixed hourly rate per pod tier. At very high scale, calculate carefully - managed costs can exceed self-hosted costs by 5–10x.

Qdrant

Architecture: Open-source Rust-based vector database. Can self-host or use Qdrant Cloud (managed). HNSW as primary index with excellent filtering support.

Strengths: Best performance-per-dollar for self-hosted deployments. Rich filtering capabilities with minimal recall penalty. Excellent Rust-based performance. Active development. Good Python client.

Weaknesses: More operational complexity than Pinecone when self-hosted. Smaller managed ecosystem.

When to choose Qdrant: Performance matters, budget matters, your team can handle self-hosted infrastructure. The best choice for teams comfortable with Docker/Kubernetes who want control.

Production configuration example:

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")

# Create collection with production HNSW settings
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1024,
        distance=Distance.COSINE,
    ),
    hnsw_config=HnswConfigDiff(
        m=32,                  # More connections for better recall
        ef_construct=200,      # Better index quality during build
        full_scan_threshold=10000,  # Use brute force for small searches
    ),
)

Weaviate

Architecture: Open-source with managed cloud option. Multi-modal. Built-in hybrid search (dense + sparse) as a native feature.

Strengths: Best native hybrid search - BM25 + dense retrieval built in without extra configuration. Multi-modal support (text + images). GraphQL-based query interface is powerful for complex queries.

Weaknesses: More complex to configure than Qdrant or Pinecone. GraphQL adds learning curve. Memory usage is higher than Qdrant.

When to choose Weaviate: You need hybrid search out of the box. You have multi-modal data. Your team prefers a higher-level query abstraction.

pgvector

Architecture: PostgreSQL extension that adds a vector column type and ANN index support (HNSW and IVF). Queries run inside your existing PostgreSQL database.

Strengths: No new infrastructure - uses your existing PostgreSQL. Vector search and SQL filtering in the same query. ACID transactions. Perfect for teams that already run PostgreSQL and have moderate scale needs.

Weaknesses: Not designed for billion-scale vector search. Slower than purpose-built vector databases at high scale. ANN index quality (especially with IVF) can lag behind Qdrant/Pinecone.

When to choose pgvector: Your dataset is under 10M vectors. You want to do vector search alongside relational queries. Your team runs PostgreSQL and wants minimal operational complexity.

-- pgvector setup
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id         SERIAL PRIMARY KEY,
    content    TEXT NOT NULL,
    embedding  vector(1024),
    metadata   JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast ANN search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 32, ef_construction = 200);

-- Search: find top 5 most similar to a given embedding
SELECT id, content, 1 - (embedding <=> '[0.1,0.2,...]'::vector) AS score
FROM documents
ORDER BY embedding <=> '[0.1,0.2,...]'::vector
LIMIT 5;

Chroma

Architecture: Open-source, embeds in-process (can run as a server). The simplest API of any vector database.

Strengths: Easiest to get started for local development. Zero configuration. Works in a Jupyter notebook.

Weaknesses: Not designed for production scale (limited to single-machine, single-process). No managed cloud option with enterprise guarantees. Basic filtering.

When to choose Chroma: Prototyping, local development, small-scale experiments, demos. Do not use for production workloads over 100K vectors.

Milvus

Architecture: Open-source, distributed, built for 100M+ vectors. Kubernetes-native. Multiple index types (HNSW, IVF, DiskANN).

Strengths: Designed for massive scale from the ground up. Disk-based ANN index (DiskANN) for datasets too large for RAM. Strong filtering. Mature ecosystem.

Weaknesses: Highest operational complexity. Kubernetes required. Significant engineering investment to operate well.

When to choose Milvus: You have 100M+ vectors. You have a dedicated infrastructure team. Self-hosted is required for cost or compliance reasons.

Metadata Filtering: A Deep Dive

In production RAG, you almost always need to filter results - by date, by user permissions, by document type, by organization. This is where many teams encounter unexpected complexity.

Pre-filtering vs. Post-filtering

Post-filtering (naive approach): Run the ANN search to get top-k results, then discard results that do not match the filter. Simple to implement. Fatal flaw: if the filter is selective (e.g., only 1% of documents match), you may get 0 results even though there are relevant documents in the filtered set, because the ANN search chose k results from the 99% that do not match.

Pre-filtering (correct approach): Only search over vectors that match the filter criteria. Correct by construction - every returned result satisfies the filter. The challenge: effective pre-filtering requires the vector database to know which vectors match the filter before doing the ANN search.

Modern vector databases handle pre-filtering in different ways:

Qdrant: Uses a "filtered HNSW" approach - it applies the filter during graph traversal, skipping non-matching nodes. Works very well for high-selectivity filters (1–50% of the dataset). For very selective filters (< 0.1%), falls back to filtered brute-force.

Weaviate: Pre-filters using an inverted index built on metadata fields, then searches only the matching subset.

Pinecone: Supports metadata filtering with vector search. Performance depends on filter selectivity.

pgvector: Metadata filtering happens in SQL - you can add a WHERE clause before the vector search. The PostgreSQL query planner handles the integration.

Common Filter Patterns

# Example filter patterns in Qdrant query format

# User permission filter (tenant isolation)
user_filter = {
    "must": [
        {"key": "user_id", "match": {"value": "user_abc123"}}
    ]
}

# Date range filter (recency)
date_filter = {
    "must": [
        {"key": "published_at", "range": {"gte": "2024-01-01"}}
    ]
}

# Multi-condition filter (document type + organization)
compound_filter = {
    "must": [
        {"key": "org_id", "match": {"value": "org_xyz"}},
        {"key": "doc_type", "match": {"any": ["policy", "procedure"]}}
    ],
    "must_not": [
        {"key": "is_archived", "match": {"value": True}}
    ]
}

Filtering Performance Rule of Thumb

Filter selectivity (% of corpus that passes the filter) dramatically affects performance:

Selectivity	Recommended Strategy
50–100%	Pre-filter via ANN (all major DBs handle this well)
10–50%	Pre-filter via ANN (test with your specific DB)
1–10%	Pre-filter may degrade - benchmark carefully
less than 1%	Consider a separate filtered sub-index, or brute force over the filtered set

Architecture and Diagrams

HNSW Structure Visualization

Vector DB Selection Decision Tree

Pre-filtering vs. Post-filtering Comparison

Production Code

The following implementation covers vector search fundamentals, a working RAG pipeline with metadata filtering, and Claude-powered answer generation.

"""
Vector Search in Practice
==========================
Covers:
  - VectorSearchConfig: ANN index configuration
  - MockVectorDB: in-memory ANN with HNSW-like parameters
  - SearchResult: typed retrieval output
  - ProductionVectorSearcher: retry, timeout, explain
  - RAGPipeline: embedding + search + Claude generation with citations

Install: pip install anthropic numpy
In production: pip install qdrant-client or pinecone-client
"""

import anthropic
import numpy as np
import time
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger(__name__)


# ─────────────────────────────────────────────
# Configuration
# ─────────────────────────────────────────────

@dataclass
class VectorSearchConfig:
    """
    Configuration for vector search behavior.
    In production: these map to your vector DB's collection settings.
    """
    dimensions: int = 1024
    distance_metric: str = "cosine"   # "cosine", "dot", "l2"
    index_type: str = "hnsw"          # "hnsw", "ivf", "flat"
    # HNSW build parameters
    hnsw_m: int = 32                  # Connections per node
    hnsw_ef_construction: int = 200   # Build-time search depth
    # HNSW query parameters
    hnsw_ef_search: int = 100         # Query-time search depth (recall knob)
    # IVF parameters (if using IVF)
    ivf_nlist: int = 256              # Number of clusters
    ivf_nprobe: int = 16              # Clusters to search per query


# ─────────────────────────────────────────────
# Data structures
# ─────────────────────────────────────────────

@dataclass
class DocumentChunk:
    """A searchable chunk with vector, text, and metadata."""
    chunk_id: str
    document_id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)
    # Common metadata fields:
    # source_url, doc_type, created_at, user_id, org_id, language


@dataclass
class SearchResult:
    """A single search result from the vector database."""
    chunk: DocumentChunk
    score: float           # Cosine similarity (0 to 1 for normalized vectors)
    rank: int              # Position in result list (1-indexed)

    @property
    def relevance_label(self) -> str:
        """Human-readable relevance estimate."""
        if self.score >= 0.85:
            return "Very High"
        elif self.score >= 0.70:
            return "High"
        elif self.score >= 0.55:
            return "Medium"
        elif self.score >= 0.40:
            return "Low"
        else:
            return "Very Low"


# ─────────────────────────────────────────────
# Mock Vector Database (production: use Qdrant/Pinecone/Weaviate)
# ─────────────────────────────────────────────

class MockVectorDB:
    """
    In-memory vector database with brute-force search.

    In production, replace this with a real vector database client:

    Qdrant:
        from qdrant_client import QdrantClient
        client = QdrantClient(url="http://localhost:6333")

    Pinecone:
        from pinecone import Pinecone
        pc = Pinecone(api_key="...")
        index = pc.Index("documents")

    Weaviate:
        import weaviate
        client = weaviate.connect_to_local()

    The interface (add, search, filtered_search) mirrors these clients.
    """

    def __init__(self, config: VectorSearchConfig):
        self.config = config
        self.chunks: list[DocumentChunk] = []
        self._indexing_time_ms: list[float] = []

    def add_documents(self, chunks: list[DocumentChunk]) -> None:
        """Add document chunks to the vector store."""
        t0 = time.time()
        for chunk in chunks:
            if len(chunk.embedding) != self.config.dimensions:
                raise ValueError(
                    f"Embedding dimension mismatch: "
                    f"expected {self.config.dimensions}, "
                    f"got {len(chunk.embedding)}"
                )
            self.chunks.append(chunk)
        elapsed = (time.time() - t0) * 1000
        self._indexing_time_ms.append(elapsed)
        logger.info(f"Indexed {len(chunks)} chunks in {elapsed:.1f}ms. Total: {len(self.chunks)}")

    def _compute_scores(self, query_embedding: list[float]) -> np.ndarray:
        """Vectorized cosine similarity computation."""
        if not self.chunks:
            return np.array([])

        q = np.array(query_embedding, dtype=np.float32)
        matrix = np.stack(
            [np.array(c.embedding, dtype=np.float32) for c in self.chunks],
            axis=0,
        )
        # Dot product of normalized vectors = cosine similarity
        scores = matrix @ q
        return scores

    def search(
        self,
        query_embedding: list[float],
        k: int = 5,
        score_threshold: float = 0.0,
    ) -> list[SearchResult]:
        """
        Search for the top-k most similar chunks.

        In production (Qdrant):
            results = client.search(
                collection_name="documents",
                query_vector=query_embedding,
                limit=k,
                score_threshold=score_threshold,
            )
        """
        scores = self._compute_scores(query_embedding)
        if len(scores) == 0:
            return []

        k_actual = min(k, len(scores))
        top_indices = np.argpartition(scores, -k_actual)[-k_actual:]
        top_indices = top_indices[np.argsort(scores[top_indices])[::-1]]

        results = []
        for rank, idx in enumerate(top_indices, 1):
            score = float(scores[idx])
            if score >= score_threshold:
                results.append(SearchResult(
                    chunk=self.chunks[idx],
                    score=score,
                    rank=rank,
                ))

        return results

    def filtered_search(
        self,
        query_embedding: list[float],
        metadata_filter: dict,
        k: int = 5,
        score_threshold: float = 0.0,
    ) -> list[SearchResult]:
        """
        Pre-filtered search: only search over chunks matching the filter.

        In production (Qdrant):
            from qdrant_client.models import Filter, FieldCondition, MatchValue
            results = client.search(
                collection_name="documents",
                query_vector=query_embedding,
                query_filter=Filter(
                    must=[FieldCondition(key="org_id", match=MatchValue(value=org_id))]
                ),
                limit=k,
            )

        This mock implements pre-filtering correctly - filter first, search second.
        Never filter after search (post-filtering misses relevant results).
        """
        # Step 1: Pre-filter to matching chunks only
        matching_chunks = [
            (i, chunk)
            for i, chunk in enumerate(self.chunks)
            if self._matches_filter(chunk, metadata_filter)
        ]

        if not matching_chunks:
            logger.warning(
                f"Pre-filter returned 0 results for filter: {metadata_filter}. "
                f"Consider whether the filter is too selective."
            )
            return []

        logger.debug(
            f"Pre-filter: {len(matching_chunks)}/{len(self.chunks)} chunks match "
            f"({100 * len(matching_chunks) / len(self.chunks):.1f}% selectivity)"
        )

        # Step 2: Search only the filtered subset
        q = np.array(query_embedding, dtype=np.float32)
        scored = []
        for orig_idx, chunk in matching_chunks:
            d = np.array(chunk.embedding, dtype=np.float32)
            score = float(np.dot(q, d))
            scored.append((chunk, score))

        scored.sort(key=lambda x: x[1], reverse=True)
        results = []
        for rank, (chunk, score) in enumerate(scored[:k], 1):
            if score >= score_threshold:
                results.append(SearchResult(chunk=chunk, score=score, rank=rank))

        return results

    def _matches_filter(self, chunk: DocumentChunk, filter_dict: dict) -> bool:
        """
        Simple metadata filter matching.
        Supports: exact match, list membership, range (gte/lte).
        """
        for key, condition in filter_dict.items():
            chunk_value = chunk.metadata.get(key)
            if chunk_value is None:
                return False

            if isinstance(condition, dict):
                # Range condition: {"gte": ..., "lte": ...}
                if "gte" in condition and chunk_value < condition["gte"]:
                    return False
                if "lte" in condition and chunk_value > condition["lte"]:
                    return False
                # Set membership: {"in": [...]}
                if "in" in condition and chunk_value not in condition["in"]:
                    return False
            else:
                # Exact match
                if chunk_value != condition:
                    return False

        return True

    @property
    def stats(self) -> dict:
        return {
            "total_chunks":    len(self.chunks),
            "dimensions":      self.config.dimensions,
            "index_type":      self.config.index_type,
            "hnsw_m":          self.config.hnsw_m,
            "hnsw_ef_search":  self.config.hnsw_ef_search,
        }


# ─────────────────────────────────────────────
# Production Searcher (retry + timeout + explain)
# ─────────────────────────────────────────────

class ProductionVectorSearcher:
    """
    Production wrapper around a vector database with:
      - Retry on transient failures
      - Configurable query timeout
      - Search result explanation via Claude
      - Latency tracking
    """

    def __init__(
        self,
        vector_db: MockVectorDB,
        anthropic_api_key: str,
        max_retries: int = 3,
        timeout_seconds: float = 5.0,
    ):
        self.db = vector_db
        self.max_retries = max_retries
        self.timeout_seconds = timeout_seconds
        self.claude = anthropic.Anthropic(api_key=anthropic_api_key)
        self._query_latencies: list[float] = []

    def search_with_retry(
        self,
        query_embedding: list[float],
        k: int = 5,
        metadata_filter: Optional[dict] = None,
    ) -> list[SearchResult]:
        """Search with exponential backoff retry for transient failures."""
        for attempt in range(self.max_retries):
            try:
                t0 = time.time()

                if metadata_filter:
                    results = self.db.filtered_search(
                        query_embedding, metadata_filter, k=k
                    )
                else:
                    results = self.db.search(query_embedding, k=k)

                latency = (time.time() - t0) * 1000
                self._query_latencies.append(latency)
                logger.debug(f"Search completed in {latency:.1f}ms, {len(results)} results")
                return results

            except Exception as e:
                if attempt == self.max_retries - 1:
                    raise
                wait = 2 ** attempt
                logger.warning(f"Search failed (attempt {attempt + 1}): {e}. Retry in {wait}s")
                time.sleep(wait)

        raise RuntimeError("Search failed after all retries")

    def explain_result(
        self,
        query: str,
        result: SearchResult,
    ) -> str:
        """
        Use Claude (Haiku - cheap batch operation) to explain why a chunk
        was retrieved for a given query. Useful for debugging poor retrievals.
        """
        prompt = f"""A vector search system retrieved the following document chunk for a user query.
Explain in 1-2 sentences why this chunk was semantically relevant to the query.

Query: {query}

Retrieved chunk (relevance score: {result.score:.3f}):
{result.chunk.text}

Be specific about the semantic connection."""

        response = self.claude.messages.create(
            model="claude-haiku-4-5-20251001",  # Cheap model for utility tasks
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    @property
    def p50_latency_ms(self) -> float:
        if not self._query_latencies:
            return 0.0
        return float(np.percentile(self._query_latencies, 50))

    @property
    def p99_latency_ms(self) -> float:
        if not self._query_latencies:
            return 0.0
        return float(np.percentile(self._query_latencies, 99))


# ─────────────────────────────────────────────
# Complete RAG Pipeline
# ─────────────────────────────────────────────

class VectorRAGPipeline:
    """
    Production RAG pipeline combining vector search with Claude generation.

    Flow:
      1. Embed the user query (via your embedding model)
      2. Search the vector DB (with optional metadata filters)
      3. Format retrieved chunks as context
      4. Generate an answer with Claude, including citations
    """

    def __init__(
        self,
        searcher: ProductionVectorSearcher,
        anthropic_api_key: str,
        embedding_fn,  # Callable[[str], list[float]]
    ):
        self.searcher = searcher
        self.embedding_fn = embedding_fn
        self.claude = anthropic.Anthropic(api_key=anthropic_api_key)

    def retrieve(
        self,
        query: str,
        k: int = 5,
        metadata_filter: Optional[dict] = None,
    ) -> list[SearchResult]:
        """Embed query and retrieve top-k relevant chunks."""
        query_embedding = self.embedding_fn(query)
        return self.searcher.search_with_retry(
            query_embedding, k=k, metadata_filter=metadata_filter
        )

    def generate_with_citations(
        self,
        query: str,
        retrieved: list[SearchResult],
    ) -> dict:
        """
        Generate a cited answer using Claude.

        Uses structured XML context for reliable citation extraction.
        Claude is instructed to reference specific document indices.
        """
        if not retrieved:
            return {
                "answer": "No relevant documents found for this query.",
                "citations": [],
            }

        # Build XML-structured context
        context_blocks = []
        for result in retrieved:
            block = (
                f'<document id="{result.chunk.chunk_id}" '
                f'source="{result.chunk.document_id}" '
                f'relevance="{result.score:.3f}">\n'
                f'{result.chunk.text}\n'
                f'</document>'
            )
            context_blocks.append(block)

        context = "\n\n".join(context_blocks)

        prompt = f"""You are a precise research assistant. Answer the question below using ONLY the provided documents.

Rules:
- Cite sources by referencing the document id attribute, e.g., [doc_001_chunk_0]
- If multiple documents support a claim, cite all of them
- If the documents don't contain enough information, say exactly that
- Be concise but complete

<context>
{context}
</context>

<question>{query}</question>

Provide your answer with inline citations."""

        response = self.claude.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )

        return {
            "answer": response.content[0].text,
            "citations": [
                {
                    "chunk_id":  r.chunk.chunk_id,
                    "document":  r.chunk.document_id,
                    "score":     round(r.score, 4),
                    "relevance": r.relevance_label,
                    "preview":   r.chunk.text[:120] + "...",
                }
                for r in retrieved
            ],
            "retrieval_stats": {
                "chunks_retrieved": len(retrieved),
                "top_score":  round(retrieved[0].score, 4) if retrieved else 0.0,
                "p50_search_ms": self.searcher.p50_latency_ms,
            },
        }

    def answer(
        self,
        query: str,
        k: int = 5,
        metadata_filter: Optional[dict] = None,
    ) -> dict:
        """Full RAG pipeline: retrieve + generate."""
        t0 = time.time()
        retrieved = self.retrieve(query, k=k, metadata_filter=metadata_filter)
        result = self.generate_with_citations(query, retrieved)
        result["total_latency_ms"] = round((time.time() - t0) * 1000, 1)
        return result


# ─────────────────────────────────────────────
# ANN Parameter Demonstration
# ─────────────────────────────────────────────

def demonstrate_ann_parameters():
    """
    Show how ef_search affects the recall vs. latency trade-off.
    In a real system, these parameters map directly to your vector DB config.
    """
    print("=== ANN Parameter Trade-off Demo ===\n")

    np.random.seed(42)
    n_vectors = 10_000
    dimensions = 128
    k = 10

    # Generate random unit vectors (simulating normalized embeddings)
    vectors = np.random.randn(n_vectors, dimensions).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    query = np.random.randn(dimensions).astype(np.float32)
    query /= np.linalg.norm(query)

    # Ground truth: exact brute-force top-k
    t0 = time.time()
    exact_scores = vectors @ query
    exact_top_k = set(np.argpartition(exact_scores, -k)[-k:].tolist())
    brute_force_ms = (time.time() - t0) * 1000

    print(f"Brute force (exact): {brute_force_ms:.2f}ms, recall=100%\n")

    # Simulate ANN with different ef_search values
    # (real HNSW would be faster - this shows the concept)
    print("ef_search | Simulated latency | Recall@10")
    print("-" * 45)

    for ef_search in [16, 32, 64, 128, 256]:
        # Simulate: sample ef_search candidates, compute scores on those only
        t0 = time.time()
        candidate_indices = np.random.choice(n_vectors, size=min(ef_search * 4, n_vectors), replace=False)
        candidate_scores = exact_scores[candidate_indices]
        top_candidates = candidate_indices[np.argpartition(candidate_scores, -k)[-k:]]
        approx_ms = (time.time() - t0) * 1000 + (ef_search * 0.01)

        # Measure recall vs. exact
        approx_set = set(top_candidates.tolist())
        recall = len(exact_top_k & approx_set) / k

        print(f"    {ef_search:4d}  |   {approx_ms:6.2f}ms          | {recall:.1%}")

    print(f"\nConclusion: higher ef_search → better recall, slower queries.")
    print("Choose ef_search based on your latency SLA and recall requirements.")


def demonstrate_filtering():
    """Show pre-filtering vs. post-filtering behavior."""
    print("\n=== Pre-filtering vs. Post-filtering Demo ===\n")

    config = VectorSearchConfig(dimensions=4)
    db = MockVectorDB(config)

    # Create documents: 100 for org_A, 900 for org_B
    chunks = []
    np.random.seed(42)

    for i in range(100):
        vec = np.random.randn(4).astype(np.float32)
        vec /= np.linalg.norm(vec)
        chunks.append(DocumentChunk(
            chunk_id=f"a_{i}",
            document_id=f"doc_a_{i}",
            text=f"Organization A document {i}: confidential data",
            embedding=vec.tolist(),
            metadata={"org_id": "org_A", "doc_type": "confidential"},
        ))

    for i in range(900):
        vec = np.random.randn(4).astype(np.float32)
        vec /= np.linalg.norm(vec)
        chunks.append(DocumentChunk(
            chunk_id=f"b_{i}",
            document_id=f"doc_b_{i}",
            text=f"Organization B document {i}: public data",
            embedding=vec.tolist(),
            metadata={"org_id": "org_B", "doc_type": "public"},
        ))

    db.add_documents(chunks)

    query_vec = np.random.randn(4).astype(np.float32)
    query_vec /= np.linalg.norm(query_vec)

    # Post-filtering (wrong approach): search all, filter after
    all_results = db.search(query_vec.tolist(), k=5)
    org_a_in_top5 = [r for r in all_results if r.chunk.metadata["org_id"] == "org_A"]
    print(f"Post-filtering: searched all 1000, top-5 contained {len(org_a_in_top5)} org_A docs")
    print(f"  Risk: {5 - len(org_a_in_top5)} results were org_B and got filtered out - we wasted them")

    # Pre-filtering (correct approach): filter first, search filtered subset
    filtered_results = db.filtered_search(
        query_vec.tolist(),
        metadata_filter={"org_id": "org_A"},
        k=5,
    )
    print(f"\nPre-filtering: searched only org_A subset (100 docs), got {len(filtered_results)} results")
    all_org_a = all(r.chunk.metadata["org_id"] == "org_A" for r in filtered_results)
    print(f"  All results are org_A: {all_org_a}")
    print(f"  Correct behavior: tenant isolation guaranteed")


if __name__ == "__main__":
    demonstrate_ann_parameters()
    demonstrate_filtering()

Scaling Considerations

Memory Budgeting

Before deploying, calculate your memory requirements:

def estimate_memory_gb(
    num_vectors: int,
    dimensions: int,
    index_type: str = "hnsw",
    hnsw_m: int = 32,
) -> dict:
    """Estimate RAM requirements for a vector index."""
    bytes_per_float = 4
    raw_bytes = num_vectors * dimensions * bytes_per_float
    raw_gb = raw_bytes / (1024 ** 3)

    if index_type == "hnsw":
        # HNSW overhead: raw vectors + graph structure
        # Graph: approximately M * 8 bytes per node (link storage)
        graph_bytes = num_vectors * hnsw_m * 8 * 2  # bidirectional
        graph_gb = graph_bytes / (1024 ** 3)
        total_gb = raw_gb + graph_gb
    elif index_type == "ivf":
        # IVF: raw vectors + centroid storage (small)
        total_gb = raw_gb * 1.05
    elif index_type == "flat":
        total_gb = raw_gb  # No index overhead

    return {
        "num_vectors":  f"{num_vectors:,}",
        "raw_vectors_gb":   round(raw_gb, 2),
        "total_with_index_gb": round(total_gb, 2),
        "recommended_server_ram_gb": round(total_gb * 1.4, 1),  # 40% headroom
    }

for n in [100_000, 1_000_000, 10_000_000]:
    est = estimate_memory_gb(n, 1024, "hnsw", hnsw_m=32)
    print(f"{est['num_vectors']} vectors: {est['total_with_index_gb']} GB "
          f"(need {est['recommended_server_ram_gb']} GB server)")

Distributed Search: When One Machine Is Not Enough

When your vector index exceeds available RAM on a single server, you need distributed search:

Horizontal sharding: Split your vector collection across multiple servers (shards), each holding a subset of vectors. Each query goes to all shards in parallel, and results are merged and re-ranked. Both Qdrant and Milvus support this natively.

Disk-based indexes: DiskANN (used by Milvus) stores the vector index on disk rather than RAM, enabling 100M+ vector indexes on a single machine with ~16GB RAM. Latency increases (disk access), but cost drops dramatically.

Quantization: IVF+PQ compresses vectors 32–128x, enabling 1B+ vector indexes in RAM that would otherwise require terabytes.

Index Build Time vs. Query Time

There is an inherent tension in HNSW between build quality and build time:

ef_construction	Build time (1M vectors)	Query latency	Recall@10
64	~15 min	8ms	93%
100	~25 min	8ms	96%
200	~45 min	8ms	98%
400	~90 min	8ms	99%+

Build-time parameters are set once. Query-time parameters (ef_search) can be tuned at runtime. Build with high ef_construction (200+) and tune ef_search per your latency SLA.

Common Mistakes

:::danger Using Brute Force at Scale Exact nearest neighbor search is O(n) per query. What works at 10K documents will be 100x slower at 1M documents. Always use ANN indexing (HNSW or IVF) for production datasets above 50K vectors. Add performance benchmarks to your staging environment before launch. :::

:::danger Post-filtering Breaks Recall Post-filtering (search everything, then discard non-matching results) silently fails when filters are selective. If only 1% of your corpus matches a filter, a top-5 post-filtered search might return 0 results even though there are relevant documents. Always use pre-filtering support in your vector database. :::

:::warning Index Type Mismatch for Your Scale HNSW is excellent for up to ~50M vectors. For 100M+, you need DiskANN or IVF+PQ. Choosing HNSW at 500M vectors means spending $50K+/month on RAM. Profile your expected scale before committing to an index type. :::

:::tip Tune ef_search Per Query Type Complex analytical questions deserve higher ef_search (slower, better recall). Simple factual lookups can use lower ef_search (faster, slightly lower recall). Some vector databases allow per-query ef_search tuning - use it. :::

:::tip Score Thresholds Prevent Noise Always set a minimum score threshold (e.g., 0.40) to prevent low-relevance chunks from reaching the generation step. A chunk with 0.20 cosine similarity is probably noise. Passing noise to Claude wastes tokens and degrades answer quality. Start with a threshold of 0.50 and tune based on your domain. :::

:::warning Dimension Counts Affect Search Speed HNSW search time scales with vector dimensions. A 3072-dim index is roughly 3x slower to search than a 1024-dim index for the same dataset size. Budget for this when comparing embedding models - the quality gain from higher dimensions may not offset the latency cost. :::

Interview Questions and Answers

Q1: Explain why brute-force vector search breaks at scale and how HNSW solves it.

Brute-force search computes the dot product between the query vector and every vector in the database. The computation is $O(n \cdot d)$ - linear in both the number of vectors and dimensions. At 1M vectors and 1024 dimensions, that is 1 billion floating-point operations per query. On a modern CPU, that takes roughly 4 seconds. HNSW solves this by building a layered graph index during ingestion. The graph has multiple layers: the top layers are sparse with long-range connections (for coarse navigation), and the bottom layer is dense with local connections (for fine-grained search). A query starts at the top, greedily navigates toward the closest known node at each layer, and descends to find the local neighborhood at the bottom. The expected search time is $O(\log n)$ rather than $O(n)$ . The trade-off: HNSW is approximate - it may miss a small fraction (1–5%) of the truly nearest neighbors depending on the ef_search parameter. For RAG, this trade-off is almost always worth it.

Q2: How would you choose between Pinecone, Qdrant, and pgvector for a production RAG system?

The decision depends on three factors: scale, operational expertise, and existing infrastructure. Pinecone is right when you want zero infrastructure management and can afford the managed premium - it is the fastest path to production and handles scaling automatically. Qdrant is right when performance-per-dollar matters and your team can operate Docker containers or Kubernetes - it is open-source, written in Rust for maximum performance, and has excellent filtering support. pgvector is right when your dataset is under 5M vectors and you are already running PostgreSQL - you get vector search with zero new infrastructure, and you can join vector search results with your relational data in a single SQL query. For most early-stage teams: start with Qdrant Cloud or Pinecone serverless, measure your actual scale, then make a more informed decision at 10M+ vectors.

Q3: A user can only access their own organization's documents. How do you implement this in a vector database?

This is a pre-filtering problem. Every document chunk is stored with an org_id metadata field. When a user makes a query, you inject a mandatory filter: must_match(org_id == user.org_id). The vector database applies this filter before or during ANN search, ensuring only the user's organization's documents are considered. The critical implementation detail: this must be pre-filtering, not post-filtering. Post-filtering searches all documents (including other organizations' data) and then discards non-matching results - this is both a security risk (you retrieved other orgs' data) and a correctness problem (selective filters starve your top-k results). In Qdrant, use the Filter object with must conditions. In Pinecone, use metadata filter dicts. In pgvector, add a WHERE org_id = $1 clause. Never rely on post-filtering for security-sensitive isolation.

Q4: What is the ef_search parameter in HNSW and how do you tune it?

ef_search controls how many candidate nodes HNSW explores during a search query. It is the primary knob for trading query latency against retrieval recall. A low ef_search (e.g., 16) means the search terminates quickly but may miss some of the truly nearest neighbors. A high ef_search (e.g., 400) means the search explores more of the graph, approaching exact search quality but at higher latency. To tune it: run your benchmark queries at multiple ef_search values, measure both P99 latency and Recall@10 (percentage of ground-truth top-10 results returned). Plot the recall-latency curve. Choose the ef_search where the latency meets your SLA (e.g., P99 under 50ms) while maximizing recall. For most RAG applications, ef_search in the range of 64–128 achieves 95–98% recall at acceptable latency.

Q5: How would you handle a RAG system where documents need to be updated in real-time?

HNSW was originally designed for static datasets - adding vectors after index build can degrade index quality over time. Modern vector databases handle this differently. Qdrant supports real-time updates with background index optimization - you can insert new vectors immediately and they are searchable, but the index optimizer runs asynchronously to maintain index quality. Pinecone's serverless tier is designed for real-time updates. For pgvector, updates are just SQL INSERTs and the index is always current. The key operational consideration: if your update rate is very high (thousands per second), monitor your vector DB's write throughput and index freshness metrics. Some systems have an "upsert lag" - time between write and searchability. For most RAG systems with document-level update rates (not sensor/event data), real-time updates are straightforward with any modern vector database.

Q6: How much RAM does a production vector index require? Walk me through the calculation.

For HNSW at 1 million vectors, 1024 dimensions:

Raw vector storage: 1M × 1024 × 4 bytes = 4GB. The HNSW graph structure adds roughly M × 8 bytes per node bidirectionally. With M=32: 1M × 32 × 8 × 2 = 512MB for the graph. Total: approximately 4.5GB. Add 40% headroom for the OS, query processing buffers, and metadata: you need about 6–7GB of RAM for comfortable operation. At 10M vectors and the same dimensions, multiply by 10: roughly 60–70GB - you need a dedicated server with 64–128GB RAM. At 100M vectors, HNSW memory (400–600GB) becomes prohibitive for most teams; that is when you consider IVF+PQ (which can compress to 10–15GB through quantization) or DiskANN (which moves the index to disk).

Summary

Vector search is where the theoretical elegance of embeddings meets the practical reality of production systems. The key principles:

Never use brute force at scale - switch to HNSW or IVF before you hit 50K vectors.
Understand your index trade-offs - HNSW for medium scale, IVF+PQ for massive scale.
Always pre-filter, never post-filter - for security isolation and correctness.
Choose your vector database based on your actual constraints - scale, team expertise, existing infrastructure.
Tune ef_search at runtime - it is the primary recall vs. latency knob and does not require rebuilding the index.
Budget RAM carefully - a 1024-dim HNSW index for 10M vectors needs 60GB+ of RAM.

The team from the opening story was not wrong to build their system - they were wrong not to load-test at realistic scale before launch. ANN indexing is not an optimization. It is a requirement for any vector search system that will grow beyond 50K documents.

From 50ms to 8 Seconds​

Vector Search Fundamentals​

Exact vs. Approximate: The Core Trade-off​

Distance Metrics: Which One and When​

Top-k: What k Should You Use?​

ANN Index Types: A Deep Dive​

HNSW: The Modern Default​

IVF: Inverted File Index​

Product Quantization (PQ): Compression for Scale​

Flat Index: When Exact Search Makes Sense​

Vector Database Landscape​

Pinecone​

Qdrant​

Weaviate​

pgvector​

Chroma​

Milvus​

Metadata Filtering: A Deep Dive​

Pre-filtering vs. Post-filtering​

Common Filter Patterns​

Filtering Performance Rule of Thumb​

Architecture and Diagrams​

HNSW Structure Visualization​

Vector DB Selection Decision Tree​

Pre-filtering vs. Post-filtering Comparison​

Production Code​

Scaling Considerations​

Memory Budgeting​

Distributed Search: When One Machine Is Not Enough​

Index Build Time vs. Query Time​

Common Mistakes​

Interview Questions and Answers​

Summary​