What is vector database scaling?

Architect horizontal sharding, replication, consistent hashing, hot-cold tiering, distributed HNSW, geographic distribution, and backup strategies for production vector databases at billion-vector scale.

How does horizontal sharding work in practice?

Scaling Vector Databases to Billions of Vectors covers vector database scaling, horizontal sharding, consistent hashing from first principles with code examples. Free lesson at https://engineersofai.com/docs/ai-systems/vector-database-engineering/faiss-for-self-hosted-search

What is the difference between vector database scaling and consistent hashing?

See the full breakdown at https://engineersofai.com/docs/ai-systems/vector-database-engineering/faiss-for-self-hosted-search

:::tip 🎮 Interactive Playground Visualize this concept: Try the FAISS Index Types demo on the EngineersOfAI Playground - no code required. :::

Scaling Vector Databases to Billions of Vectors

The Billion-Vector Problem

The engineering lead at a large e-commerce platform opens a capacity planning document. Their product catalog vector search is at 50 million embeddings and growing at 2 million per week. At current growth, they will hit 300 million embeddings in a year. Their current single-node Qdrant instance has 256 GB RAM - enough for 50 million 768-dimensional float32 vectors (150 GB) plus HNSW graph overhead, but nowhere near enough for 300 million.

The naive solution - buy a larger machine - hits a ceiling. The largest practical server has 12 TB RAM. At 768 dimensions with float32, that holds about 4 billion vectors, but costs $50,000/month on cloud. And at 5 billion vectors, even that limit fails.

The real solution requires a fundamentally different architecture: sharding the data across multiple nodes so that the total dataset size is unbounded. But vector database sharding has challenges that relational database sharding does not: queries must span shards (since the nearest neighbor could be on any shard), HNSW graph properties degrade when graphs are split, and geographic distribution adds consistency complexity that does not exist in single-region deployments.

This lesson gives you the complete architecture for vector database deployments that scale horizontally without degrading recall or exploding latency.

Why Scaling Is Different for Vector Databases

Relational databases shard by key range or hash. A query for a specific user ID goes to exactly one shard - you know which shard before executing the query. For vector search, you do not know where the nearest neighbor is before executing the search. The nearest neighbor to a query vector could be on any shard. This means every search query must run on every shard and the results must be merged.

This changes the cost model: doubling the number of shards doubles the compute cost per query, not halves it (as it would for relational sharding). Horizontal scaling for vector databases trades memory efficiency for compute cost. You must choose shard count carefully based on data size, not query load.

The corollary: unlike relational databases where you can scale out to handle more queries, vector database sharding is primarily a memory capacity strategy, not a throughput strategy. For throughput, you add replicas; for capacity, you add shards.

Horizontal Sharding Strategies

Random Hash Sharding

Assign each vector to a shard based on a hash of its ID:

$\text{shard}(\text{id}) = \text{hash}(\text{id}) \mod N_{\text{shards}}$

Every query searches all $N$ shards and merges the top-K results. This produces the most uniform data distribution but forces a full scatter-gather on every query.

Best for: General-purpose scaling where you cannot predict query patterns, or when all queries need global search (no filtering that would eliminate most shards).

Semantic Clustering Sharding

Partition vectors by semantic region: run k-means clustering on a representative sample of the corpus, assign each shard to one or more centroids, and route each new vector to the shard whose centroid it is closest to.

For a query, identify the $n_{probe}$ nearest centroids and only search those shards. This enables queries to touch a fraction of shards rather than all shards.

Best for: Applications where queries tend to have semantic locality - most relevant results cluster in semantic regions. Reduces cross-shard query fan-out at the cost of index build complexity and potential hot spots (popular semantic regions over-fill a shard).

Tenant-Based Sharding

For multi-tenant applications: assign each tenant (or tenant group) to a shard. Queries are tenant-scoped and only touch one shard.

Best for: Multi-tenant SaaS where per-tenant isolation is already required. Zero cross-shard fan-out since every query knows its tenant's shard.

import hashlib
import numpy as np
from typing import List, Dict, Tuple

class VectorShardRouter:
    """
    Routes vectors and queries to the appropriate shard(s).
    """
    def __init__(self, n_shards: int, strategy: str = "hash"):
        self.n_shards = n_shards
        self.strategy = strategy

    def get_write_shard(self, vector_id: str) -> int:
        """Determine which shard to write a vector to."""
        if self.strategy == "hash":
            return int(hashlib.md5(vector_id.encode()).hexdigest(), 16) % self.n_shards
        raise ValueError(f"Unknown strategy: {self.strategy}")

    def get_query_shards(self, query_metadata: dict | None = None) -> List[int]:
        """
        Determine which shards to query.
        Hash strategy: all shards. Tenant strategy: one shard.
        """
        if self.strategy == "hash":
            return list(range(self.n_shards))
        elif self.strategy == "tenant" and query_metadata:
            tenant_id = query_metadata.get("tenant_id", "")
            shard = int(hashlib.md5(tenant_id.encode()).hexdigest(), 16) % self.n_shards
            return [shard]
        return list(range(self.n_shards))


class DistributedVectorSearch:
    """
    Scatter-gather vector search across multiple shards.
    Each shard is a separate Qdrant (or other vector DB) instance.
    """
    def __init__(self, shard_clients: List, router: VectorShardRouter):
        self.shards = shard_clients
        self.router = router

    async def search(
        self,
        query_vector: np.ndarray,
        k: int = 10,
        query_metadata: dict = None,
        k_per_shard: int = None,
    ) -> List[dict]:
        """
        Scatter query to relevant shards, gather and merge results.
        k_per_shard should be > k to ensure top-k is found after merge.
        """
        import asyncio

        if k_per_shard is None:
            k_per_shard = k * 2  # retrieve 2x per shard to compensate for merge loss

        target_shards = self.router.get_query_shards(query_metadata)

        # Scatter: search all target shards in parallel
        tasks = [
            self._search_shard(
                shard_id=shard_id,
                query_vector=query_vector,
                k=k_per_shard,
                query_metadata=query_metadata,
            )
            for shard_id in target_shards
        ]

        shard_results = await asyncio.gather(*tasks, return_exceptions=True)

        # Handle partial failures: if a shard fails, log and continue
        all_results = []
        for shard_id, result in zip(target_shards, shard_results):
            if isinstance(result, Exception):
                print(f"ERROR: Shard {shard_id} failed: {result}")
                # Degrade gracefully: continue with results from other shards
                continue
            all_results.extend(result)

        # Gather: sort all results by score, return top-k
        all_results.sort(key=lambda x: x["score"], reverse=True)
        return all_results[:k]

    async def _search_shard(
        self,
        shard_id: int,
        query_vector: np.ndarray,
        k: int,
        query_metadata: dict = None,
    ) -> List[dict]:
        """Search a single shard. Override with actual client call."""
        # Implementation depends on vector DB client
        # Example: Qdrant async client
        client = self.shards[shard_id]
        results = await client.search(
            collection_name="documents",
            query_vector=query_vector.tolist(),
            limit=k,
        )
        return [{"id": r.id, "score": r.score, "shard_id": shard_id} for r in results]

Replication for High Availability

Sharding addresses data volume. Replication addresses availability and read throughput.

Each shard should have 2–3 replicas. A write goes to all replicas of the shard; a read can go to any replica. If one replica fails, others continue serving.

Qdrant native sharding/replication:

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

# Qdrant handles sharding and replication natively
client = QdrantClient(url="http://qdrant-cluster:6333")

# Create collection with distributed storage
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=768,
        distance=Distance.COSINE,
    ),
    shard_number=4,       # number of shards across cluster nodes
    replication_factor=2, # each shard has 2 copies
    write_consistency_factor=1,  # write acknowledged after 1 replica confirms
    # For stronger consistency: write_consistency_factor=2
)

Hot-Cold Tiering

Not all vectors are queried equally. Recent documents are queried far more often than old ones. Product catalog vectors for discontinued products are almost never queried. Storing everything on expensive RAM-backed HNSW is wasteful.

Hot-cold tiering stores recent/popular vectors in RAM (hot tier) and old/infrequently-accessed vectors on SSD or object storage (cold tier).

Tiering Strategy

from datetime import datetime, timedelta

class HotColdTierManager:
    """
    Manages migration of vectors between hot (RAM/HNSW) and cold (SSD/DiskANN) tiers.
    """
    def __init__(
        self,
        hot_client,   # Qdrant in-memory collection
        cold_client,  # Qdrant on-disk collection or DiskANN
        hot_retention_days: int = 90,
        hot_capacity_gb: float = 200.0,
    ):
        self.hot_client = hot_client
        self.cold_client = cold_client
        self.hot_retention_days = hot_retention_days
        self.hot_capacity_gb = hot_capacity_gb

    def should_migrate_to_cold(self, vector_metadata: dict) -> bool:
        """Determine if a vector should be moved to cold tier."""
        # Time-based: vectors older than retention period
        created_at = datetime.fromisoformat(vector_metadata.get("created_at", ""))
        age_days = (datetime.utcnow() - created_at).days
        if age_days > self.hot_retention_days:
            return True

        # Access-based: vectors not queried in last N days
        last_accessed = vector_metadata.get("last_accessed_at")
        if last_accessed:
            last_access = datetime.fromisoformat(last_accessed)
            if (datetime.utcnow() - last_access).days > 30:
                return True

        return False

    async def search_across_tiers(
        self,
        query_vector: list,
        k: int = 10,
        search_cold: bool = True,
    ) -> list:
        """
        Search hot tier always; search cold tier when needed.
        Cold tier search adds latency - only do it when hot results are insufficient.
        """
        # Always search hot tier
        hot_results = self.hot_client.search(
            collection_name="hot_documents",
            query_vector=query_vector,
            limit=k,
        )

        if not search_cold or len(hot_results) >= k:
            return hot_results[:k]

        # Search cold tier to fill remaining slots
        cold_results = self.cold_client.search(
            collection_name="cold_documents",
            query_vector=query_vector,
            limit=k - len(hot_results),
        )

        # Merge and re-rank
        all_results = list(hot_results) + list(cold_results)
        all_results.sort(key=lambda x: x.score, reverse=True)
        return all_results[:k]

Memory Tiering in Qdrant

Qdrant supports on-disk storage natively, enabling RAM-SSD tiering within a single collection:

from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff,
    OptimizersConfigDiff
)

# Collection with on-disk vectors (SSD) but in-memory HNSW graph
# Good balance: graph traversal is fast (RAM), vector fetches from SSD
client.create_collection(
    collection_name="large_documents",
    vectors_config=VectorParams(
        size=768,
        distance=Distance.COSINE,
        on_disk=True,  # store raw vectors on disk, not RAM
    ),
    hnsw_config=HnswConfigDiff(
        on_disk=False,  # keep HNSW graph in RAM for fast traversal
        m=16,
        ef_construct=200,
    ),
    optimizers_config=OptimizersConfigDiff(
        memmap_threshold=10_000,  # vectors above this count use mmap
    ),
)

Consistent Hashing for Dynamic Shard Counts

Standard modulo hashing breaks when you add or remove shards: hash(id) % (N+1) reassigns most vectors to different shards, requiring a full data migration. Consistent hashing solves this.

import hashlib
import bisect

class ConsistentHashRing:
    """
    Consistent hashing ring for vector shard assignment.
    Adding/removing shards only migrates ~1/N of vectors.
    """
    def __init__(self, nodes: List[str], virtual_nodes: int = 150):
        """
        virtual_nodes: higher = more uniform distribution
        """
        self.ring = {}
        self.sorted_keys = []
        self.virtual_nodes = virtual_nodes

        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Add a new shard to the ring. Only migrates ~1/N of vectors."""
        for i in range(self.virtual_nodes):
            key = self._hash(f"{node}:{i}")
            self.ring[key] = node
            bisect.insort(self.sorted_keys, key)

    def remove_node(self, node: str) -> None:
        """Remove a shard. Its vectors must be migrated to adjacent shard."""
        for i in range(self.virtual_nodes):
            key = self._hash(f"{node}:{i}")
            if key in self.ring:
                del self.ring[key]
                self.sorted_keys.remove(key)

    def get_node(self, key: str) -> str:
        """Get the shard responsible for this key."""
        if not self.ring:
            raise RuntimeError("No nodes in ring")
        hash_key = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

Capacity Planning

Before choosing shard count, calculate the memory requirements:

Component	Formula	Example (10M vectors, d=768)
Raw vectors (float32)	n × d × 4 bytes	30.0 GB
HNSW graph (M=16)	n × M × 2 × 4 bytes	1.28 GB
Payload storage	n × avg_payload_size	1.0 GB (100 bytes/doc)
OS / buffer overhead	× 1.3–1.5	9.8 GB
Total		~42 GB per 10M vectors

For a 48 GB RAM machine (e.g., r7g.2xlarge): comfortable for 10M vectors, tight for 12M.

Number of shards needed for $N$ total vectors with $M_{per\_node}$ GB per node:

$n_{\text{shards}} = \lceil N \times 42 \text{ GB} / (10M \times M_{\text{per\_node}} \times 0.7) \rceil$

The 0.7 factor provides a 30% headroom for index builds, payload indexes, and growth.

Geographic Distribution

For globally distributed applications, you want vector search to happen near the user to minimize latency. Geographic distribution adds complexity:

Replicate full index to each region. All writes go to a primary region; async replication to secondary regions. Search is read-only so consistency lag is acceptable (usually 1–5 seconds).
Regional sharding. Store European users' data in EU region, US users' data in US region. Every query goes to the user's region only. Simple, but doesn't work if users search across all data.
Hierarchical search. Search local region first; if insufficient results, fan out to other regions. Adds complexity but provides good balance.

Backup and Restore

def backup_qdrant_collection(
    client: QdrantClient,
    collection_name: str,
    backup_path: str,
) -> None:
    """
    Snapshot a Qdrant collection to disk for backup/migration.
    Snapshots are consistent (point-in-time) and can be restored atomically.
    """
    snapshot = client.create_snapshot(collection_name=collection_name)
    print(f"Snapshot created: {snapshot.name}")
    # Download snapshot to backup_path using the Qdrant REST API
    # GET /collections/{collection_name}/snapshots/{snapshot_name}

def restore_qdrant_collection(
    client: QdrantClient,
    collection_name: str,
    snapshot_path: str,
) -> None:
    """Restore collection from snapshot."""
    client.recover_snapshot(
        collection_name=collection_name,
        location=snapshot_path,
    )

Backup strategy:

Daily snapshots stored in object storage (S3/GCS) - retained for 30 days
Snapshot before index rebuilds - enables rollback if rebuild degrades recall
Replicas are NOT backups - a logical corruption (bad data ingested) propagates to all replicas; only snapshots protect against this

Production Engineering Notes

Shard count is not easily changed. Adding shards after initial deployment requires re-distributing data. Consistent hashing helps, but it is still a significant operation. Over-shard initially by 2× your current needs to avoid frequent re-sharding.

Test failure scenarios in staging. Kill a replica and measure whether queries still complete. Kill a shard leader and measure how long failover takes. Most teams discover failover behavior during production incidents, not staging.

Monitor per-shard load distribution. With semantic sharding, some semantic regions will be more popular than others, creating hot shards. Monitor QPS per shard and rebalance sharding strategy if one shard handles 40%+ of queries.

Common Mistakes

:::danger Sharding without testing scatter-gather latency Sharding multiplies query latency: instead of one ANN search, you do N ANN searches in parallel plus a merge step. A single-shard query at 15ms becomes a 4-shard query at 20–25ms (parallel execution plus merge overhead) plus a 95th-percentile tail-latency spike when any one shard is slow. Always load test the scatter-gather pattern with realistic shard counts and query concurrency. :::

:::warning Treating replicas as a substitute for backups Replicas are for availability, not durability. If a bug in your ingestion pipeline writes corrupted vectors, all replicas receive the corrupt data. A snapshot taken before the corruption gives you a clean restore point. Schedule daily snapshots to object storage, test restore procedures quarterly. :::

:::tip Start with fewer larger shards The temptation is to shard aggressively "for future scale." But each additional shard adds scatter-gather overhead, operational complexity, and failure surface area. Start with 2–4 large shards and add more only when memory pressure or query latency makes it necessary. It is easier to shard a 2-shard cluster than to un-shard an over-sharded 16-shard cluster. :::

Interview Questions

Q1: How does sharding a vector database differ from sharding a relational database?

Relational database sharding is selective: a query for user_id=42 goes to exactly one shard. You know the target shard before executing. For vector search, you do not know where the nearest neighbor is - it could be on any shard. This requires scatter-gather: the query runs on ALL shards in parallel, each shard returns its local top-K, and the results are merged globally. Doubling shards doubles compute cost per query rather than halving it. Sharding a vector database is primarily a memory capacity strategy, not a throughput strategy. For throughput, you add replicas within a shard, not additional shards.

Q2: Explain hot-cold tiering for vector databases. When is it worth the complexity?

Hot-cold tiering stores recently-accessed or recently-ingested vectors in fast RAM-backed storage (hot tier) and old or infrequently-accessed vectors on SSD or disk (cold tier). It is worth the complexity when your access pattern follows a power law: 20% of vectors receive 80% of queries, and those vectors are predictable (e.g., by recency). For a news search system where articles older than 90 days receive less than 5% of searches, tiering 90% of the data to disk cuts RAM requirements by 90% while serving the hot 10% at full HNSW speed. The complexity cost: more complex query routing, potential latency spike when a user queries cold data, and migration logic to move vectors between tiers.

Q3: Why use consistent hashing for vector shard assignment instead of simple modulo hashing?

Modulo hashing: shard = hash(id) % N. When you add the (N+1)th shard, almost all vectors are assigned to different shards - a massive data migration. Consistent hashing places nodes on a hash ring. Adding a new shard only takes ownership of approximately 1/N of the vectors from adjacent positions on the ring. Removing a shard migrates only those vectors to the adjacent shard. For a 10-shard cluster, adding one shard migrates ~10% of data rather than ~90%. This makes cluster scaling operationally feasible without full data migration.

Q4: Design a backup strategy for a production vector database with 500M vectors.

Tier 1 - continuous protection: Qdrant's write-ahead log (WAL) for recovery from node crash within the last few minutes. Tier 2 - daily snapshots: nightly Qdrant snapshots stored to S3 with 30-day retention. A 500M vector collection at 768 dims ≈ 1.5 TB uncompressed; compressed snapshots are typically 30–50% smaller. Snapshot creation takes 15–30 minutes and is non-blocking (can run while collection serves traffic). Tier 3 - pre-operation snapshots: before any bulk re-indexing, model upgrade, or major schema change, take a manual snapshot. This is your rollback point. Tier 4 - restore testing: quarterly restore drill to a staging environment - verify restore completes in under 4 hours and recall@10 matches the primary.

Q5: Your search latency at 4 shards is 25ms p50. Adding 4 more shards (8 total) raises latency to 40ms. Why might this happen even though each shard is faster?

Several compounding effects: (1) Coordination overhead: with 8 shards, the scatter-gather coordinator makes 8 parallel RPCs instead of 4, and the merge step processes 2× as many candidates; (2) Tail latency amplification: the overall query only completes when the slowest shard responds - with 8 shards, the probability of at least one slow shard (cold cache, garbage collection, network blip) is twice as high as with 4 shards; (3) Per-shard ANN configuration: each shard now has 2× fewer vectors, so the HNSW graph is smaller and each ANN search returns a less representative sample of the global top-K, requiring more candidates per shard to maintain recall, which increases per-shard search time. To fix: reduce k_per_shard conservatively, ensure shard HNSW efSearch scales with shard size, and consider fewer larger shards if total dataset fits.

The Billion-Vector Problem​

Why Scaling Is Different for Vector Databases​

Horizontal Sharding Strategies​

Random Hash Sharding​

Semantic Clustering Sharding​

Tenant-Based Sharding​

Replication for High Availability​

Hot-Cold Tiering​

Tiering Strategy​

Memory Tiering in Qdrant​

Consistent Hashing for Dynamic Shard Counts​

Capacity Planning​

Geographic Distribution​

Backup and Restore​

Production Engineering Notes​

Common Mistakes​

Interview Questions​