Embeddings in Production

Reading time: ~35 min | Interview relevance: Very High | Target roles: ML Engineer, AI Engineer, Backend Engineer (AI), Platform Engineer

Opening: The 3am PagerDuty Alert

It is 3am on a Tuesday. Your phone buzzes. PagerDuty: "P1 - Search relevance degraded, CTR down 34%." You open the dashboard. Embedding latency is fine. Vector DB is healthy. But the results coming back from semantic search are wrong - a user searching for "Python async performance" is getting results about "Python snake care" and "async music production."

You dig in. Three weeks ago the content team published 8,000 new technical articles. The ingestion pipeline processed them fine - embeddings generated, stored, indexed. But nobody noticed that the new articles used terminology from a different domain than the original corpus. The embedding model, trained on older text, maps "async" in technical contexts identically to "async" in music contexts because the new content shifted the query distribution away from what the model was calibrated for.

By 4am you discover a second problem: 40% of the new articles were embedded using the wrong model version - a rollout script updated the model mid-batch without versioning the vectors. Now your index contains embeddings from two different model versions, and cosine similarity between them is meaningless. The vectors are geometrically incompatible.

This scenario plays out everywhere embeddings get deployed at scale. The embedding problem is not just "generate vectors and store them." It is an operational discipline involving pipeline architecture, caching strategy, staleness detection, version management, cost control, and continuous monitoring. This final lesson in the embeddings module covers all of it - the gap between "embeddings working in a notebook" and "embeddings running reliably in production."

By the end, you will be able to design a production embedding pipeline from document ingestion to query serving, select the right vector database for your constraints, build cost-efficient batch processing with caching, detect embedding drift before it degrades your users' experience, and architect incremental indexing so you never have to re-embed your entire corpus again.

Why Production Embeddings Are Hard

Generating embeddings is easy. A few lines of code and you have vectors. Running embeddings as a production system is a different problem entirely. Here is what breaks in practice:

Cost at scale. At the time of writing, OpenAI lists `text-embedding-3-large` at $0.13 per million tokens (`text-embedding-3-small` is cheaper, at $0.02). A corpus of 10 million documents averaging 500 tokens each is 5 billion tokens, so at the $0.13 rate it costs $650 to embed once. Re-embedding after a model upgrade costs another $650. Re-embedding because your chunking strategy changed costs another $650. Costs compound quickly without caching and careful version management.

Latency requirements conflict with batch efficiency. Embedding APIs are optimized for batches - sending 2,048 texts in one call avoids the per-request overhead of 2,048 single calls and is far more efficient. But your ingestion pipeline receives documents in real time. You need to buffer documents for batching while still meeting ingestion SLAs. On the query side, the embedding model call alone takes 50-200ms, so sub-10ms query embedding latency is only achievable for queries served from a cache.

Version incompatibility. When you upgrade your embedding model, old and new vectors cannot be compared - they live in different geometric spaces. A query embedded with model v2 will not find documents embedded with model v1, even if the content is a perfect match. Every model upgrade requires re-embedding the entire corpus and either a hard cutover or a dual-index serving period.

Staleness and freshness tradeoffs. Embeddings are snapshots of content. When a document is updated, its embedding is stale. How stale is too stale? For a news site, an article updated an hour ago with new information should probably be re-embedded. For a legal document updated to fix a typo, re-embedding adds cost with negligible benefit.

Distribution shift. Your corpus changes over time. New topics, new terminology, new user query patterns. The embedding model was trained on a specific distribution. As your content drifts, the model's geometric space may stop capturing the distinctions your users care about. This is detectable - but only if you instrument for it.
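The cost arithmetic from the first point can be sketched as a small helper. This is an illustrative calculation only: the function name `embedding_cost_usd` is made up for this example, and the price is the per-million-token figure quoted above, which will change over time.

```python
def embedding_cost_usd(
    num_docs: int,
    avg_tokens_per_doc: int,
    usd_per_million_tokens: float,
) -> float:
    """Cost of embedding a whole corpus once, in US dollars."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Figures quoted above: 10M documents, 500 tokens each, $0.13 per 1M tokens.
one_pass = embedding_cost_usd(10_000_000, 500, 0.13)

# Initial embedding + one model upgrade + one chunking change = three full passes.
three_passes = 3 * one_pass

print(f"one pass: ${one_pass:,.2f}, three passes: ${three_passes:,.2f}")
```

Plugging in your own corpus size and the current price for your model gives a quick upper bound on what an uncached full re-embed costs.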

The Full Production Pipeline

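Before diving into individual components, here is the complete pipeline from document to query result: documents are ingested and chunked, each chunk is checked against the embedding cache, cache misses are embedded in batches, vectors are written to the vector database along with registry metadata, and queries are embedded and searched against the same index.

Each of these stages is a component you need to design, deploy, and operate. The following sections walk through the non-obvious engineering decisions at each stage.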

Chunking Strategy

Before embedding anything, you chunk documents. Chunking is deceptively important - the chunk size directly determines what the embedding captures and whether retrieval will succeed.

from dataclasses import dataclass, field
from typing import List, Optional
import re
import hashlib


@dataclass
class Chunk:
    content: str
    chunk_id: str
    document_id: str
    start_char: int
    end_char: int
    chunk_index: int
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Stable ID based on content hash - idempotent re-processing
        content_hash = hashlib.sha256(self.content.encode()).hexdigest()[:16]
        self.chunk_id = f"{self.document_id}_{self.chunk_index}_{content_hash}"


class SemanticChunker:
    """
    Chunk by sentence boundaries, respecting a token budget.
    Uses a sliding window with configurable overlap.
    """

    def __init__(
        self,
        max_tokens: int = 512,
        overlap_tokens: int = 64,
        tokens_per_word: float = 1.3,  # rough estimate
    ):
        self.max_tokens = max_tokens
        self.overlap_tokens = overlap_tokens
        self.max_words = int(max_tokens / tokens_per_word)
        self.overlap_words = int(overlap_tokens / tokens_per_word)

    def _split_sentences(self, text: str) -> List[str]:
        # Simple but effective: split on sentence-ending punctuation
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        return [s.strip() for s in sentences if s.strip()]

    def chunk(self, text: str, document_id: str) -> List[Chunk]:
        sentences = self._split_sentences(text)
        chunks = []
        current_sentences = []
        current_word_count = 0
        chunk_index = 0
        char_offset = 0

        for sentence in sentences:
            word_count = len(sentence.split())

            if current_word_count + word_count > self.max_words and current_sentences:
                # Emit current chunk
                chunk_text = " ".join(current_sentences)
                chunks.append(Chunk(
                    content=chunk_text,
                    chunk_id="",  # filled by __post_init__
                    document_id=document_id,
                    start_char=char_offset,
                    end_char=char_offset + len(chunk_text),
                    chunk_index=chunk_index,
                ))
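                chunk_index += 1

                # Overlap: keep last N words worth of sentences
                overlap_sentences = []
                overlap_count = 0
                for s in reversed(current_sentences):
                    wc = len(s.split())
                    if overlap_count + wc <= self.overlap_words:
                        overlap_sentences.insert(0, s)
                        overlap_count += wc
                    else:
                        break

                # Advance the offset to where the next chunk starts. Offsets are
                # relative to the sentence-joined text (single spaces); with
                # overlap, the next chunk starts at the first overlapping sentence.
                char_offset += len(chunk_text) + 1
                if overlap_sentences:
                    char_offset -= len(" ".join(overlap_sentences)) + 1

                current_sentences = overlap_sentences
                current_word_count = overlap_count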

            current_sentences.append(sentence)
            current_word_count += word_count

        # Emit final chunk
        if current_sentences:
            chunk_text = " ".join(current_sentences)
            chunks.append(Chunk(
                content=chunk_text,
                chunk_id="",
                document_id=document_id,
                start_char=char_offset,
                end_char=char_offset + len(chunk_text),
                chunk_index=chunk_index,
            ))

        return chunks

:::tip Chunk Size Rule of Thumb
For most RAG applications: 256–512 tokens per chunk with 10–15% overlap. Shorter chunks (128 tokens) give higher precision retrieval but lose context. Longer chunks (1024 tokens) retain context but dilute the embedding signal - the model has to average over too many concepts.
:::
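The `Chunk.__post_init__` method above derives each chunk ID from the document ID, the chunk index, and a content hash. A minimal standalone sketch of why that helps with idempotent re-processing follows; `chunk_ids` is a hypothetical helper written for this example, not part of the chunker.

```python
import hashlib
from typing import List


def chunk_ids(document_id: str, chunks: List[str]) -> List[str]:
    """Build IDs in the same shape as Chunk.__post_init__: doc, index, content hash."""
    ids = []
    for index, text in enumerate(chunks):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        ids.append(f"{document_id}_{index}_{digest}")
    return ids


before = chunk_ids("doc-42", ["Intro text.", "Async basics.", "Summary."])
after = chunk_ids("doc-42", ["Intro text.", "Async basics, revised.", "Summary."])

# Positions whose ID changed are the only chunks that need re-embedding.
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(changed)
```

Editing one chunk in place changes only that chunk's ID. Note the limitation: because the index is part of the ID, inserting or removing a chunk shifts every later index and therefore changes all later IDs, even when their text is identical.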

Embedding Caching

Caching embeddings is one of the highest-ROI optimizations available. The insight: if a chunk's content and the embedding model have not changed, its embedding has not changed. A hash of the model version plus the content is therefore a reliable cache key.

import asyncio
import hashlib
import json
import time
from typing import Optional, List, Tuple
import numpy as np
import redis.asyncio as redis


class EmbeddingCache:
    """
    Two-level cache: in-memory LRU + Redis.
    Key: SHA-256 of (model_version, text_content)
    Value: serialized float32 numpy array
    """

    def __init__(
        self,
        redis_url: str,
        model_version: str,
        memory_cache_size: int = 10_000,
        ttl_seconds: int = 86400 * 30,  # 30 days
    ):
        self.redis_client = redis.from_url(redis_url, decode_responses=False)
        self.model_version = model_version
        self.ttl = ttl_seconds
        self._memory_cache: dict = {}
        self._memory_cache_size = memory_cache_size
        self._access_order: list = []

    def _cache_key(self, text: str) -> str:
        content = f"{self.model_version}:{text}"
        return f"emb:{hashlib.sha256(content.encode()).hexdigest()}"

    def _serialize(self, embedding: np.ndarray) -> bytes:
        return embedding.astype(np.float32).tobytes()

    def _deserialize(self, data: bytes) -> np.ndarray:
        return np.frombuffer(data, dtype=np.float32)

    def _evict_memory_lru(self):
        if len(self._access_order) >= self._memory_cache_size:
            oldest_key = self._access_order.pop(0)
            self._memory_cache.pop(oldest_key, None)

    async def get(self, text: str) -> Optional[np.ndarray]:
        key = self._cache_key(text)

        # L1: in-memory
        if key in self._memory_cache:
            # Move to end (most recently used)
            self._access_order.remove(key)
            self._access_order.append(key)
            return self._memory_cache[key]

        # L2: Redis
        data = await self.redis_client.get(key)
        if data is not None:
            embedding = self._deserialize(data)
            # Populate L1
            self._evict_memory_lru()
            self._memory_cache[key] = embedding
            self._access_order.append(key)
            return embedding

        return None

    async def set(self, text: str, embedding: np.ndarray) -> None:
        key = self._cache_key(text)
        data = self._serialize(embedding)

        # Write to both levels
        await self.redis_client.setex(key, self.ttl, data)

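        # Refresh recency if the key is already cached, so _access_order
        # never holds duplicate entries for the same key
        if key in self._memory_cache:
            self._access_order.remove(key)
        else:
            self._evict_memory_lru()
        self._memory_cache[key] = embedding
        self._access_order.append(key)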

    async def get_batch(
        self, texts: List[str]
    ) -> Tuple[List[Optional[np.ndarray]], List[int]]:
        """
        Returns (embeddings_or_none, cache_miss_indices).
        Callers only need to embed the cache misses.
        """
        results = [None] * len(texts)
        miss_indices = []

        # Check L1 first (no async needed)
        redis_miss_indices = []
        for i, text in enumerate(texts):
            key = self._cache_key(text)
            if key in self._memory_cache:
                results[i] = self._memory_cache[key]
            else:
                redis_miss_indices.append(i)

        # Batch Redis lookup for L1 misses
        if redis_miss_indices:
            keys = [self._cache_key(texts[i]) for i in redis_miss_indices]
            values = await self.redis_client.mget(keys)
            for i, (idx, value) in enumerate(zip(redis_miss_indices, values)):
                if value is not None:
                    results[idx] = self._deserialize(value)
                else:
                    miss_indices.append(idx)

        return results, miss_indices
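The in-memory layer above tracks recency with a plain list, where `remove()` scans the whole list on every hit. One possible alternative, sketched here under the assumption that the L1 layer only needs get/put semantics, is `collections.OrderedDict`, which updates recency in constant time. The `LRUCache` class is illustrative and not part of the `EmbeddingCache` above.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Fixed-capacity least-recently-used cache backed by an OrderedDict."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._items)


cache = LRUCache(capacity=2)
cache.put("a", [0.1])
cache.put("b", [0.2])
cache.get("a")         # "a" is now more recent than "b"
cache.put("c", [0.3])  # evicts "b"
```

Overwriting an existing key refreshes its recency without creating a second entry, which is the failure mode a list-based order tracker has to guard against explicitly.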

Cache Hit Rate in Practice

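A well-configured embedding cache can reach hit rates of 70-90% in RAG systems with repetitive traffic. Many queries are not unique - users ask similar questions repeatedly, and re-ingested documents often contain chunks that have not changed. Measure the hit rate on your own workload before relying on these figures; with rates in that range the cache typically pays for itself quickly.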

class CacheMetrics:
    """Track cache performance for monitoring dashboards."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.bytes_saved = 0  # estimate based on avoided API calls

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def api_calls_avoided(self) -> int:
        return self.hits

    def record_hit(self, embedding_dim: int = 1536):
        self.hits += 1
        # Each avoided API call saves ~4 bytes/dim for float32
        self.bytes_saved += embedding_dim * 4

    def record_miss(self):
        self.misses += 1

    def __repr__(self):
        return (
            f"CacheMetrics(hit_rate={self.hit_rate:.1%}, "
            f"hits={self.hits}, misses={self.misses}, "
            f"api_calls_avoided={self.api_calls_avoided})"
        )

Async Batch Embedding Pipeline

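The core challenge: embedding APIs prefer large batches, but documents arrive continuously. The solution is a buffered async pipeline. The worker below buffers jobs in an in-process `asyncio.Queue`; for multi-process or durable deployments, a Redis-backed queue plays the same role.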

import asyncio
import time
from dataclasses import dataclass, field
from typing import List, Dict, Callable, Awaitable
from openai import AsyncOpenAI
import numpy as np


@dataclass
class EmbeddingJob:
    job_id: str
    texts: List[str]
    model: str
    metadata: dict = field(default_factory=dict)
    created_at: float = field(default_factory=time.time)


@dataclass
class EmbeddingResult:
    job_id: str
    embeddings: List[np.ndarray]
    model: str
    token_count: int
    latency_ms: float
    cache_hits: int = 0
    cache_misses: int = 0


class AsyncEmbeddingWorker:
    """
    Pulls jobs from a queue, batches texts, calls the embedding API,
    updates cache, and stores results in the vector DB.
    """

    def __init__(
        self,
        openai_client: AsyncOpenAI,
        cache: "EmbeddingCache",
        vector_store: "VectorStore",
        model: str = "text-embedding-3-small",
        max_batch_size: int = 2048,
        max_wait_ms: int = 100,  # max time to wait for a full batch
        concurrency: int = 4,    # parallel API calls
    ):
        self.client = openai_client
        self.cache = cache
        self.vector_store = vector_store
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.semaphore = asyncio.Semaphore(concurrency)
        self._queue: asyncio.Queue = asyncio.Queue()
        self._running = False

    async def submit(self, job: EmbeddingJob) -> None:
        await self._queue.put(job)

    async def _embed_batch(
        self, texts: List[str]
    ) -> List[np.ndarray]:
        """Call OpenAI API with retry and exponential backoff."""
        for attempt in range(5):
            try:
                async with self.semaphore:
                    start = time.time()
                    response = await self.client.embeddings.create(
                        input=texts,
                        model=self.model,
                    )
                    latency = (time.time() - start) * 1000
                    embeddings = [
                        np.array(item.embedding, dtype=np.float32)
                        for item in sorted(response.data, key=lambda x: x.index)
                    ]
                    return embeddings
            except Exception as e:
                if attempt == 4:
                    raise
                wait = (2 ** attempt) + (0.1 * attempt)
                await asyncio.sleep(wait)

    async def process_job(self, job: EmbeddingJob) -> EmbeddingResult:
        start = time.time()

        # Check cache for each text
        cached_embeddings, miss_indices = await self.cache.get_batch(job.texts)

        # Only embed cache misses
        embeddings_to_compute = [job.texts[i] for i in miss_indices]
        computed = []
        total_tokens = 0

        if embeddings_to_compute:
            # Split into API-safe batches
            api_batch_size = self.max_batch_size
            for i in range(0, len(embeddings_to_compute), api_batch_size):
                batch = embeddings_to_compute[i:i + api_batch_size]
                batch_embeddings = await self._embed_batch(batch)
                computed.extend(batch_embeddings)

                # Estimate tokens (avg 1.3 tokens/word). For exact billing
                # figures, read response.usage.total_tokens in _embed_batch.
                total_tokens += int(
                    sum(len(t.split()) for t in batch) * 1.3
                )

        # Merge cached + computed
        for i, miss_idx in enumerate(miss_indices):
            cached_embeddings[miss_idx] = computed[i]
            # Write back to cache
            await self.cache.set(job.texts[miss_idx], computed[i])

        latency_ms = (time.time() - start) * 1000
        return EmbeddingResult(
            job_id=job.job_id,
            embeddings=cached_embeddings,  # type: ignore
            model=self.model,
            token_count=total_tokens,
            latency_ms=latency_ms,
            cache_hits=len(job.texts) - len(miss_indices),
            cache_misses=len(miss_indices),
        )

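    async def run(self):
        """Main worker loop."""
        self._running = True
        while self._running:
            try:
                job = await asyncio.wait_for(
                    self._queue.get(), timeout=self.max_wait_ms / 1000
                )
            except asyncio.TimeoutError:
                continue
            try:
                result = await self.process_job(job)
                # Store in vector DB. Assumption: the producer attaches the
                # Chunk objects for job.texts under job.metadata["chunks"],
                # because QdrantVectorStore.upsert() needs them with the result.
                await self.vector_store.upsert(result, job.metadata["chunks"])
            except Exception as e:
                print(f"Worker error: {e}")
            finally:
                # Always mark the job done so queue.join() cannot hang
                self._queue.task_done()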
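The `max_wait_ms` parameter above is described as the maximum time to wait for a full batch. A minimal sketch of that "size or time, whichever comes first" buffering follows. The `collect_batch` helper and its parameter names are assumptions for this example, not part of the worker.

```python
import asyncio
import time
from typing import List


async def collect_batch(
    queue: asyncio.Queue, max_items: int, max_wait_s: float
) -> List[str]:
    """Block for the first item, then keep pulling until the batch is full
    or max_wait_s has elapsed since the first item arrived."""
    batch = [await queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # deadline hit: ship a partial batch rather than stall
    return batch


async def demo() -> List[List[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(f"doc-{i}")
    first = await collect_batch(queue, max_items=3, max_wait_s=0.05)
    second = await collect_batch(queue, max_items=3, max_wait_s=0.05)
    return [first, second]


batches = asyncio.run(demo())
print(batches)
```

The first batch fills up immediately; the second one is shipped partially once the wait budget is spent. Tuning `max_wait_s` is the direct trade between ingestion latency and batch efficiency.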

Running Multiple Workers

async def run_embedding_pipeline(
    num_workers: int = 4,
    redis_url: str = "redis://localhost:6379",
    model_version: str = "text-embedding-3-small-v1",
):
    client = AsyncOpenAI()
    cache = EmbeddingCache(redis_url, model_version)
    vector_store = QdrantVectorStore()  # see below

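    workers = [
        AsyncEmbeddingWorker(client, cache, vector_store)
        for _ in range(num_workers)
    ]

    # Each worker creates its own private queue in __init__. Point them all
    # at one shared queue so a job submitted to any worker can be picked up
    # by whichever worker is free.
    shared_queue: asyncio.Queue = asyncio.Queue()
    for w in workers:
        w._queue = shared_queue

    # Run all workers concurrently
    await asyncio.gather(*(w.run() for w in workers))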
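For the worker pool to share load, all workers need to consume from the same queue, and the caller needs a way to know when the queue has drained. A minimal sketch of that pattern using only `asyncio` follows; the `worker` and `run_pool` functions are illustrative stand-ins, with a no-op `sleep(0)` in place of the real embedding and upsert calls.

```python
import asyncio
from typing import List


async def worker(queue: asyncio.Queue, done: List[str]) -> None:
    """Consume items forever; task_done() is called even if processing fails."""
    while True:
        item = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for embedding + vector store upsert
            done.append(item)
        finally:
            queue.task_done()


async def run_pool(items: List[str], num_workers: int = 3) -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    done: List[str] = []
    tasks = [asyncio.create_task(worker(queue, done)) for _ in range(num_workers)]

    for item in items:
        queue.put_nowait(item)

    await queue.join()  # returns once every queued item has been marked done
    for task in tasks:
        task.cancel()   # workers loop forever, so cancel them after draining
    await asyncio.gather(*tasks, return_exceptions=True)
    return done


processed = asyncio.run(run_pool([f"job-{i}" for i in range(10)]))
print(len(processed))
```

Calling `task_done()` in a `finally` block matters: if a failed job is never marked done, `queue.join()` waits forever.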

Incremental Indexing

Re-embedding your entire corpus after every change is expensive and unnecessary. Incremental indexing only processes new or changed documents.

import hashlib
from typing import Set, Dict, Optional
from datetime import datetime
import asyncpg


class DocumentRegistry:
    """
    Tracks document → embedding mapping.
    Enables incremental indexing: only re-embed changed documents.
    Schema:
        document_id TEXT PRIMARY KEY
        content_hash TEXT NOT NULL
        embedding_ids TEXT[] NOT NULL  -- one per chunk
        model_version TEXT NOT NULL
        embedded_at TIMESTAMPTZ
        last_modified TIMESTAMPTZ
    """

    def __init__(self, db_pool: asyncpg.Pool):
        self.db = db_pool

    def _content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    async def needs_reembedding(
        self,
        document_id: str,
        content: str,
        current_model_version: str,
    ) -> bool:
        """
        Returns True if the document needs to be re-embedded.
        Reasons: new document, content changed, model version changed.
        """
        row = await self.db.fetchrow(
            """
            SELECT content_hash, model_version
            FROM document_registry
            WHERE document_id = $1
            """,
            document_id,
        )

        if row is None:
            return True  # New document

        content_changed = row["content_hash"] != self._content_hash(content)
        model_changed = row["model_version"] != current_model_version

        return content_changed or model_changed

    async def register(
        self,
        document_id: str,
        content: str,
        embedding_ids: list[str],
        model_version: str,
    ) -> None:
        await self.db.execute(
            """
            INSERT INTO document_registry
                (document_id, content_hash, embedding_ids, model_version, embedded_at, last_modified)
            VALUES ($1, $2, $3, $4, NOW(), NOW())
            ON CONFLICT (document_id) DO UPDATE SET
                content_hash = EXCLUDED.content_hash,
                embedding_ids = EXCLUDED.embedding_ids,
                model_version = EXCLUDED.model_version,
                embedded_at = NOW(),
                last_modified = NOW()
            """,
            document_id,
            self._content_hash(content),
            embedding_ids,
            model_version,
        )

    async def get_stale_documents(
        self,
        current_model_version: str,
        limit: int = 1000,
    ) -> list[str]:
        """Find documents embedded with an older model version."""
        rows = await self.db.fetch(
            """
            SELECT document_id
            FROM document_registry
            WHERE model_version != $1
            ORDER BY embedded_at ASC
            LIMIT $2
            """,
            current_model_version,
            limit,
        )
        return [row["document_id"] for row in rows]

    async def delete(self, document_id: str) -> list[str]:
        """Returns embedding IDs to delete from vector store."""
        row = await self.db.fetchrow(
            "DELETE FROM document_registry WHERE document_id = $1 RETURNING embedding_ids",
            document_id,
        )
        return row["embedding_ids"] if row else []
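The decision logic behind the registry can be sketched without a database. The example below assumes the incoming documents are a full snapshot of the corpus (so anything missing from it should be deleted) and uses an in-memory dict in place of the `document_registry` table; `plan_incremental_index` and `IndexPlan` are hypothetical names for this illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class IndexPlan:
    to_embed: List[str] = field(default_factory=list)
    to_delete: List[str] = field(default_factory=list)
    unchanged: List[str] = field(default_factory=list)


def plan_incremental_index(
    registry: Dict[str, Tuple[str, str]],  # doc_id -> (content_hash, model_version)
    incoming: Dict[str, str],              # doc_id -> current content
    model_version: str,
) -> IndexPlan:
    plan = IndexPlan()
    for doc_id, text in incoming.items():
        # Re-embed if the document is new, its content changed,
        # or it was embedded with a different model version.
        if registry.get(doc_id) == (content_hash(text), model_version):
            plan.unchanged.append(doc_id)
        else:
            plan.to_embed.append(doc_id)
    # Documents known to the registry but absent from the snapshot were removed.
    plan.to_delete = [doc_id for doc_id in registry if doc_id not in incoming]
    return plan


registry = {
    "a": (content_hash("alpha"), "v1"),
    "b": (content_hash("beta"), "v1"),
    "c": (content_hash("gamma"), "v0"),  # embedded with an older model
    "e": (content_hash("epsilon"), "v1"),
}
incoming = {"a": "alpha", "b": "beta (edited)", "c": "gamma", "d": "delta"}
plan = plan_incremental_index(registry, incoming, model_version="v1")
print(plan)
```

Only `b` (edited), `c` (old model), and `d` (new) are sent to the embedding API; `a` is skipped, and `e` is scheduled for deletion from the vector store.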

Model Version Migration Strategy

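When you upgrade your embedding model, you cannot compare old and new vectors directly. A safe migration pattern is to build a second index with the new model in the background, keep serving queries from the old index (embedding queries with the old model) until the backfill is complete, then cut query traffic over to the new index and retire the old one.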

:::warning Never Mix Model Versions in One Index
A single vector index must contain embeddings from exactly one model. Mixing v1 and v2 embeddings produces incorrect similarity scores - cosine similarity between vectors from different spaces is geometrically meaningless. Use the DocumentRegistry to track which model version embedded each document.
:::
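One way to enforce this during an upgrade is to keep the old and new indexes separate and only switch query traffic once the new index is fully backfilled. The sketch below is a minimal illustration under that assumption; `MigrationRouter`, `IndexTarget`, and the collection names are hypothetical, and a real system would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexTarget:
    collection: str
    model_version: str  # queries must be embedded with this same model


class MigrationRouter:
    """Tracks a backfill into a new index and guards the cutover."""

    def __init__(self, live: IndexTarget):
        self.live = live
        self.candidate: Optional[IndexTarget] = None
        self._total = 0
        self._migrated = 0

    def start(self, candidate: IndexTarget, total_docs: int) -> None:
        if candidate.model_version == self.live.model_version:
            raise ValueError("candidate must use a different model version")
        self.candidate = candidate
        self._total = total_docs
        self._migrated = 0

    def record_migrated(self, count: int = 1) -> None:
        self._migrated = min(self._total, self._migrated + count)

    def query_target(self) -> IndexTarget:
        # Queries always go to exactly one index, never a mix of both.
        return self.live

    def cut_over(self) -> IndexTarget:
        """Switch traffic to the new index; returns the retired one."""
        if self.candidate is None or self._migrated < self._total:
            raise RuntimeError("backfill incomplete; refusing to cut over")
        retired, self.live, self.candidate = self.live, self.candidate, None
        return retired


router = MigrationRouter(IndexTarget("documents_v1", "v1"))
router.start(IndexTarget("documents_v2", "v2"), total_docs=3)
router.record_migrated(2)
try:
    router.cut_over()
    premature_blocked = False
except RuntimeError:
    premature_blocked = True

router.record_migrated(1)
retired = router.cut_over()
```

Because `query_target()` returns both the collection and its model version, the query path can pick the matching embedding model and never compares vectors across versions.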

Staleness Management

Not all documents need to be re-embedded at the same frequency. A staleness policy should be content-aware:

from enum import Enum
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


class StalenessPolicy(Enum):
    NEVER = "never"           # Static content (books, papers)
    WEEKLY = "weekly"         # Slow-changing docs (product docs)
    DAILY = "daily"           # News, blog posts
    ON_CHANGE = "on_change"   # Detected content modification
    ON_MODEL_UPGRADE = "on_model_upgrade"  # Only when model changes


@dataclass
class DocumentStalenessConfig:
    document_id: str
    policy: StalenessPolicy
    last_embedded: datetime
    content_version: int
    model_version: str


class StalenessManager:

    POLICY_TTL = {
        StalenessPolicy.NEVER: None,
        StalenessPolicy.WEEKLY: timedelta(days=7),
        StalenessPolicy.DAILY: timedelta(days=1),
        StalenessPolicy.ON_CHANGE: None,  # check content hash
        StalenessPolicy.ON_MODEL_UPGRADE: None,  # check model version
    }

    def is_stale(
        self,
        config: DocumentStalenessConfig,
        current_content_version: int,
        current_model_version: str,
        now: Optional[datetime] = None,
    ) -> bool:
        now = now or datetime.utcnow()

        # A model upgrade invalidates every vector regardless of policy:
        # old and new embeddings must never share an index.
        if config.model_version != current_model_version:
            return True

        if config.policy in (StalenessPolicy.NEVER, StalenessPolicy.ON_MODEL_UPGRADE):
            return False

        if config.policy == StalenessPolicy.ON_CHANGE:
            return config.content_version != current_content_version

        # Time-based policies (WEEKLY, DAILY)
        ttl = self.POLICY_TTL[config.policy]
        return ttl is not None and (now - config.last_embedded) > ttl

    def assign_policy(self, document_type: str, update_frequency: str) -> StalenessPolicy:
        """Heuristic policy assignment based on document characteristics."""
        if document_type in ("academic_paper", "book_chapter"):
            return StalenessPolicy.NEVER
        if document_type in ("news_article", "social_post"):
            return StalenessPolicy.DAILY
        if update_frequency == "high":
            return StalenessPolicy.ON_CHANGE
        return StalenessPolicy.WEEKLY
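For the `ON_CHANGE` policy, not every edit is worth a re-embed: a typo fix barely moves the vector. One possible heuristic, sketched here with the standard library's `difflib`, is to skip re-embedding when the text changed by less than some threshold. The `should_reembed` function and the 5% threshold are assumptions for illustration, not tuned values.

```python
import difflib
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and case so formatting-only edits do not count."""
    return re.sub(r"\s+", " ", text).strip().lower()


def change_ratio(old: str, new: str) -> float:
    """Fraction of the text that differs: 0.0 = identical, 1.0 = nothing shared."""
    matcher = difflib.SequenceMatcher(None, _normalize(old), _normalize(new))
    return 1.0 - matcher.ratio()


def should_reembed(old: str, new: str, min_change: float = 0.05) -> bool:
    return change_ratio(old, new) >= min_change


original = "The supplier shall deliver the goods within thirty days of reciept."
typo_fix = "The supplier shall deliver the goods within thirty days of receipt."
rewrite = "Payment terms changed: all invoices are now due within ten days."

print(should_reembed(original, typo_fix))
print(should_reembed(original, rewrite))
```

A character-level ratio is a blunt instrument. A short edit such as inserting "not" can flip the meaning while staying under the threshold, so for content where small edits matter (contracts, policies) it is safer to re-embed on any change.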

Vector Database Selection

Choosing the right vector database is one of the most consequential architectural decisions. The wrong choice is expensive to reverse.

Vector DB Comparison

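| Feature | pgvector | Qdrant | Weaviate | Pinecone |
| --- | --- | --- | --- | --- |
| Self-hosted | Yes | Yes | Yes | No |
| Managed cloud | Via Supabase/Neon | Yes | Yes | Yes |
| Max vectors (practical) | ~5M | 100M+ | 50M+ | Effectively unlimited (managed) |
| Quantization | halfvec, binary (0.7+) | Scalar (int8), product, binary | PQ, BQ, SQ | Not user-configurable |
| Hybrid search | Manual | Sparse+dense | Native BM25 | Sparse+dense |
| Payload filtering | SQL WHERE | Native | GraphQL | Metadata filters |
| Multi-tenancy | Schemas | Collections | Classes | Namespaces |
| Approx. cost at 10M vectors (rough estimate) | DB cost only | $200-500/mo | $300-600/mo | $700-1200/mo |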

Qdrant Production Setup

Qdrant is the recommended choice for most production deployments. It offers binary quantization, native payload filtering, and excellent performance at a competitive price.

from qdrant_client import AsyncQdrantClient, models
from qdrant_client.models import (
    VectorParams, Distance, QuantizationConfig,
    BinaryQuantization, BinaryQuantizationConfig,
    HnswConfigDiff, OptimizersConfigDiff,
)
import hashlib
import time

import numpy as np
from typing import List, Dict, Any


class QdrantVectorStore:

    def __init__(
        self,
        url: str = "http://localhost:6333",
        collection_name: str = "documents",
        embedding_dim: int = 1536,
        model_version: str = "text-embedding-3-small-v1",
    ):
        self.client = AsyncQdrantClient(url=url)
        self.collection_name = collection_name
        self.embedding_dim = embedding_dim
        self.model_version = model_version

    async def create_collection(self, use_binary_quantization: bool = True):
        """
        Create a Qdrant collection optimized for production.
        Binary quantization: 32x compression with ~5-10% accuracy loss.
        """
        quantization_config = None
        if use_binary_quantization:
            quantization_config = models.BinaryQuantization(
                binary=models.BinaryQuantizationConfig(
                    always_ram=True,  # Keep binary vectors in RAM
                )
            )

        await self.client.create_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=self.embedding_dim,
                distance=Distance.COSINE,
                on_disk=True,  # Store full vectors on disk, binary in RAM
            ),
            quantization_config=quantization_config,
            hnsw_config=HnswConfigDiff(
                m=16,               # Graph connections per node
                ef_construct=200,   # Construction quality (higher = better index)
                full_scan_threshold=10_000,
            ),
            optimizers_config=OptimizersConfigDiff(
                indexing_threshold=20_000,  # Start indexing after N vectors
                memmap_threshold=50_000,    # Use memory-mapped files above this
            ),
        )

    async def upsert(self, result: "EmbeddingResult", chunks: List["Chunk"]) -> None:
        """Store embeddings with chunk metadata as payload."""
        points = []
        for chunk, embedding in zip(chunks, result.embeddings):
            points.append(models.PointStruct(
                id=self._chunk_id_to_int(chunk.chunk_id),
                vector=embedding.tolist(),
                payload={
                    "chunk_id": chunk.chunk_id,
                    "document_id": chunk.document_id,
                    "content": chunk.content,
                    "chunk_index": chunk.chunk_index,
                    "model_version": self.model_version,
                    "embedded_at": int(time.time()),
                    **chunk.metadata,
                },
            ))

        await self.client.upsert(
            collection_name=self.collection_name,
            points=points,
            wait=True,
        )

    async def search(
        self,
        query_embedding: np.ndarray,
        top_k: int = 20,
        filter_conditions: dict = None,
        use_rescore: bool = True,
    ) -> List[Dict[str, Any]]:
        """
        Search with binary quantization + full-vector rescoring.
        The oversampling parameter controls how many candidates
        the binary index returns before rescoring with full vectors.
        """
        search_params = models.SearchParams(
            quantization=models.QuantizationSearchParams(
                ignore=False,
                rescore=use_rescore,
                oversampling=3.0,  # Fetch 3x candidates for rescoring
            )
        )

        filter_ = None
        if filter_conditions:
            must_conditions = [
                models.FieldCondition(
                    key=k,
                    match=models.MatchValue(value=v),
                )
                for k, v in filter_conditions.items()
            ]
            filter_ = models.Filter(must=must_conditions)

        results = await self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=top_k,
            query_filter=filter_,
            search_params=search_params,
            with_payload=True,
        )

        return [
            {
                "chunk_id": r.payload["chunk_id"],
                "document_id": r.payload["document_id"],
                "content": r.payload["content"],
                "score": r.score,
                "metadata": {k: v for k, v in r.payload.items()
                             if k not in ("chunk_id", "document_id", "content")},
            }
            for r in results
        ]

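    def _chunk_id_to_int(self, chunk_id: str) -> int:
        """Qdrant requires unsigned 64-bit integer or UUID point IDs."""
        # Use 64 bits of the digest. A 32-bit ID would start colliding (and
        # silently overwriting points) after only tens of thousands of chunks.
        return int(hashlib.md5(chunk_id.encode()).hexdigest()[:16], 16)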
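The `oversampling` and `rescore` parameters in `search()` above are easier to reason about with a toy model. The sketch below illustrates the general idea in pure Python - a coarse pass on sign bits, then an exact re-ranking of the shortlisted candidates - and is not a description of Qdrant's internal implementation. All function names here are made up for this example.

```python
import math
from typing import List, Sequence, Tuple


def binarize(vec: Sequence[float]) -> List[int]:
    """Keep one bit per dimension: is the component positive?"""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    return sum(x != y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    top_k: int,
    oversampling: float = 3.0,
) -> List[Tuple[int, float]]:
    """Coarse pass on binary codes, then rescore candidates with full vectors."""
    q_bits = binarize(query)
    n_candidates = min(len(vectors), max(top_k, int(top_k * oversampling)))
    by_hamming = sorted(
        range(len(vectors)), key=lambda i: hamming(q_bits, binarize(vectors[i]))
    )
    candidates = by_hamming[:n_candidates]
    rescored = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return [(i, cosine(query, vectors[i])) for i in rescored[:top_k]]


query = [1.0, 0.2, 0.1, 0.1]
docs = [
    [0.1, 0.1, 0.1, 1.0],      # same sign pattern as the query, different direction
    [1.0, 0.1, 0.1, 0.1],      # the true nearest neighbour
    [-1.0, -1.0, -1.0, -1.0],  # opposite signs
]

with_oversampling = search(query, docs, top_k=1, oversampling=2.0)
without_oversampling = search(query, docs, top_k=1, oversampling=1.0)
```

The first two documents have identical binary codes, so the coarse pass cannot tell them apart. With oversampling both reach the rescoring step and the exact cosine picks the right one; without it, whichever happens to come first in the coarse ranking wins.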

pgvector for Smaller Scale

If you are already running PostgreSQL and your corpus is under 5 million vectors, pgvector is the simplest choice:

-- pgvector setup
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_embeddings (
    id          BIGSERIAL PRIMARY KEY,
    chunk_id    TEXT UNIQUE NOT NULL,
    document_id TEXT NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector(1536),
    model_version TEXT NOT NULL,
    metadata    JSONB DEFAULT '{}',
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index (better than IVFFlat for most use cases)
CREATE INDEX ON document_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Similarity search
SELECT chunk_id, content, 1 - (embedding <=> $1::vector) AS score
FROM document_embeddings
WHERE model_version = $2
ORDER BY embedding <=> $1::vector
LIMIT 20;

import json
from typing import Optional

import asyncpg
import numpy as np


class PGVectorStore:

    def __init__(self, db_pool: asyncpg.Pool, model_version: str):
        self.db = db_pool
        self.model_version = model_version

    async def upsert(self, chunk_id: str, content: str, document_id: str,
                     embedding: np.ndarray, metadata: Optional[dict] = None) -> None:
        # pgvector accepts a text literal such as "[0.1,0.2,...]" cast to vector
        embedding_str = "[" + ",".join(str(x) for x in embedding.tolist()) + "]"
        await self.db.execute(
            """
            INSERT INTO document_embeddings
                (chunk_id, document_id, content, embedding, model_version, metadata)
            VALUES ($1, $2, $3, $4::vector, $5, $6)
            ON CONFLICT (chunk_id) DO UPDATE SET
                content = EXCLUDED.content,
                embedding = EXCLUDED.embedding,
                model_version = EXCLUDED.model_version,
                metadata = EXCLUDED.metadata
            """,
            chunk_id, document_id, content, embedding_str,
            # asyncpg sends json/jsonb as text by default, so serialize the dict
            self.model_version, json.dumps(metadata or {}),
        )

    async def search(
        self, query_embedding: np.ndarray, top_k: int = 20,
        filter_doc_type: str = None,
    ) -> list:
        embedding_str = "[" + ",".join(str(x) for x in query_embedding.tolist()) + "]"

        where_clause = "WHERE model_version = $2"
        params = [embedding_str, self.model_version]
        if filter_doc_type:
            params.append(filter_doc_type)
            where_clause += f" AND metadata->>'doc_type' = ${len(params)}"

        # Bind top_k as a query parameter instead of interpolating it into SQL
        params.append(int(top_k))
        rows = await self.db.fetch(
            f"""
            SELECT chunk_id, document_id, content,
                   1 - (embedding <=> $1::vector) AS score,
                   metadata
            FROM document_embeddings
            {where_clause}
            ORDER BY embedding <=> $1::vector
            LIMIT ${len(params)}
            """,
            *params,
        )
        return [dict(row) for row in rows]

Monitoring Embedding Quality in Production

Embedding quality degradation is silent - the pipeline keeps running, vectors keep being stored, and queries keep returning results. The results just get worse. You need active monitoring.

import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Tuple
import time


class EmbeddingQualityMonitor:
    """
    Detects distribution shift and cluster coherence degradation.
    Run periodically (hourly or daily) against a sample of your index.
    """

    def __init__(self, baseline_embeddings: np.ndarray):
        """
        baseline_embeddings: representative sample from a known-good period.
        Shape: (N, embedding_dim)
        """
        self.baseline_mean = baseline_embeddings.mean(axis=0)
        self.baseline_std = baseline_embeddings.std(axis=0)
        self.baseline_norms = np.linalg.norm(baseline_embeddings, axis=1)
        self.baseline_centroid_distances = self._centroid_distances(baseline_embeddings)

    def _centroid_distances(self, embeddings: np.ndarray) -> np.ndarray:
        centroid = embeddings.mean(axis=0)
        diffs = embeddings - centroid
        return np.linalg.norm(diffs, axis=1)

    def distribution_shift_score(self, current_embeddings: np.ndarray) -> float:
        """
        Measures how far the mean of the current embeddings has moved from
        the baseline mean. Returns the cosine distance between the two mean
        vectors: 0 = no shift, with a theoretical range of [0, 2].
        This is a cheap first-moment proxy, not a full Maximum Mean
        Discrepancy (MMD) estimate, so it misses changes in spread or shape.
        """
        current_mean = current_embeddings.mean(axis=0)
        # Cosine distance between mean vectors
        mean_shift = cosine(self.baseline_mean, current_mean)
        return float(mean_shift)

    def cluster_coherence(self, embeddings: np.ndarray, sample_size: int = 500) -> float:
        """
        Measures the average pairwise cosine similarity over a random sample
        (no clustering step is run; the whole sample is treated as one group).
        High coherence (>0.6) = semantically tight index.
        Low coherence (<0.4) = index is poorly organized.
        """
        if len(embeddings) > sample_size:
            indices = np.random.choice(len(embeddings), sample_size, replace=False)
            sample = embeddings[indices]
        else:
            sample = embeddings

        # Compute pairwise cosine similarities
        sim_matrix = cosine_similarity(sample)
        # Exclude diagonal (self-similarity = 1.0)
        mask = np.ones(sim_matrix.shape, dtype=bool)
        np.fill_diagonal(mask, False)
        return float(sim_matrix[mask].mean())

    def norm_distribution_check(self, current_embeddings: np.ndarray) -> dict:
        """
        Checks if embedding norms are within expected range.
        Sudden norm changes indicate model or preprocessing issues.
        """
        current_norms = np.linalg.norm(current_embeddings, axis=1)
        return {
            "baseline_mean_norm": float(self.baseline_norms.mean()),
            "current_mean_norm": float(current_norms.mean()),
            "norm_drift_pct": float(
                abs(current_norms.mean() - self.baseline_norms.mean())
                / self.baseline_norms.mean() * 100
            ),
            "within_tolerance": bool(
                abs(current_norms.mean() - self.baseline_norms.mean())
                / self.baseline_norms.mean() < 0.05  # 5% tolerance
            ),
        }

    def generate_health_report(self, current_embeddings: np.ndarray) -> dict:
        shift = self.distribution_shift_score(current_embeddings)
        coherence = self.cluster_coherence(current_embeddings)
        norm_check = self.norm_distribution_check(current_embeddings)

        status = "healthy"
        alerts = []

        if shift > 0.1:
            status = "warning"
            alerts.append(f"Distribution shift detected: {shift:.3f} (threshold: 0.1)")
        if shift > 0.25:
            status = "critical"
            alerts.append("Severe distribution shift - consider model retraining or re-embedding")

        if coherence < 0.3:
            alerts.append(f"Low cluster coherence: {coherence:.3f} - index may be poorly organized")

        if not norm_check["within_tolerance"]:
            alerts.append(f"Embedding norm drift: {norm_check['norm_drift_pct']:.1f}%")

        # Coherence and norm alerts must also surface in the status field
        if alerts and status == "healthy":
            status = "warning"

        return {
            "status": status,
            "distribution_shift": shift,
            "cluster_coherence": coherence,
            "norm_check": norm_check,
            "alerts": alerts,
            "timestamp": time.time(),
        }
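
`distribution_shift_score` compares only the two mean vectors. A kernel-based Maximum Mean Discrepancy (MMD) estimate also reacts to changes in spread and shape. The following is a minimal NumPy sketch of the biased squared-MMD estimator with an RBF kernel; the function names and the default bandwidth of `1 / dim` are assumptions for illustration, not part of the monitor class.

```python
from typing import Optional

import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        (a ** 2).sum(axis=1)[:, None]
        + (b ** 2).sum(axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point error
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd_squared(
    baseline: np.ndarray, current: np.ndarray, gamma: Optional[float] = None
) -> float:
    """Biased estimate of squared MMD between two embedding samples."""
    if gamma is None:
        gamma = 1.0 / baseline.shape[1]  # assumed default bandwidth
    k_bb = rbf_kernel(baseline, baseline, gamma).mean()
    k_cc = rbf_kernel(current, current, gamma).mean()
    k_bc = rbf_kernel(baseline, current, gamma).mean()
    return float(k_bb + k_cc - 2.0 * k_bc)


# Synthetic check: a sample from the same distribution vs. a shifted one
rng = np.random.default_rng(0)
baseline = rng.normal(size=(200, 16))
same = rng.normal(size=(200, 16))
shifted = rng.normal(loc=1.0, size=(200, 16))

score_same = mmd_squared(baseline, same)
score_shifted = mmd_squared(baseline, shifted)
```

The kernel matrices are quadratic in the sample size, so this is meant for samples of a few hundred to a few thousand vectors, in the same spirit as the `sample_size` cap in `cluster_coherence`.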

Online Metrics: Query-Level Monitoring

Beyond offline distribution analysis, monitor per-query signals in real time:

from dataclasses import dataclass
from typing import List
import time


@dataclass
class QueryMetrics:
    query_id: str
    query_text: str
    top_k_scores: List[float]  # cosine similarity scores
    latency_ms: float
    cache_hit: bool
    result_count: int
    timestamp: float = 0.0

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = time.time()

    @property
    def top_score(self) -> float:
        return max(self.top_k_scores) if self.top_k_scores else 0.0

    @property
    def score_gap(self) -> float:
        """Gap between top-1 and top-2 score - high gap = high confidence."""
        if len(self.top_k_scores) >= 2:
            sorted_scores = sorted(self.top_k_scores, reverse=True)
            return sorted_scores[0] - sorted_scores[1]
        return 0.0

    @property
    def low_confidence(self) -> bool:
        """Flag queries where retrieval confidence is low."""
        return self.top_score < 0.65  # Threshold depends on your embedding model


class QueryMetricsAggregator:
    """
    Aggregates query metrics for dashboard reporting.
    Emit to Prometheus/Datadog/CloudWatch.
    """

    def __init__(self, window_minutes: int = 60):
        self.window = window_minutes * 60  # seconds
        self._metrics: List[QueryMetrics] = []

    def record(self, metrics: QueryMetrics):
        self._metrics.append(metrics)
        # Evict old metrics
        cutoff = time.time() - self.window
        self._metrics = [m for m in self._metrics if m.timestamp > cutoff]

    def summary(self) -> dict:
        if not self._metrics:
            return {}

        scores = [m.top_score for m in self._metrics]
        latencies = [m.latency_ms for m in self._metrics]
        low_conf = sum(1 for m in self._metrics if m.low_confidence)
        cache_hits = sum(1 for m in self._metrics if m.cache_hit)

        return {
            "query_count": len(self._metrics),
            "avg_top_score": sum(scores) / len(scores),
            "p50_latency_ms": sorted(latencies)[len(latencies) // 2],
            "p99_latency_ms": sorted(latencies)[int(len(latencies) * 0.99)],
            "low_confidence_rate": low_conf / len(self._metrics),
            "cache_hit_rate": cache_hits / len(self._metrics),
        }

:::tip Alert Thresholds to Monitor

Set up alerts for:

1. avg_top_score drops below 0.7 for 5+ minutes - potential distribution shift.
2. low_confidence_rate exceeds 20% - many queries have no good matches, possibly new query patterns your corpus does not cover.
3. p99_latency_ms exceeds 500ms - embedding API or vector DB degradation.
4. cache_hit_rate drops below 30% - cache expiry too aggressive or query diversity increased.

:::
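
The thresholds in the tip above can be checked against a summary dictionary shaped like the output of `QueryMetricsAggregator.summary()`. This is a minimal sketch: the `evaluate_alerts` function and the `THRESHOLDS` table are hypothetical names, and the "for 5+ minutes" condition is assumed to be handled by the alerting system rather than by this function.

```python
from typing import Dict, List

# Values copied from the tip above; tune them for your own model and traffic.
THRESHOLDS = {
    "avg_top_score_min": 0.7,
    "low_confidence_rate_max": 0.20,
    "p99_latency_ms_max": 500.0,
    "cache_hit_rate_min": 0.30,
}


def evaluate_alerts(summary: Dict[str, float]) -> List[str]:
    """Return alert messages for one aggregation window."""
    if not summary:
        return []  # empty window: nothing to evaluate
    alerts = []
    if summary["avg_top_score"] < THRESHOLDS["avg_top_score_min"]:
        alerts.append("avg_top_score low: possible distribution shift")
    if summary["low_confidence_rate"] > THRESHOLDS["low_confidence_rate_max"]:
        alerts.append("low_confidence_rate high: corpus may not cover new queries")
    if summary["p99_latency_ms"] > THRESHOLDS["p99_latency_ms_max"]:
        alerts.append("p99_latency_ms high: embedding API or vector DB degraded")
    if summary["cache_hit_rate"] < THRESHOLDS["cache_hit_rate_min"]:
        alerts.append("cache_hit_rate low: check TTLs and query diversity")
    return alerts


window = {
    "query_count": 1200,
    "avg_top_score": 0.62,
    "p50_latency_ms": 40.0,
    "p99_latency_ms": 120.0,
    "low_confidence_rate": 0.25,
    "cache_hit_rate": 0.55,
}
active_alerts = evaluate_alerts(window)
```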

Cost Optimization

Embedding costs are controllable with the right strategy.

from enum import Enum
from dataclasses import dataclass
from typing import Callable
import numpy as np


class ModelTier(Enum):
    CHEAP = "text-embedding-3-small"    # $0.02/1M tokens
    STANDARD = "text-embedding-3-large" # $0.13/1M tokens
    PREMIUM = "voyage-3-large"          # $0.18/1M tokens (via Voyage)


@dataclass
class CostOptimizationConfig:
    # Use cheap model for short, simple queries
    cheap_model_max_tokens: int = 100
    # Use standard model for normal queries
    standard_model_max_tokens: int = 1000
    # Use premium model for long, complex documents
    premium_model_min_tokens: int = 1001


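class TieredEmbeddingClient:
    """
    Routes embedding requests to the cheapest suitable model
    based on text length and query type.

    Vectors from different tiers are not comparable: each tier needs its
    own index, and a query must use the same model as the index it searches.
    """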

    def __init__(self, config: CostOptimizationConfig = None):
        self.config = config or CostOptimizationConfig()
        self.costs = {
            ModelTier.CHEAP: 0.020 / 1_000_000,
            ModelTier.STANDARD: 0.130 / 1_000_000,
            ModelTier.PREMIUM: 0.180 / 1_000_000,
        }
        self._total_cost = 0.0
        self._token_counts = {tier: 0 for tier in ModelTier}

    def _select_model(self, text: str, is_query: bool = True) -> ModelTier:
        token_estimate = len(text.split()) * 1.3

        if is_query and token_estimate <= self.config.cheap_model_max_tokens:
            return ModelTier.CHEAP

        if token_estimate <= self.config.standard_model_max_tokens:
            return ModelTier.STANDARD

        return ModelTier.PREMIUM

    async def embed(
        self, text: str, is_query: bool = True
    ) -> tuple[np.ndarray, ModelTier]:
        tier = self._select_model(text, is_query)
        token_count = int(len(text.split()) * 1.3)

        # [actual API call would go here]
        embedding = np.random.randn(1536).astype(np.float32)  # placeholder
        embedding /= np.linalg.norm(embedding)

        # Track cost
        self._token_counts[tier] += token_count
        self._total_cost += token_count * self.costs[tier]

        return embedding, tier

    def cost_report(self) -> dict:
        return {
            "total_cost_usd": round(self._total_cost, 4),
            "tokens_by_tier": {tier.value: count for tier, count in self._token_counts.items()},
            "cost_by_tier": {
                tier.value: round(count * self.costs[tier], 4)
                for tier, count in self._token_counts.items()
            },
        }

Cost Levers in Priority Order

1. Cache aggressively - 70-90% hit rates are achievable. Every hit is free.
2. Use smaller models where quality allows - Queries are short (10-50 tokens), and text-embedding-3-small at $0.02/1M tokens is 6.5x cheaper than text-embedding-3-large. Queries and documents must share a model, so choose the tier per index, not per request.
3. Batch documents maximally - Send up to 2,048 texts per API call. Single-text calls have significant overhead.
4. Quantize stored vectors - Binary quantization uses 32x less memory than float32. At 10M vectors × 1536 dims × 4 bytes, that is 61GB → 1.9GB.
5. Use Matryoshka truncation - text-embedding-3 models support dimension reduction through the dimensions parameter at embedding time; documents and queries must use the same setting. 256 dims instead of 1536 = 6x smaller index with minimal accuracy loss for most tasks.
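
The storage figures in the list above follow from simple arithmetic. The helper below is a small sketch (the name `index_size_gb` is hypothetical) that counts raw vector bytes only and ignores HNSW graph and payload overhead.

```python
def index_size_gb(num_vectors: int, dims: int, bits_per_dim: int = 32) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9


full = index_size_gb(10_000_000, 1536)        # float32: 61.44 GB
binary = index_size_gb(10_000_000, 1536, 1)   # 1 bit per dim: 1.92 GB
truncated = index_size_gb(10_000_000, 256)    # 256-dim float32: 10.24 GB

print(f"float32: {full:.2f} GB, binary: {binary:.2f} GB, 256-dim: {truncated:.2f} GB")
```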

The Complete Production Architecture

Reranking: The Final Quality Layer

ANN search returns approximate nearest neighbors efficiently but sacrifices some accuracy. A cross-encoder reranker rescores the top-K candidates with higher accuracy - at the cost of calling a more expensive model on a smaller set.

from sentence_transformers import CrossEncoder
import numpy as np
from typing import List, Dict, Any


class CrossEncoderReranker:
    """
    Takes top-K ANN results and reranks with a cross-encoder.
    Cross-encoders jointly encode query+document (no separate embeddings)
    - more accurate but O(K) inference, not sublinear like ANN.
    """

    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n: int = 5,
    ):
        self.model = CrossEncoder(model_name)
        self.top_n = top_n

    def rerank(
        self,
        query: str,
        candidates: List[Dict[str, Any]],
    ) -> List[Dict[str, Any]]:
        if not candidates:
            return []

        # Prepare (query, passage) pairs
        pairs = [(query, c["content"]) for c in candidates]

        # Cross-encoder scores: higher = more relevant
        scores = self.model.predict(pairs)

        # Sort by reranker score
        reranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True,
        )

        return [
            {**candidate, "rerank_score": float(score)}
            for candidate, score in reranked[:self.top_n]
        ]


class RetrievalPipeline:
    """End-to-end: query → ANN → rerank → results."""

    def __init__(
        self,
        embedder: "AsyncEmbeddingWorker",
        vector_store: "QdrantVectorStore",
        reranker: CrossEncoderReranker,
        ann_top_k: int = 20,   # fetch 20 from ANN
        final_top_n: int = 5,  # return 5 after reranking
    ):
        self.embedder = embedder
        self.vector_store = vector_store
        self.reranker = reranker
        self.ann_top_k = ann_top_k
        self.final_top_n = final_top_n

    async def retrieve(
        self,
        query: str,
        filter_conditions: dict = None,
    ) -> List[Dict[str, Any]]:
        # 1. Embed query (real-time, not batched)
        query_embedding = await self._embed_query(query)

        # 2. ANN search
        candidates = await self.vector_store.search(
            query_embedding=query_embedding,
            top_k=self.ann_top_k,
            filter_conditions=filter_conditions,
        )

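        # 3. Rerank, then cap at final_top_n (the reranker's own top_n
        #    also applies, so the smaller of the two limits wins)
        reranked = self.reranker.rerank(query, candidates)
        return reranked[:self.final_top_n]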

    async def _embed_query(self, query: str) -> np.ndarray:
        # In production: use a dedicated low-latency query embedder
        # that skips the queue and calls the API directly
        from openai import AsyncOpenAI
        client = AsyncOpenAI()
        response = await client.embeddings.create(
            input=query,
            model="text-embedding-3-small",
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

Common Production Mistakes

:::danger Mixed Model Versions in One Index

The single most common production failure. Vectors from different embedding models are not comparable - they live in different geometric spaces. A migration that updates the model halfway through leaves the index in a broken state where half the queries return wrong results. Always: version your vectors, track model version per document, and do a hard cutover between model versions.

:::

:::danger No Cache Key Versioning

A cache key that is just a hash of the text content will return stale cached embeddings after a model upgrade - because the text has not changed but the model has. Always include the model version in the cache key: SHA256(model_version + text).

:::
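
A versioned cache key can be sketched in a few lines. The function name `embedding_cache_key`, the `emb:` prefix, and the NUL separator are assumptions for illustration; the separator is there so that the model version and the text cannot blur into each other when concatenated.

```python
import hashlib


def embedding_cache_key(model_version: str, text: str) -> str:
    """Cache key that changes whenever the model version or the text changes."""
    # Without a separator, ("v1", "2abc") and ("v12", "abc") would both
    # hash the string "v12abc" and collide.
    payload = f"{model_version}\x00{text}".encode("utf-8")
    return "emb:" + hashlib.sha256(payload).hexdigest()


old_key = embedding_cache_key("text-embedding-3-small@2024-01", "python async")
new_key = embedding_cache_key("text-embedding-3-large@2024-06", "python async")
```

After a model upgrade the same text produces a different key, so old entries are simply never read again and can age out through their TTL.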

:::warning Embedding at Query Time Without Timeout

Query embedding calls are synchronous and on the critical path. An embedding API outage or slow response directly degrades your query latency. Always set aggressive timeouts on query-time embedding calls (100-200ms max) and have a fallback strategy - either a locally-hosted model or a cached embedding for the most common queries.

:::
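
The timeout-plus-fallback pattern can be sketched with `asyncio.wait_for`. Everything here is illustrative: `embed_with_timeout` is a hypothetical helper, and the two embedders in the demo are stand-ins that simulate a slow API and a local model.

```python
import asyncio
from typing import Awaitable, Callable, List

Embedder = Callable[[str], Awaitable[List[float]]]


async def embed_with_timeout(
    query: str,
    primary: Embedder,
    fallback: Embedder,
    timeout_s: float = 0.2,
) -> List[float]:
    """Try the primary embedder; on timeout or connection error, fall back."""
    try:
        return await asyncio.wait_for(primary(query), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return await fallback(query)


async def _demo() -> List[float]:
    async def slow_api(q: str) -> List[float]:
        await asyncio.sleep(1.0)  # simulates a degraded embedding API
        return [1.0, 0.0]

    async def local_model(q: str) -> List[float]:
        return [0.0, 1.0]  # stand-in for a locally hosted model or cache lookup

    return await embed_with_timeout(
        "python async", slow_api, local_model, timeout_s=0.05
    )


result = asyncio.run(_demo())
```

A fallback model only helps if an index built with that same model exists, because vectors from different models cannot be compared.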

:::warning Ignoring Embedding Norms After Preprocessing

Common preprocessing steps like Unicode normalization, whitespace stripping, and HTML cleanup can change embedding norms in unexpected ways. Embed a test set before and after preprocessing changes and compare norm distributions. A 10%+ change in average norm is a signal that something structural changed.

:::

:::warning Re-Embedding Everything on Every Model Update

A model upgrade does not require re-embedding the entire corpus in one burst. Use the staleness policy system to order the work - re-embed high-priority documents first (recently accessed, frequently retrieved), then work through lower-priority documents over time, writing the new vectors to a separate index so that model versions never mix. This amortizes the cost across days or weeks instead of a spike.

:::

:::danger Vector DB Without Backup Strategy

Qdrant, Weaviate, and other vector DBs are stateful services. Losing the index means re-embedding everything from scratch - potentially days of work and significant cost. Set up snapshots (Qdrant supports collection snapshots via API) and test restoration before you need it.

:::

Interview Q&A

Q1: Explain the complete lifecycle of an embedding in a production RAG system, from document ingestion to query result.

Answer:

A document enters the pipeline through an ingestion endpoint or crawler. Before embedding, it passes through preprocessing - HTML stripping, Unicode normalization, deduplication - and then chunking into segments of 256-512 tokens with ~10% overlap. Each chunk gets a stable ID derived from a hash of its content, which enables idempotent processing.

The chunk is then checked against a document registry - a mapping of document ID, content hash, and model version. If the content hash and model version match a previous embedding, the chunk is skipped (incremental indexing). Otherwise, it enters an async embedding queue backed by Redis. Worker processes pull jobs from the queue, check the embedding cache (Redis-backed, keyed on SHA256 of model version + text), call the embedding API for cache misses, and write results back to both the cache and the vector store.

At query time, the flow reverses: the user's query is embedded in real time (bypassing the queue), passed to the ANN index for top-K approximate nearest neighbor search, and then the candidates are reranked by a cross-encoder for higher accuracy. The final results are merged with metadata from PostgreSQL and returned to the caller.

The critical operational concerns: every vector in the index must be from the same model version (enforced by the document registry), staleness policies determine when re-embedding is triggered, and the monitoring layer continuously checks for distribution shift, norm drift, and query confidence scores.
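
The chunking and stable-ID steps from this answer can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace-split words stand in for real tokenizer tokens, and the function names and the `document_id:hash` ID format are hypothetical.

```python
import hashlib
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 256, overlap: int = 25
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def stable_chunk_id(document_id: str, chunk: List[str]) -> str:
    """ID derived from the chunk content, so reprocessing is idempotent."""
    digest = hashlib.sha256(" ".join(chunk).encode("utf-8")).hexdigest()
    return f"{document_id}:{digest[:16]}"


tokens = [str(i) for i in range(600)]  # stand-in for a tokenized document
chunks = chunk_tokens(tokens, chunk_size=256, overlap=25)  # ~10% overlap
ids = [stable_chunk_id("doc-1", c) for c in chunks]
```

Because the ID depends only on the document ID and the chunk content, re-running ingestion on unchanged text produces the same IDs and can be skipped or safely upserted.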

Q2: How do you safely migrate to a new embedding model without service disruption?

Answer:

The core challenge is that vectors from different embedding models are geometrically incompatible - you cannot run queries from a new model against an index built with an old model. The migration must be atomic from the perspective of the query serving layer.

The safe migration pattern is a dual-index strategy. While the production index (model v1) continues serving traffic, you build a shadow index by re-embedding the entire corpus with model v2. This runs in the background over hours or days. The document registry tracks which model version embedded each document, so the migration job only needs to process documents that are not yet in the v2 index.

Once the shadow index is populated, you run an A/B test: route 5-10% of traffic to the v2 index and measure retrieval quality metrics - nDCG@10, user click-through rate, session engagement, and low-confidence query rate. If v2 outperforms v1, gradually increase the traffic split (10% → 25% → 50% → 100%) over days. If v2 shows regressions, roll back immediately by routing all traffic back to v1.

The v1 index is decommissioned only after v2 is serving 100% of traffic and has been stable for several days. The entire migration is possible with zero downtime because you are always serving from a complete, consistent index - never a partially-migrated one.
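
The gradual traffic split can be made deterministic by hashing a stable identifier into a bucket. This is a sketch with a hypothetical `route_index` helper; keying on a user ID (rather than a request ID) is an assumption that keeps each user on one index for the whole test.

```python
import hashlib


def route_index(user_id: str, v2_percent: int) -> str:
    """Deterministically assign a user to the v1 or v2 index."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "v2" if bucket < v2_percent else "v1"


# Share of 10,000 synthetic users routed to v2 at a 10% split
users = [f"user-{i}" for i in range(10_000)]
v2_share = sum(route_index(u, 10) == "v2" for u in users) / len(users)
```

Raising `v2_percent` from 10 to 25 to 50 only moves users from v1 to v2, never back, and rolling back is a matter of setting it to 0.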

Q3: Walk me through how you would debug a sudden drop in semantic search quality.

Answer:

Start with signals, not speculation. Pull query metrics from the last 24-48 hours: average top-1 cosine similarity score, low-confidence query rate, and distribution shift score from the embedding quality monitor. A sudden drop in average score with stable latency and no infrastructure changes points to a content or model issue rather than infrastructure.

Check the document registry for any recent ingestion activity. If a large batch of documents was recently indexed, inspect them for domain mismatch - content from a different topic area than the bulk of your corpus will cause retrieval confusion. Also check whether all documents in that batch were embedded with the same model version.

Run the cluster coherence metric on a sample of recently-indexed embeddings vs. the baseline. A coherence drop below 0.4 confirms that the new content is geometrically scattered relative to the existing index - the model's embedding space does not capture the distinction your users need.

Check the embedding cache for version key collisions. If the model was upgraded without bumping the model version string in cache keys, cached embeddings from the old model may be served for new queries - causing invisible version mixing.

Once the root cause is identified: if it is content mismatch, consider domain-specific fine-tuning or a separate index for the new content type. If it is model version mixing, purge the affected vectors from the index and re-embed them with the correct model. If it is cache key collision, invalidate the affected cache keys and re-embed.

Q4: How do you control embedding costs at scale without sacrificing quality?

Answer:

Embedding cost optimization has five main levers, in order of impact:

First, aggressive caching. Embeddings are effectively deterministic - same text, same model, same vector. Cache hit rates of 70-90% are achievable in most production systems. A corpus of 10M documents costs about $650 to embed once (which corresponds to roughly 500 tokens per document at $0.13/1M tokens) but essentially zero to re-serve from cache. Use a two-level cache: in-memory LRU for hot keys, Redis for warm keys, with a TTL of 30 days minimum. Key the cache on SHA256 of (model_version + text) to avoid serving stale embeddings after model upgrades.

Second, tiered model selection. Use the smallest model that meets quality requirements for each use case. User queries are typically 10-50 tokens - text-embedding-3-small at $0.02/1M tokens is 6.5x cheaper than text-embedding-3-large, often with minimal quality difference for short texts. Because a query must be embedded with the same model as the index it searches, choose the tier per index or use case rather than per request. Reserve expensive models for long documents and critical applications.

Third, maximum batching. OpenAI accepts up to 2,048 texts per API call. Single-text calls add ~50ms overhead per call. Batch ingestion documents and process them with a worker pool that accumulates texts until the batch is full or a timeout expires (100ms max wait).

Fourth, Matryoshka dimension reduction. text-embedding-3 models support the dimensions parameter, which truncates to a smaller embedding while maintaining quality better than post-hoc PCA. Using 256 dimensions instead of 1536 reduces storage and ANN computation by 6x with typically less than 5% nDCG loss on most benchmarks.

Fifth, binary quantization in the vector store. This reduces index size by 32x (float32 → 1 bit per dimension) with 5-10% accuracy loss, recoverable by rescoring the top candidates with full vectors. Qdrant's oversampling parameter (set to 3.0) fetches 3x candidates from the binary index before rescoring with full vectors, recovering most accuracy.
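
The two-stage pattern from the fifth lever (binary candidates, then full-precision rescoring) can be illustrated in plain NumPy. This is a sketch of the idea only, not how Qdrant implements it; the function names are hypothetical and the data is random.

```python
import numpy as np


def binarize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=1)


def hamming_distances(query_bits: np.ndarray, index_bits: np.ndarray) -> np.ndarray:
    """Number of differing bits between the query and every indexed vector."""
    xor = np.bitwise_xor(index_bits, query_bits)
    return np.unpackbits(xor, axis=1).sum(axis=1)


def search_with_rescore(query, vectors, index_bits, top_k=5, oversampling=3.0):
    # Stage 1: cheap candidate selection on the binary codes
    n_candidates = min(len(vectors), int(top_k * oversampling))
    query_bits = binarize(query[None, :])
    candidates = np.argsort(hamming_distances(query_bits, index_bits))[:n_candidates]
    # Stage 2: rescore only the candidates with full-precision dot products
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:top_k]
    return candidates[order], scores[order]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 128)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length
index_bits = binarize(vectors)

query = vectors[42]  # a stored vector should retrieve itself first
ids, scores = search_with_rescore(query, vectors, index_bits)
```

Stage 1 touches 16 bytes per vector here instead of 512, and stage 2 reads full vectors for only `top_k * oversampling` candidates.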

Q5: What is the difference between incremental indexing and full re-indexing, and when do you choose each?

Answer:

Incremental indexing means only processing documents that are new or changed since the last indexing run. Full re-indexing means reprocessing the entire corpus from scratch.

Incremental indexing is the default operational mode. A document registry stores (document_id, content_hash, model_version, embedding_ids). On each ingestion run, you compute the content hash of incoming documents and compare against the registry. Only documents with a new hash or a stale model version are re-embedded. This reduces cost dramatically - a corpus of 10M documents that sees 0.1% daily changes only re-embeds 10,000 documents per day instead of 10 million.

Full re-indexing is necessary in three scenarios: model version upgrade (all vectors must be regenerated in the new model's geometric space), chunking strategy change (existing chunks are structurally incompatible with new chunks), or catastrophic index corruption. In these cases, the dual-index migration pattern applies - build the new index in the background while the old index continues serving traffic.

The practical engineering challenge with incremental indexing is handling deletions. When a document is deleted, you need to remove its chunk embeddings from the vector store and its entries from the document registry. This requires soft-delete patterns in the ingestion pipeline - mark documents as deleted in a staging area, then run a cleanup job that queries the registry, retrieves embedding IDs, deletes them from the vector store, and removes the registry entry. Without explicit deletion handling, deleted documents remain searchable indefinitely - a subtle quality bug.
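
The registry comparison and the deletion handling described in this answer can be sketched as a single planning step. The names are hypothetical, the registry is modeled as an in-memory dict, and `incoming` is assumed to be a complete snapshot of the source, so anything missing from it counts as deleted.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_indexing(
    incoming: Dict[str, str],              # document_id -> current text
    registry: Dict[str, Tuple[str, str]],  # document_id -> (content_hash, model_version)
    model_version: str,
) -> Tuple[List[str], List[str]]:
    """Return (document IDs to embed, document IDs to delete)."""
    to_embed = [
        doc_id
        for doc_id, text in incoming.items()
        # New document, changed content, or stale model version
        if registry.get(doc_id) != (content_hash(text), model_version)
    ]
    # In the registry but no longer in the source: remove vectors and entry
    to_delete = [doc_id for doc_id in registry if doc_id not in incoming]
    return to_embed, to_delete


registry = {
    "a": (content_hash("unchanged"), "v2"),
    "b": (content_hash("old text"), "v2"),
    "c": (content_hash("same text"), "v1"),  # embedded with an older model
    "d": (content_hash("removed"), "v2"),
}
incoming = {"a": "unchanged", "b": "new text", "c": "same text", "e": "brand new"}
to_embed, to_delete = plan_indexing(incoming, registry, model_version="v2")
```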

Q6: How do you design an embedding system for multi-tenant applications?

Answer:

Multi-tenancy in embedding systems has two main design choices: shared index with filtering, or separate index per tenant.

Shared index with tenant filtering is simpler operationally. You store a tenant_id field in each vector's payload. At query time, you add a filter condition that restricts results to the querying tenant's documents. Qdrant and Weaviate both support payload filtering that runs before or alongside ANN search. The downside: tenants with very large corpora can skew the index topology, and a single slow tenant query can affect others (noisy neighbor problem). Also, if one tenant's data is compromised, all tenants' data is in the same store.

Separate index per tenant gives perfect isolation - each tenant has their own Qdrant collection or Weaviate class. Queries are routed to the tenant's collection with no filter overhead. The tradeoff: collection management becomes complex at scale (10,000 tenants = 10,000 collections), and small tenants waste resources (Qdrant HNSW indexes have minimum memory overhead regardless of data size).

A hybrid approach: use separate collections for large tenants (above a threshold like 100K documents) and a shared collection with filtering for small tenants. The routing logic checks tenant size in a metadata store and directs the query to the appropriate index. This balances isolation for high-value tenants against operational simplicity for long-tail tenants.

For caching in multi-tenant systems, include tenant_id in the cache key alongside model version and text. This prevents cross-tenant cache pollution and allows per-tenant cache eviction when a tenant's data is deleted.
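
The hybrid routing and the tenant-scoped cache key can be sketched together. The collection names, the helper names, and the key format are assumptions for illustration; the returned filter dict is meant to be passed as the `filter_conditions` argument of a search call.

```python
import hashlib
from typing import Dict, Tuple

LARGE_TENANT_THRESHOLD = 100_000        # documents, as suggested in the text
SHARED_COLLECTION = "shared_tenants"    # hypothetical collection name


def route_tenant(tenant_id: str, doc_count: int) -> Tuple[str, Dict[str, str]]:
    """Return (collection name, filter conditions) for a tenant's query."""
    if doc_count > LARGE_TENANT_THRESHOLD:
        # Dedicated collection: isolation comes from the collection itself
        return f"tenant_{tenant_id}", {}
    # Shared collection: isolation depends on the tenant_id payload filter
    return SHARED_COLLECTION, {"tenant_id": tenant_id}


def tenant_cache_key(tenant_id: str, model_version: str, text: str) -> str:
    """Cache key scoped by tenant and model version."""
    payload = "\x00".join((tenant_id, model_version, text)).encode("utf-8")
    # The readable tenant prefix allows per-tenant eviction by key prefix
    return f"emb:{tenant_id}:" + hashlib.sha256(payload).hexdigest()


big = route_tenant("acme", 2_500_000)
small = route_tenant("tiny-co", 1_200)
```

In the shared collection, forgetting the filter leaks results across tenants, so returning it from the router rather than building it at each call site keeps that rule in one place.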

Module Summary

You have now completed the Embeddings Engineering module. The journey covered:

- What embeddings are - dense vector representations that capture semantic meaning through contrastive learning
- Embedding model landscape - SBERT, E5, BGE, GTE and how to select between them using MTEB
- API embeddings - OpenAI, Voyage AI, Cohere, with practical batch processing and cost control
- Fine-tuning - triplet loss, hard negative mining, synthetic data generation with GPL and TSDAE
- Matryoshka embeddings - nested representations enabling adaptive precision retrieval
- Evaluation - nDCG@10, Recall@K, MRR, and building domain-specific evaluation pipelines
- Quantization - float16, int8, and binary quantization with the two-stage rescore pattern
- Multimodal embeddings - CLIP, SigLIP, ImageBind, ColPali, and cross-modal search
- Production systems - the complete pipeline from ingestion to query, with caching, incremental indexing, staleness management, vector DB selection, monitoring, and cost optimization

The gap between understanding embeddings and running them reliably in production is wide. The engineers who close that gap - who instrument for drift, design idempotent pipelines, handle model migrations gracefully, and keep costs proportional to value - are the ones building systems that survive contact with real users.

