Embedding Models Deep Dive

The Model Choice That Determines Your Retrieval Ceiling

The team had built a solid RAG pipeline. Their customers were generating contracts, and the system was supposed to help them query those contracts intelligently. After launch, the most common complaint was specific: the system reliably failed to retrieve the relevant clause when users asked about "termination for convenience." The clause was there, in every contract. The embedding model was just failing to map the user query to the contract language.

The root cause took a week to diagnose. The user said "termination for convenience." The contract said "unilateral right of withdrawal." Same concept, completely different vocabulary. A keyword search would have also failed. But the embedding model they'd chosen - text-embedding-ada-002 - was not trained on legal domain text. Its representation space had no learned notion that these phrases were semantically equivalent in a legal context. The embedding vectors were pointing in different directions.

Switching to a model fine-tuned on legal text (trained on case law and contract corpora) reduced the failure rate on this class of queries by 70%. The architecture didn't change. The chunking didn't change. The retrieval algorithm didn't change. The embedding model changed.

This lesson is about understanding what makes an embedding model good for retrieval, how to evaluate them, and how to make the right choice for your specific domain and constraints.

Why This Exists: The Vector Representation Problem

An embedding model solves a deceptively hard problem: map arbitrary text to a fixed-dimensional dense vector such that semantically similar texts have high cosine similarity.

"High cosine similarity for similar texts" sounds straightforward. But it requires the model to have learned:

Synonymy: "automobile" and "car" should be close
Paraphrase equivalence: "how do I reset my password" and "I forgot my login credentials" should be close
Cross-domain equivalence: "myocardial infarction" and "heart attack" should be close in medical contexts
Relevance asymmetry: a technical document about "photosynthesis" should be close to the query "how do plants make food" even though no words overlap

This is why you can't just use word frequency vectors (TF-IDF) for semantic search. TF-IDF has zero notion of synonymy - it operates purely on vocabulary overlap.
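To make the vocabulary-overlap limitation concrete, here is a minimal sketch that scores two sentence pairs with cosine similarity over raw term counts. It is a simplified stand-in for TF-IDF (no IDF weighting), and the function name and example sentences are illustrative assumptions. IDF weighting only rescales terms; it cannot create overlap where none exists.

```python
import math
from collections import Counter

def term_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-count vectors (no IDF weighting)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Same meaning, no shared tokens: the score is exactly zero
synonym_score = term_cosine("automobile repair shop", "car mechanic garage")

# Unrelated meaning, one shared token: the score is non-zero
lexical_score = term_cosine("apple pie recipe", "apple stock price")
```

The synonym pair scores zero while the unrelated pair scores above zero, which is the opposite of what semantic search needs.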

How Embedding Models Are Trained

Most modern text embedding models start as pretrained language models (BERT, RoBERTa, or a decoder-only transformer) and are then fine-tuned with a contrastive objective.

The core training signal: given a query $q$ and a positive passage $p^+$ (a passage that answers $q$ ), train the model such that: $\text{sim}(E(q), E(p^+)) \gg \text{sim}(E(q), E(p^-))$ where $p^-$ is a negative passage (irrelevant to $q$ ) and $\text{sim}$ is cosine similarity.

The most widely used loss function is the InfoNCE contrastive loss (closely related to NT-Xent): $\mathcal{L} = -\log \frac{\exp(\text{sim}(q, p^+) / \tau)}{\sum_{j} \exp(\text{sim}(q, p_j) / \tau)}$

where $\tau$ is a temperature parameter and the denominator sums over the positive and all in-batch negatives.
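The loss above can be sketched for a single query in a few lines. This is a minimal illustration, assuming the similarities have already been computed; the function name and the default temperature of 0.05 are assumptions for the example, not values taken from any particular model.

```python
import math
from typing import List

def info_nce(sim_pos: float, sim_negs: List[float], tau: float = 0.05) -> float:
    """InfoNCE loss for one query: negative log-softmax of the positive's scaled similarity."""
    # The denominator covers the positive and every negative
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# A well-separated positive gives a small loss
easy = info_nce(0.9, [0.1, 0.2, 0.0])

# A hard negative that scores close to the positive gives a larger loss
hard = info_nce(0.9, [0.85, 0.2, 0.0])
```

Lowering $\tau$ sharpens the softmax, so near-miss negatives contribute more to the loss.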

Hard negative mining is critical for model quality: easy negatives (completely off-topic passages) don't teach the model fine-grained distinctions. The best training datasets include hard negatives - passages that are topically related but not actually relevant to the query. Mining these from BM25 top results (which are lexically similar but may be semantically off) is the standard approach.

The MTEB Benchmark: The Definitive Evaluation

Massive Text Embedding Benchmark (MTEB) - released by HuggingFace in 2022 - is the standard leaderboard for comparing embedding models across retrieval and other tasks. When choosing an embedding model, always check MTEB first.

MTEB covers 8 task categories:

Retrieval (most relevant for RAG) - 15 datasets, measured by NDCG@10
Reranking - 4 datasets
Clustering - 11 datasets
Classification - 12 datasets
STS (Semantic Textual Similarity) - 10 datasets
Summarization - 1 dataset
Bitext Mining - 15 datasets
Pair Classification - 3 datasets

For RAG, focus on the Retrieval category. NDCG@10 (Normalized Discounted Cumulative Gain at 10) measures whether the most relevant documents appear in the top 10 results, with higher positions rewarded more.
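For intuition, NDCG@10 can be computed by hand. The sketch below assumes `ranked_relevances` holds the relevance grade of every judged document for one query, in the order the system ranked them, and it uses linear gain (some implementations use $2^{rel} - 1$ instead). The function names are illustrative.

```python
import math
from typing import List

def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain: each grade is divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: List[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# The single relevant document is ranked first: perfect score
best = ndcg_at_k([1, 0, 0, 0])

# The same document ranked second: discounted by 1 / log2(3)
second = ndcg_at_k([0, 1, 0, 0])
```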

:::tip MTEB Leaderboard URL
Check the live leaderboard at huggingface.co/spaces/mteb/leaderboard - it's updated regularly as new models are released. Sort by "Retrieval (Avg)" for RAG use cases.
:::

Model Families

OpenAI text-embedding-3

OpenAI's third-generation embedding models (released January 2024):

text-embedding-3-small: 1536 dimensions, $0.02/1M tokens. Strong general-purpose retrieval. The best value model for most applications.
text-embedding-3-large: 3072 dimensions, $0.13/1M tokens. Higher quality, especially on complex retrieval tasks. 6.5x more expensive.

Key features:

Matryoshka representation learning: can truncate to any dimension without retraining (see below)
Strong multilingual performance
Trained on web-scale diverse corpora
No explicit domain adaptation

When to use: general-purpose RAG on English or multilingual text where you don't need domain specialization and want the simplicity of a managed API.

E5 Family (Microsoft)

E5 (EmbEddings from bidirEctional Encoder rEpresentations) - a family from Microsoft Research:

e5-small-v2, e5-base-v2, e5-large-v2: varying size/quality trade-offs
multilingual-e5-large: 560M parameter multilingual model, excellent cross-lingual retrieval
e5-mistral-7b-instruct: uses Mistral-7B as the backbone, very strong performance, heavy resource requirement

Training approach: E5 uses a weakly supervised training stage (contrastive learning on web-crawled query-passage pairs) followed by fine-tuning on curated retrieval datasets. The e5-mistral-7b-instruct paper showed that large decoder-based LLMs make excellent embedding models when properly fine-tuned.

Query prefix convention: E5 requires prefixing queries with "query: " and passages with "passage: ". Skipping this degrades performance significantly.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# IMPORTANT: E5 requires task-specific prefixes
query = "query: What are the termination provisions?"
passages = [
    "passage: Either party may terminate this agreement with 30 days written notice.",
    "passage: The liability cap shall not exceed $1 million USD.",
]

query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity
scores = query_emb @ passage_embs.T
print(scores)  # the termination passage should score higher (exact values vary by model version)

BGE Family (BAAI)

BGE (BAAI General Embedding) from the Beijing Academy of Artificial Intelligence:

BAAI/bge-large-en-v1.5: 335M parameters, ranked near the top of MTEB retrieval at release
BAAI/bge-m3: flagship multilingual model, supports 100+ languages, handles up to 8192 tokens

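BGE's prefix convention differs from E5's: the v1.5 English models expect the instruction "Represent this sentence for searching relevant passages: " on queries only, with no prefix on passages. Check the model card for each variant, since bge-m3 is designed to work without an instruction. BGE models are released under the MIT license - free for commercial use with no API costs.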

GTE Family (Alibaba)

GTE (Generalized Text Embeddings) from Alibaba DAMO Academy:

thenlper/gte-large: 335M parameters, competitive with much larger models
Alibaba-NLP/gte-Qwen2-7B-instruct: Qwen2-7B backbone, state-of-the-art performance

Cohere Embed v3

Cohere's embedding API adapts the embedding to the task through an explicit input type:

import cohere

co = cohere.Client("your-api-key")

# Cohere Embed v3 supports explicit input types
# This dramatically improves retrieval quality
query_embedding = co.embed(
    texts=["What is the refund policy?"],
    model="embed-english-v3.0",
    input_type="search_query",   # optimized for short queries
).embeddings

doc_embedding = co.embed(
    texts=["Refund requests must be submitted within 30 days..."],
    model="embed-english-v3.0",
    input_type="search_document",  # optimized for passage retrieval
).embeddings

The input_type distinction matters: queries are short and interrogative; documents are longer and declarative. Cohere trains separate representations for each, which is why their model performs well on asymmetric retrieval tasks.

Bi-Encoders vs Cross-Encoders

This is one of the most important architectural distinctions in retrieval.

Bi-Encoders (Dual Encoders)

Both query and document are encoded independently. Similarity is computed from precomputed vectors. Retrieval is $O(\log n)$ with ANN indexing.

$\text{score}(q, d) = \text{cosine}(E_q(q), E_d(d))$

Advantage: Documents can be precomputed and indexed. Query-time cost is a single embedding + ANN lookup - typically well under 100ms, even for indexes with many millions of vectors.

Limitation: No interaction between query and document during encoding. The model must capture all relevance information in independent vectors. Complex relevance relationships (where understanding query requires understanding document simultaneously) are approximated, not captured exactly.

Cross-Encoders

Query and document are concatenated and processed together. The model outputs a relevance score directly.

$\text{score}(q, d) = \text{CrossEncoder}([q; d])$

Advantage: Full cross-attention between query and document. Far more accurate relevance scoring - the model can attend to query terms when processing document content and vice versa.

Limitation: Cannot be precomputed. Must process every (query, document) pair at query time - $O(n)$ complexity. At n=100K documents, this takes minutes per query. Only feasible as a reranker over a small candidate set (top 50-100 from a bi-encoder first stage).
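The two architectures combine naturally into a retrieve-then-rerank pipeline. The sketch below is a minimal illustration under stated assumptions: document vectors are already L2-normalized, the first stage is a brute-force dot product standing in for an ANN index, and `rerank_score` is a hypothetical callable standing in for a cross-encoder that scores one candidate document ID at a time.

```python
import numpy as np
from typing import Callable, List

def retrieve_then_rerank(
    query_vec: np.ndarray,
    doc_vecs: np.ndarray,
    rerank_score: Callable[[int], float],
    first_stage_k: int = 100,
    final_k: int = 5,
) -> List[int]:
    """Return document indices: cheap bi-encoder shortlist, then expensive rerank."""
    # Stage 1: bi-encoder similarity over all documents (ANN index in production)
    scores = doc_vecs @ query_vec
    k = min(first_stage_k, len(scores))
    candidates = np.argsort(-scores)[:k].tolist()
    # Stage 2: the expensive scorer only ever sees the shortlist
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:final_k]

# Toy data: four 2-d document vectors and a query along the first axis
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
toy_rerank = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.0}  # hypothetical cross-encoder scores
result = retrieve_then_rerank(query, docs, toy_rerank.get, first_stage_k=2, final_k=2)
```

Document 2 has the highest reranker score in the toy data but never reaches the reranker, because the first stage did not shortlist it. The first stage therefore sets the recall ceiling for the whole pipeline.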

Embedding Dimensions and Quality

More dimensions generally means more capacity to encode nuanced semantic distinctions. But the returns are diminishing, and higher dimensions have real costs:

Dimensions	Typical Use Case	Memory per 1M vectors	Index build time
384	Mobile, edge, low-cost retrieval	1.5 GB	Fast
768	General production RAG	3 GB	Moderate
1536	High-quality production RAG	6 GB	Moderate
3072	Maximum quality, cost-insensitive	12 GB	Slow

Memory cost matters: 1M vectors at 1536 dimensions (float32) requires 6 GB of RAM in the index. At 10M vectors, that's 60 GB - you'll need quantization (IVF-PQ) or a distributed index.
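The figures in the table follow from simple arithmetic, sketched below. The helper name is illustrative, and the estimate covers raw vector storage only; real indexes add overhead for graph links or inverted lists, so treat the result as a lower bound.

```python
def index_memory_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes); float32 uses 4 bytes per value."""
    return num_vectors * dimensions * bytes_per_value / 1e9

one_million_1536 = index_memory_gb(1_000_000, 1536)      # about 6 GB
ten_million_1536 = index_memory_gb(10_000_000, 1536)     # about 60 GB
ten_million_int8 = index_memory_gb(10_000_000, 1536, 1)  # 8-bit quantization: about 15 GB
```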

Matryoshka Representation Learning (MRL)

A significant advance from Kusupati et al. (2022), now adopted by OpenAI's embedding-3 models. The core insight: train the model such that the first $d$ dimensions of a $D$ -dimensional embedding are themselves a high-quality embedding at dimension $d$ .

This means a 1536-dimensional MRL embedding can be truncated to 256 dimensions and still perform well - there's no need to re-embed if you want a smaller representation.

$\mathcal{L}_{MRL} = \sum_{m \in M} \mathcal{L}(f_{1:m}(x), f_{1:m}(x^+), f_{1:m}(x^-))$

where $M$ is the set of target dimensions (e.g., {64, 128, 256, 512, 1024, 1536}) and $f_{1:m}$ means the first $m$ dimensions of the embedding.

Practical benefit: Store full 1536-dimension vectors. Use 256-dimension truncations for initial ANN retrieval (much faster, lower memory). Use full vectors for final reranking. A two-tier retrieval at lower cost with near-full-dimension quality.

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed_with_dimension(text: str, dimensions: int = 1536) -> np.ndarray:
    """Embed text, optionally truncating to fewer dimensions (MRL)."""
    response = client.embeddings.create(
        model="text-embedding-3-large",  # MRL-enabled
        input=text,
        dimensions=dimensions,  # API supports native truncation
    )
    return np.array(response.data[0].embedding)

# Full quality embedding
full_emb = embed_with_dimension("contract termination provisions", dimensions=3072)
print(f"Full: {full_emb.shape}")  # (3072,)

# Truncated for fast ANN retrieval (12x fewer dimensions than 3072; quality loss is often small)
small_emb = embed_with_dimension("contract termination provisions", dimensions=256)
print(f"Truncated: {small_emb.shape}")  # (256,)
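If you store full-size vectors and truncate locally rather than through the API, the two-tier strategy can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the search is brute force rather than ANN, and the random vectors only exercise the code (real quality after truncation depends on the model having been trained with MRL). Note that truncated vectors must be re-normalized before computing cosine similarity.

```python
import numpy as np

def truncate_and_normalize(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of each row and rescale to unit length."""
    cut = embs[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def two_tier_search(query: np.ndarray, docs: np.ndarray, coarse_dim: int = 256,
                    shortlist: int = 50, top_k: int = 5) -> np.ndarray:
    """Shortlist with truncated vectors, then rerank the shortlist at full dimension."""
    # Tier 1: cheap similarity on the leading dimensions only
    q_small = truncate_and_normalize(query[None, :], coarse_dim)[0]
    d_small = truncate_and_normalize(docs, coarse_dim)
    n = min(shortlist, len(docs))
    candidates = np.argsort(-(d_small @ q_small))[:n]
    # Tier 2: full-dimension cosine similarity on the shortlist
    q_full = query / np.linalg.norm(query)
    d_full = docs[candidates] / np.linalg.norm(docs[candidates], axis=1, keepdims=True)
    order = np.argsort(-(d_full @ q_full))[:top_k]
    return candidates[order]

rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(200, 1536))
hits = two_tier_search(doc_matrix[7], doc_matrix)  # query is an exact copy of document 7
```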

Domain-Specific Embedding Models

General-purpose models struggle with highly specialized vocabulary where:

Technical terms have specific domain meanings (e.g., "discharge" means different things in medical vs electrical vs legal contexts)
Domain-specific equivalences aren't in general training data
The corpus uses specialized abbreviations and conventions

Code embeddings: models such as microsoft/codebert-base, or flax-community/gte-large fine-tuned on code, capture function signatures, variable names, and code structure. General embeddings work for code retrieval, but specialized models typically outperform them, by roughly 10-20% on benchmarks.

Multilingual: intfloat/multilingual-e5-large and BAAI/bge-m3 support 100+ languages with strong cross-lingual retrieval (query in English, find documents in French).

Scientific/medical: SPECTER2 (Allen AI) for scientific papers, BioLORD for biomedical concepts.

Fine-Tuning Embedding Models

When general models underperform on your domain, fine-tuning is the answer. The process:

Build a training dataset: (query, positive passage, hard negatives) triples
Choose a base model: Start from a strong general model like bge-large-en-v1.5
Fine-tune with contrastive loss: Use sentence-transformers training pipeline
Evaluate on held-out retrieval set

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from datasets import Dataset

# Training data: (anchor, positive, negative) triplets
# Hard negatives dramatically improve fine-tuning quality
train_data = [
    {
        "anchor": "What is the notice period for termination?",
        "positive": "Either party may terminate with 30 days written notice.",
        "negative": "Liability is capped at the contract value.",  # hard negative
    },
    # ... hundreds or thousands more examples
]

dataset = Dataset.from_list(train_data)
# Hold out 10% for evaluation (assumes the full dataset, not the single example above)
split = dataset.train_test_split(test_size=0.1, seed=42)

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Uses the explicit negative plus the other in-batch passages as negatives
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./my-legal-embedding-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # requires a CUDA GPU; set to False on CPU
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=200,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],  # required whenever eval_strategy is not "no"
    loss=loss,
)
trainer.train()
model.save_pretrained("./my-legal-embedding-model")

Training data generation with LLMs: If you lack query-passage pairs, use an LLM to generate synthetic training questions from your document corpus. Prompt: "Given the following passage, generate 3 questions that this passage would answer." This is the key technique for domain-specific fine-tuning without labeled data.
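The synthetic-data step can be sketched as follows. The LLM call itself is abstracted behind `generate_questions`, a hypothetical callable that takes a prompt and returns a list of question strings; plug in whichever client you use. The output uses the same "anchor" and "positive" keys as the training example above.

```python
from typing import Callable, Dict, List

PROMPT_TEMPLATE = (
    "Given the following passage, generate 3 questions that this passage would answer.\n\n"
    "Passage:\n{passage}"
)

def build_training_pairs(
    passages: List[str],
    generate_questions: Callable[[str], List[str]],
) -> List[Dict[str, str]]:
    """Turn raw passages into (anchor, positive) pairs via LLM-generated questions."""
    pairs = []
    for passage in passages:
        prompt = PROMPT_TEMPLATE.format(passage=passage)
        for question in generate_questions(prompt):
            question = question.strip()
            if question:  # drop empty lines the LLM may return
                pairs.append({"anchor": question, "positive": passage})
    return pairs

# Stand-in generator for demonstration; a real one would call an LLM
def fake_generator(prompt: str) -> List[str]:
    return ["How much notice is required?", "  Who may terminate?  ", ""]

pairs = build_training_pairs(
    ["Either party may terminate with 30 days written notice."], fake_generator
)
```

Hard negatives are not produced here; they would be mined in a separate pass, for example from BM25 results as described earlier.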

Model Comparison on a Real Retrieval Task

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()

# Test corpus: legal contract sections
corpus = [
    "The agreement may be terminated by either party with 30 days written notice.",
    "All intellectual property created under this contract belongs to the client.",
    "Disputes shall be resolved through binding arbitration in New York.",
    "Contractor is an independent contractor and not an employee of the client.",
    "Payment is due within 30 days of invoice receipt.",
]

# Test queries and expected best match
test_cases = [
    ("How much notice is required to cancel the contract?", 0),
    ("Who owns the work product?", 1),
    ("Where are legal disputes handled?", 2),
    ("What is the employment status of the contractor?", 3),
    ("When must the client pay invoices?", 4),
]

def evaluate_model(model_name: str, encode_fn, query_prefix: str = "", doc_prefix: str = "") -> float:
    """Returns recall@1 (whether top result is correct)."""
    doc_embs = np.array([encode_fn(doc_prefix + doc) for doc in corpus])
    correct = 0
    for query, expected_idx in test_cases:
        q_emb = np.array(encode_fn(query_prefix + query))
        scores = doc_embs @ q_emb / (
            np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q_emb)
        )
        predicted_idx = np.argmax(scores)
        if predicted_idx == expected_idx:
            correct += 1
    recall_at_1 = correct / len(test_cases)
    print(f"{model_name}: Recall@1 = {recall_at_1:.2f} ({correct}/{len(test_cases)})")
    return recall_at_1

# OpenAI embedding-3-small
def openai_embed(text: str) -> list:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

evaluate_model("text-embedding-3-small", openai_embed)

# BGE-large
bge = SentenceTransformer("BAAI/bge-large-en-v1.5")
evaluate_model(
    "bge-large-en-v1.5",
    lambda t: bge.encode(t, normalize_embeddings=True).tolist(),
    query_prefix="Represent this sentence for searching relevant passages: ",
)

# E5-large
e5 = SentenceTransformer("intfloat/e5-large-v2")
evaluate_model(
    "e5-large-v2",
    lambda t: e5.encode(t, normalize_embeddings=True).tolist(),
    query_prefix="query: ",
    doc_prefix="passage: ",
)

Embedding Model Selection Decision Tree

[Figure: decision tree for selecting an embedding model]

Production Engineering Notes

Embedding cache: For repeated queries (common in production), cache embeddings in Redis with a TTL. Query embedding costs $0.02/1M tokens - at 100K queries/day averaging 20 tokens each, that's $0.04/day. Still, caching the top 1000 most frequent queries saves latency.
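A minimal sketch of the caching idea follows. It uses an in-process dictionary as a stand-in for Redis so that it runs without external services; the class name, the key scheme, and the `embed_fn` callable are assumptions for illustration. Including the model name in the key prevents vectors from different models from colliding.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple

class EmbeddingCache:
    """TTL cache keyed on a hash of (model name, text)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], ttl_seconds: float = 3600.0):
        self._embed_fn = embed_fn
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, List[float]]] = {}

    @staticmethod
    def _key(model: str, text: str) -> str:
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, model: str, text: str) -> List[float]:
        key = self._key(model, text)
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: skip the embedding call
        vector = self._embed_fn(text)
        self._store[key] = (now, vector)
        return vector

# Demonstration with a fake embedder that counts how often it is called
calls = []
def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
first = cache.get("text-embedding-3-small", "refund policy")
second = cache.get("text-embedding-3-small", "refund policy")
```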

Batch embedding: Never call the embedding API one chunk at a time. Batch embedding is 10-50x faster. text-embedding-3-small handles up to 2048 inputs per API call.

def batch_embed(texts: list, batch_size: int = 512) -> list:
    """Embed in batches to stay within API limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        all_embeddings.extend([d.embedding for d in response.data])
    return all_embeddings

Normalization: Always L2-normalize embeddings before storing and before similarity computation. Cosine similarity on unnormalized vectors requires two norm computations per comparison. Pre-normalizing reduces this to a single dot product, which is what ANN indexes exploit.
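The effect of skipping normalization is easy to demonstrate. In this small sketch (toy vectors chosen for illustration), the raw dot product prefers the longer vector, while cosine similarity, computed as a dot product of normalized vectors, prefers the better-aligned one.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit length, guarding against zero norms."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # well aligned with the query, small magnitude
    [5.0, 5.0],  # less aligned, large magnitude
])

raw_dot = docs @ query
cosine = l2_normalize(docs) @ l2_normalize(query)

raw_winner = int(np.argmax(raw_dot))    # picks the long vector
cosine_winner = int(np.argmax(cosine))  # picks the aligned vector
```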

Common Mistakes

:::danger Ignoring Model-Specific Prefixes
E5 models require "query: " for queries and "passage: " for documents; BGE v1.5 English models expect the instruction "Represent this sentence for searching relevant passages: " on queries only. Skipping these can reduce retrieval quality by 5-15% compared to correctly prefixed inputs. Always check the model card for the recommended prefix convention.
:::

:::danger Comparing Models Without Domain-Specific Eval
MTEB scores are averages over 15+ diverse retrieval datasets. Your domain may differ significantly. A model that ranks #5 on MTEB may outperform the #1 model on your specific corpus. Always run a held-out eval on 100-200 query-answer pairs from your actual data before committing to a model.
:::

:::warning Not Normalizing Embeddings
Most embedding models output vectors that need L2 normalization for cosine similarity computation. Some models return pre-normalized vectors; many don't. If you're using dot product similarity in your vector DB (common for performance reasons), unnormalized vectors will give incorrect results. Use normalize_embeddings=True in SentenceTransformers or normalize manually.
:::

:::warning Mixing Embedding Models
Never use embedding model A to index documents and embedding model B to embed queries. The vector spaces are incompatible - you'll get random results. If you switch models, you must re-embed the entire corpus. Version your embeddings with model metadata.
:::

Interview Questions and Answers

Q: What is the difference between a bi-encoder and a cross-encoder, and when do you use each?

A: A bi-encoder encodes query and document independently, then computes similarity between the resulting vectors. Documents can be precomputed and indexed - query-time cost is $O(\log n)$ with ANN. A cross-encoder concatenates query and document and processes them together, outputting a relevance score. This allows full cross-attention but requires processing every (query, document) pair at query time - $O(n)$ complexity. In production RAG, the two are typically combined: bi-encoder for first-stage retrieval (fast, returns top 50-100 candidates from millions of documents), cross-encoder for second-stage reranking (accurate, runs on the small candidate set). Running a cross-encoder over all documents is computationally infeasible at scale.

Q: Explain Matryoshka Representation Learning and why it matters for production systems.

A: MRL trains the model with a multi-scale loss that ensures the first $d$ dimensions of a $D$ -dimensional embedding are themselves a useful embedding at dimension $d$ . This is done by applying the contrastive loss at multiple dimension cutoffs simultaneously during training. The result: a single model that produces embeddings that are valid at any dimension truncation. In production, this enables a two-tier strategy: build a fast ANN index using truncated 256-dimension embeddings (6x less storage than 1536 dimensions, with correspondingly cheaper distance computations), then rerank using the full 1536-dimension embeddings. OpenAI's embedding-3 models support this natively via the dimensions API parameter.

Q: Your RAG system works well for English but fails on Spanish queries even though you have Spanish documents. What's wrong and how do you fix it?

A: The embedding model doesn't have adequate multilingual support. A model trained only on English will map English queries and Spanish documents to incompatible regions of the vector space - their cosine similarity won't reflect semantic equivalence. Fix: switch to a multilingual model like multilingual-e5-large or bge-m3, which are trained on parallel multilingual corpora with cross-lingual contrastive objectives. These models learn that "termination of contract" in English and "rescisión del contrato" in Spanish should have high cosine similarity. Re-embed your entire corpus with the new model. Evaluate with cross-lingual query-document pairs from your actual data.

Q: How would you build a training dataset for fine-tuning an embedding model on a specialized domain?

A: The most practical approach when you lack labeled data is LLM-based synthetic data generation. Step 1: take 1000-5000 passages from your document corpus. Step 2: for each passage, prompt an LLM to generate 3-5 questions that this passage directly answers. Step 3: for each (question, passage) positive pair, mine hard negatives - use BM25 to find passages that are lexically similar but semantically different. Step 4: optionally use a cross-encoder to filter out false negatives (passages that actually are relevant but appear as negatives). Step 5: fine-tune a base model (e.g., bge-large-en-v1.5) using MultipleNegativesRankingLoss with this dataset. Typical training data requirements: 10K-100K pairs for meaningful improvement. Evaluate on a held-out set with Recall@1, Recall@5, NDCG@10.

Q: Between text-embedding-3-small and a self-hosted bge-large, which would you choose for a production RAG system and why?

A: It depends on three factors. Latency: the API call adds 20-80ms of network overhead; self-hosted GPU inference is 5-15ms. Cost at scale: at 100M tokens/month, text-embedding-3-small costs $2, while a single A10G GPU at $1/hr costs roughly $720/month if left running and handles roughly 500M tokens/day - the API is cheaper until very high volume (tens of billions of tokens per month at these prices). Compliance: if your documents are confidential (medical records, legal files), sending them to OpenAI's API may violate compliance requirements - self-hosted is mandatory. Privacy-sensitive deployments almost always go self-hosted, with BGE models being a common choice (MIT-licensed, excellent quality, runs on consumer GPUs). For prototypes and moderate-scale systems, the API is simpler and often cost-competitive.
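The cost comparison in this answer is worth working through explicitly. The sketch below uses the prices quoted above ($0.02 per 1M tokens, $1/hr for the GPU) and assumes a GPU that runs continuously for a 30-day month; it ignores engineering and operations cost, which usually favors the API further.

```python
API_PRICE_PER_M_TOKENS = 0.02  # USD, text-embedding-3-small
GPU_PRICE_PER_HOUR = 1.00      # USD, assumed on-demand rate from the answer above
HOURS_PER_MONTH = 24 * 30      # assumes the GPU is never switched off

def monthly_api_cost(tokens_per_month: float) -> float:
    """API spend in USD for a given monthly token volume."""
    return tokens_per_month / 1e6 * API_PRICE_PER_M_TOKENS

gpu_monthly_cost = GPU_PRICE_PER_HOUR * HOURS_PER_MONTH

# Monthly volume at which API spend equals the always-on GPU bill
break_even_tokens = gpu_monthly_cost / API_PRICE_PER_M_TOKENS * 1e6

api_cost_at_100m = monthly_api_cost(100_000_000)
```

At these prices the break-even point is about 36 billion tokens per month, which is why the API wins on cost for all but very high volumes.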

Embedding Model Deployment Patterns

Managed API Deployment

The simplest production pattern: call OpenAI, Cohere, or Voyage AI's embedding API for both indexing and queries.

import openai
import numpy as np
from typing import List
import time

client = openai.OpenAI()

def embed_with_retry(
    texts: List[str],
    model: str = "text-embedding-3-small",
    max_retries: int = 3,
    batch_size: int = 512,
) -> np.ndarray:
    """
    Production-grade embedding with batching, retry, and rate limit handling.
    """
    all_embeddings = []

    for batch_start in range(0, len(texts), batch_size):
        batch = texts[batch_start:batch_start + batch_size]

        for attempt in range(max_retries):
            try:
                response = client.embeddings.create(
                    model=model,
                    input=batch,
                )
                batch_embeddings = [item.embedding for item in response.data]
                all_embeddings.extend(batch_embeddings)
                break  # Success, move to next batch
            except openai.RateLimitError:
                if attempt < max_retries - 1:
                    wait_time = 2 ** attempt  # Exponential backoff
                    print(f"Rate limit hit, waiting {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    raise
            except openai.APIError as e:
                if attempt < max_retries - 1:
                    time.sleep(1)
                else:
                    raise

    embeddings = np.array(all_embeddings, dtype=np.float32)
    # Normalize for cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

Self-Hosted GPU Deployment

For compliance, privacy, or cost-at-scale requirements, deploy embedding models on your own GPU infrastructure:

from sentence_transformers import SentenceTransformer
import torch
import numpy as np
from typing import List

class SelfHostedEmbedder:
    def __init__(
        self,
        model_name: str = "BAAI/bge-large-en-v1.5",
        device: str = None,
        batch_size: int = 128,
    ):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Loading {model_name} on {self.device}...")
        self.model = SentenceTransformer(model_name, device=self.device)
        self.batch_size = batch_size
        self.query_prefix = "Represent this sentence for searching relevant passages: "

    def embed_documents(self, texts: List[str]) -> np.ndarray:
        """Batch embed documents for indexing."""
        embeddings = self.model.encode(
            texts,
            batch_size=self.batch_size,
            normalize_embeddings=True,
            show_progress_bar=True,
        )
        return embeddings

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single query with appropriate prefix."""
        prefixed = self.query_prefix + query
        embedding = self.model.encode(
            [prefixed],
            normalize_embeddings=True,
        )
        return embedding[0]

    def get_throughput_stats(self) -> dict:
        """Benchmark embedding throughput."""
        import time
        test_texts = ["This is a test sentence for benchmarking."] * 512
        t0 = time.time()
        self.embed_documents(test_texts)
        elapsed = time.time() - t0
        return {
            "texts_per_second": 512 / elapsed,
            "ms_per_text": elapsed * 1000 / 512,
            "device": self.device,
        }


# Production deployment: run as a FastAPI service
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
embedder = SelfHostedEmbedder()

class EmbedRequest(BaseModel):
    texts: List[str]
    is_query: bool = False

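@app.post("/embed")
def embed(req: EmbedRequest):
    # Plain `def` so FastAPI runs the blocking encode call in its threadpool
    if req.is_query:
        # Apply the query prefix to every query, not only single-item requests
        embs = [embedder.embed_query(text) for text in req.texts]
        return {"embeddings": [emb.tolist() for emb in embs]}
    embs = embedder.embed_documents(req.texts)
    return {"embeddings": embs.tolist()}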

Embedding Model Versioning and Migration

A critical production concern: what happens when you switch embedding models?

The core problem: All documents in your vector index were embedded with Model A. If you switch to Model B, the query vector (from Model B) is in a different vector space than the document vectors (from Model A). Search results are meaningless.

Migration strategy:

Dual indexing (no downtime): Run two parallel indexes. New documents go into both indexes. As time allows, re-embed old documents into the new index. Gradually shift query traffic to the new index as re-indexing progresses.
Full re-index (with downtime): Take the system offline, re-embed all documents with the new model, rebuild the index, bring it back online. Acceptable for smaller corpora or when a maintenance window is available.
Shadow evaluation (before migration): Before committing to a model switch, run a shadow evaluation: embed your golden dataset queries with both old and new models, measure recall@5 on each. Only migrate if the new model meaningfully outperforms the old one on your specific data.
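The shadow evaluation step can be sketched as a recall@k comparison. This is a minimal illustration, assuming both models' embeddings are L2-normalized, each query has exactly one gold document, and the whole corpus fits in memory for a brute-force search; the function name and toy data are hypothetical.

```python
import numpy as np
from typing import List

def recall_at_k(query_embs: np.ndarray, doc_embs: np.ndarray,
                gold_doc_ids: List[int], k: int = 5) -> float:
    """Fraction of queries whose gold document appears in the top-k results."""
    scores = query_embs @ doc_embs.T               # (num_queries, num_docs)
    top_k = np.argsort(-scores, axis=1)[:, :k]     # best k document indices per query
    hits = [gold in row for gold, row in zip(gold_doc_ids, top_k)]
    return float(np.mean(hits))

# Toy example: three orthogonal documents, two queries
toy_docs = np.eye(3)
toy_queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_gold = [0, 2]  # the second query's gold document is not its nearest neighbor

r1 = recall_at_k(toy_queries, toy_docs, toy_gold, k=1)
r3 = recall_at_k(toy_queries, toy_docs, toy_gold, k=3)
```

Run the same function once with the old model's embeddings and once with the new model's, on the same golden queries, and compare the two numbers before migrating.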

import json
from datetime import datetime, timezone
from pathlib import Path

def verify_embedding_version(vector_store_path: str) -> dict:
    """
    Read the model version metadata from a vector store.
    Always store embedding model metadata alongside your index.
    """
    metadata_path = Path(vector_store_path) / "embedding_metadata.json"
    if not metadata_path.exists():
        return {"warning": "No embedding metadata found - risky to proceed"}

    with open(metadata_path) as f:
        return json.load(f)

def save_embedding_metadata(vector_store_path: str, model_name: str, dim: int):
    """Always save which model was used to build an index."""
    metadata = {
        "model_name": model_name,
        "embedding_dim": dim,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "created_at": datetime.now(timezone.utc).isoformat(),
        "version": "1.0",
    }
    store_dir = Path(vector_store_path)
    store_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    with open(store_dir / "embedding_metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

:::danger Version Mismatch - The Silent Killer
Switching embedding models without re-indexing produces results that look plausible but are subtly wrong - the ANN index returns neighbors in the old embedding space that are not actually semantically similar under the new model. This is one of the hardest bugs to detect because the system continues to return results, just meaningfully worse ones. Always tag your index with the model version and assert at query time that the query model matches the index model.
:::
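The query-time assertion recommended above can be sketched as a small guard. It assumes the metadata dictionary uses the "model_name" and "embedding_dim" keys written by save_embedding_metadata; the function name is hypothetical.

```python
def assert_model_matches_index(index_metadata: dict, query_model_name: str, query_dim: int) -> None:
    """Raise before searching if the query model differs from the model that built the index."""
    indexed_model = index_metadata.get("model_name")
    indexed_dim = index_metadata.get("embedding_dim")
    if indexed_model != query_model_name:
        raise ValueError(
            f"Index was built with {indexed_model!r} but queries use {query_model_name!r}; "
            "re-embed the corpus or switch the query model."
        )
    if indexed_dim != query_dim:
        raise ValueError(
            f"Index stores {indexed_dim}-dimensional vectors but the query has {query_dim}."
        )

# Matching configuration passes silently
example_metadata = {"model_name": "BAAI/bge-large-en-v1.5", "embedding_dim": 1024}
assert_model_matches_index(example_metadata, "BAAI/bge-large-en-v1.5", 1024)
```

Calling this once at service startup, and again whenever the index or model is reloaded, turns a silent quality regression into an immediate, visible failure.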

Embedding Quality Evaluation Checklist

Before deploying an embedding model to production, verify these properties on your specific data:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def evaluate_embedding_quality(
    model,
    positive_pairs: list,    # [(query, relevant_doc), ...]
    negative_pairs: list,    # [(query, irrelevant_doc), ...]
) -> dict:
    """
    Evaluate embedding model quality on your domain data.
    positive_pairs: query-document pairs that should be similar
    negative_pairs: query-document pairs that should NOT be similar
    """
    pos_scores = []
    for query, pos_doc in positive_pairs:
        q_emb = model.encode([query], normalize_embeddings=True)
        d_emb = model.encode([pos_doc], normalize_embeddings=True)
        score = float(cosine_similarity(q_emb, d_emb)[0][0])
        pos_scores.append(score)

    neg_scores = []
    for query, neg_doc in negative_pairs:
        q_emb = model.encode([query], normalize_embeddings=True)
        d_emb = model.encode([neg_doc], normalize_embeddings=True)
        score = float(cosine_similarity(q_emb, d_emb)[0][0])
        neg_scores.append(score)

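    # Gap: positive scores should be significantly higher than negative
    gap = float(np.mean(pos_scores) - np.mean(neg_scores))

    # Reported as recall_at_1, but strictly a pairwise ranking accuracy:
    # pair i of positive_pairs is compared with pair i of negative_pairs,
    # so both lists must be aligned on the same query and have equal length.
    if len(pos_scores) != len(neg_scores):
        raise ValueError("positive_pairs and negative_pairs must have the same length")
    recall_at_1 = sum(
        p > n for p, n in zip(pos_scores, neg_scores)
    ) / len(pos_scores)

    return {
        "mean_positive_similarity": float(np.mean(pos_scores)),
        "mean_negative_similarity": float(np.mean(neg_scores)),
        "similarity_gap": gap,
        "recall_at_1": recall_at_1,
        "recommendation": "Good" if gap > 0.2 and recall_at_1 > 0.8 else "Needs improvement",
    }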


# Example evaluation
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Your domain-specific test pairs
positive_pairs = [
    ("What is the refund window?", "Unused items can be returned within 30 days."),
    ("How long does shipping take?", "Standard shipping takes 3-5 business days."),
]
negative_pairs = [
    ("What is the refund window?", "Express shipping delivers in 1-2 days."),
    ("How long does shipping take?", "Items must be in original packaging for returns."),
]

results = evaluate_embedding_quality(model, positive_pairs, negative_pairs)
print(results)

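As a rule of thumb, a similarity gap above 0.2 and recall@1 above 0.8 on your domain pairs suggests the model is well-suited for your use case. Treat both thresholds as starting points rather than guarantees: absolute cosine similarities vary from model to model, and the two pairs shown here are only illustrative, so run the check on a much larger set drawn from your own data. Below these thresholds, consider domain-specific fine-tuning or a different base model.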
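The check above compares each positive against a single negative. In retrieval, the relevant document competes with the whole candidate pool, so a complementary measurement is recall@k over a corpus. The sketch below assumes embeddings are already computed and unit-normalized (so a dot product equals cosine similarity) and passed in as plain lists; the function name `recall_at_k` and its arguments are illustrative.

```python
def recall_at_k(query_vecs, doc_vecs, relevant_idx, k=5):
    """Fraction of queries whose relevant document is among the k highest-scoring documents.

    query_vecs:   one unit-normalized vector per query
    doc_vecs:     one unit-normalized vector per document in the pool
    relevant_idx: relevant_idx[i] is the index in doc_vecs of the document
                  that answers query i (assumes one relevant document per query)
    """
    hits = 0
    for q, rel in zip(query_vecs, relevant_idx):
        # Dot product against every document in the pool
        scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        # Indices of the k best-scoring documents
        top_k = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]
        hits += rel in top_k
    return hits / len(query_vecs)


# Toy 2-dimensional example: three documents, two queries
docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
queries = [[1.0, 0.0], [0.0, 1.0]]
relevant = [0, 2]  # query 0 -> doc 0, query 1 -> doc 2

print(recall_at_k(queries, docs, relevant, k=1))  # 0.5
print(recall_at_k(queries, docs, relevant, k=2))  # 1.0
```

For a real corpus you would replace the inner loop with a matrix product or an ANN query; the brute-force version is only meant to make the definition explicit.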

Embedding Versioning and Migration

A production RAG system will eventually need to change embedding models - because a better model is released, because the old model is deprecated, or because domain fine-tuning has improved quality. This is non-trivial: all embeddings in your vector database were created with the old model, and old and new embeddings live in different geometric spaces, so you cannot mix and match them in one index.

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
import numpy as np

class EmbeddingMigrator:
    """
    Handles zero-downtime migration from one embedding model to another.
    Strategy: maintain two collections, migrate in background, cut over atomically.
    """

    def __init__(self, qdrant_client: QdrantClient):
        self.client = qdrant_client

    def migrate_collection(
        self,
        source_collection: str,
        target_collection: str,
        old_embedder,
        new_embedder,
        batch_size: int = 100,
    ) -> dict:
        """
        Re-embed all documents from source collection into target collection.
        Step 1: Create target collection with new vector dimension.
        Step 2: Scroll source collection, re-embed, upsert to target.
        Step 3: (caller responsibility) cut traffic over to target collection.
        """
        # Get source collection info
        source_info = self.client.get_collection(source_collection)
        total_points = source_info.points_count

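        # Create target collection with new embedding dimension
        sample_embedding = new_embedder.embed(["test"])[0]
        new_dim = len(sample_embedding)

        # recreate_collection is deprecated in recent qdrant-client releases and
        # drops any existing target, so create the collection only if it is missing.
        # Upserts below reuse the source IDs, which makes a re-run safe.
        if not self.client.collection_exists(target_collection):
            self.client.create_collection(
                collection_name=target_collection,
                vectors_config=VectorParams(size=new_dim, distance=Distance.COSINE),
            )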

        migrated = 0
        offset = None

        while True:
            # Scroll a batch of records from the source
            records, next_offset = self.client.scroll(
                collection_name=source_collection,
                offset=offset,
                limit=batch_size,
                with_vectors=False,  # Don't fetch old vectors - we'll re-embed text
                with_payload=True,
            )

            if not records:
                break

            # Re-embed the text content
            texts = [r.payload.get("text", "") for r in records]
            new_vectors = new_embedder.embed(texts)

            # Upsert to target collection with same IDs and payload
            from qdrant_client.models import PointStruct
            points = [
                PointStruct(id=r.id, vector=vec.tolist(), payload=r.payload)
                for r, vec in zip(records, new_vectors)
            ]
            self.client.upsert(collection_name=target_collection, points=points)

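            migrated += len(records)
            # points_count may be None or approximate, so guard the percentage
            if total_points:
                print(f"Migrated {migrated}/{total_points} ({100*migrated/total_points:.1f}%)")
            else:
                print(f"Migrated {migrated} points")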

            offset = next_offset
            if next_offset is None:
                break

        return {
            "migrated": migrated,
            "source": source_collection,
            "target": target_collection,
            "new_dimension": new_dim,
        }

The zero-downtime migration strategy: (1) continue serving the old collection while migration runs in the background; (2) once migration is complete, run your golden dataset eval on the new collection and compare recall@5; (3) only if the new model improves quality, atomically flip the collection pointer (a single config change) and deprecate the old collection after one week of stable operation. Never delete the old collection immediately - you need it for rollback.
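The gate-then-flip step can be sketched as below. This is one possible shape, assuming the "collection pointer" is a small JSON file read by the query service; the file layout, the `flip_active_collection` name, and the recall inputs are all hypothetical. Qdrant also provides collection aliases, which can play the same role inside the database.

```python
import json
import os
import tempfile
from pathlib import Path


def flip_active_collection(pointer_path, new_collection, old_recall_at_5, new_recall_at_5):
    """Point traffic at new_collection only if it beat the old one on the golden set.

    Returns True if the pointer was flipped, False if the old collection stays active.
    The previous collection name is kept in the pointer file to support rollback.
    """
    if new_recall_at_5 <= old_recall_at_5:
        return False  # no improvement: keep serving the old collection

    pointer_path = Path(pointer_path)
    previous = None
    if pointer_path.exists():
        previous = json.loads(pointer_path.read_text())["active"]

    # Write to a temp file in the same directory, then rename over the pointer.
    # os.replace is atomic on the same filesystem, so readers never see a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=pointer_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"active": new_collection, "previous": previous}, f)
    os.replace(tmp_name, pointer_path)
    return True
```

Rolling back is the same operation in reverse: write the `previous` name back as `active`, which is why the old collection must still exist.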

Interview Questions and Answers

Q: What is the MTEB benchmark and why does it matter for embedding model selection?

A: The Massive Text Embedding Benchmark (Muennighoff et al., 2022) is the standard evaluation suite for embedding models. Its English benchmark covers 56 datasets across 8 task categories: retrieval, reranking, classification, clustering, pair classification, summarization, semantic textual similarity (STS), and bitext mining. Each category tests a different aspect of embedding quality. A model that tops MTEB overall may be weak on the retrieval sub-tasks, which are the most relevant for RAG. When evaluating models for RAG specifically, look at retrieval task performance: BEIR benchmark scores (NDCG@10 across 18 diverse retrieval datasets). MTEB matters because it provides a standardized, reproducible comparison - but it doesn't substitute for evaluating on your own data. A model ranked 8th on MTEB that was trained on domain-similar data will often outperform the top-ranked model on your specific queries.

Q: Explain the difference between bi-encoders and cross-encoders. When would you use each in a RAG system?

A: Bi-encoders encode the query and document independently into fixed-size vectors. At query time, you embed the query and do an ANN search over pre-computed document embeddings. This is fast (milliseconds) because document embeddings are computed once offline. The limitation: because query and document are encoded independently, the model cannot directly compare them - it can only measure cosine distance between independent representations, which loses fine-grained interaction signals. Cross-encoders take the concatenated query and document as a single input and output a relevance score. This full cross-attention between query and document captures nuanced relevance. The cost: cross-encoders must score each query-document pair individually at query time - you can't pre-compute document embeddings. This means they're 100-1000x slower and can only score tens to hundreds of documents per second, not millions. The standard RAG architecture uses both: bi-encoder for first-stage ANN retrieval (fast, retrieve top-100 candidates), cross-encoder for second-stage reranking (accurate, rerank top-100 to top-5). This two-stage design gets most of the accuracy benefit of cross-encoders while remaining fast enough for real-time queries.
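The two-stage design can be sketched independently of any particular model. In the example below, `bi_score` stands in for the bi-encoder similarity (in production this stage is an ANN lookup over pre-computed vectors, not a scan) and `cross_score` stands in for a cross-encoder; both toy scorers are assumptions made for illustration, not real models.

```python
from typing import Callable, List, Sequence


def two_stage_search(
    query: str,
    docs: Sequence[str],
    bi_score: Callable[[str, str], float],
    cross_score: Callable[[str, str], float],
    first_k: int = 100,
    final_k: int = 5,
) -> List[str]:
    """Retrieve first_k candidates with a cheap scorer, then rerank them with an expensive one."""
    # Stage 1: cheap score for every document (stands in for ANN search)
    candidates = sorted(docs, key=lambda d: bi_score(query, d), reverse=True)[:first_k]
    # Stage 2: expensive pairwise score, applied to the shortlist only
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:final_k]


# Toy scorers for illustration only
def token_overlap(query: str, doc: str) -> float:
    """Cheap stand-in for bi-encoder similarity: number of shared tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: rewards the exact query phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "refund policy for unused items",
    "shipping takes five days",
    "refund window is 30 days",
    "contact support by email",
]
print(two_stage_search("refund window", corpus, token_overlap, phrase_match, first_k=2, final_k=1))
```

The expensive scorer runs `first_k` times per query regardless of corpus size, which is what keeps reranking latency bounded.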

Q: What is Matryoshka Representation Learning and what practical benefit does it offer?

A: Matryoshka Representation Learning (Kusupati et al., 2022) - named after Russian nesting dolls - trains embedding models so that the first $d$ dimensions of a full-size embedding (say, 1536 dimensions) are themselves a high-quality $d$-dimensional embedding. The model explicitly optimizes embedding quality at several nested sizes simultaneously during training, for example 1536, 768, 512, 256, 128, and 64. The practical benefit: you can shorten embeddings to match your latency/cost/quality trade-off, as long as the index and the queries use the same length. For a use case where you need fast approximate retrieval with slightly lower accuracy, use 256-dimensional embeddings - 6x smaller than 1536, with roughly proportional savings in memory and distance computation - and only pay the quality penalty you're willing to accept. For high-stakes retrieval, use the full dimensionality. The same model serves both use cases. OpenAI's text-embedding-3-small and text-embedding-3-large both support shortened embeddings through the API's dimensions parameter: a 256-dimensional request returns a shortened vector that remains useful for retrieval because the model was trained for it, which is not true of truncating an arbitrary model's output.
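Shortening an MRL embedding yourself amounts to keeping the leading components and re-normalizing, so that dot products remain cosine similarities. A minimal sketch, assuming the vector comes from an MRL-trained model (the helper name `truncate_embedding` is illustrative):

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an embedding and rescale to unit length.

    Only meaningful for models trained with Matryoshka Representation Learning;
    document and query vectors must be truncated to the same `dim`.
    """
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]          # toy 3-dimensional "embedding"
short = truncate_embedding(full, 2)
print(short)                      # [0.6, 0.8]
```

Without the re-normalization step, truncated vectors have norms below 1, which skews dot-product scores and any similarity threshold tuned on full-length vectors.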

Q: You're building a RAG system for a biomedical literature search tool. Which embedding model would you start with and why?

A: For biomedical literature, domain-specific models often outperform general-purpose models, but verify this on your own corpus rather than assuming it. Candidates to evaluate: (1) BAAI/bge-large-en-v1.5 - strong general-purpose model, good baseline on scientific text; (2) microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract - pre-trained on PubMed abstracts, strong on biomedical terminology; (3) dmis-lab/biobert-large-cased-v1.1 - BioBERT, pre-trained on biomedical text. Note that (2) and (3) are pre-trained language models rather than retrieval-tuned embedding models, so they need contrastive fine-tuning (or a sentence-embedding variant built on them) before their vectors are useful for search. Practically: first establish a baseline with text-embedding-3-large (OpenAI) because it's easy and good on diverse text, including scientific. Then evaluate a biomedical-specific model on 100 representative query-document pairs from your corpus. The key evaluation: measure recall@5 on the biomedical pairs with each model. If the domain-specific model improves recall@5 by more than 3 percentage points, it's worth the operational overhead of self-hosting. The risk of general models: biomedical abbreviations (HTN = hypertension, DM2 = type 2 diabetes, MI = myocardial infarction) are underrepresented in general training corpora, causing vocabulary mismatch in embedding space. Domain-specific models help close this gap.

Q: A team is considering fine-tuning an embedding model on their domain data. What are the prerequisites, risks, and when is it worth the effort?

A: Prerequisites: You need (1) labeled training pairs - at minimum 1,000-5,000 query-document positive pairs, ideally with hard negatives (documents that look relevant but aren't). This is the biggest bottleneck - labeling is expensive. (2) Compute: fine-tuning a 335M parameter model like BGE-large requires roughly 8+ GB of GPU VRAM (more with larger batch sizes); a full fine-tuning run needs 4-16 hours on a V100 or A100. (3) Eval set: 200+ labeled pairs held out, never seen during training. Risks: (1) Catastrophic forgetting - fine-tuning may improve domain recall while degrading performance on general queries. Test on both domain and general eval sets. (2) Overfitting to training distribution - fine-tuned models can be brittle to new terminology. (3) Operational cost - you now own model hosting, versioning, and retraining. When it's worth it: your domain evaluation shows the best general-purpose model achieves recall@5 below 0.75, you have 2,000+ labeled pairs, and you have the infrastructure for model serving. When to skip it: your domain evaluation shows any general-purpose model achieving recall@5 above 0.85 - the marginal gain from fine-tuning is unlikely to justify the engineering overhead. The middle path: query-side techniques (HyDE, query expansion) can improve recall by 5-10 percentage points without training or hosting a new embedding model, although HyDE adds an LLM call per query - try these before committing to fine-tuning.
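One common way to obtain the hard negatives mentioned above is to mine them from an existing retriever: take the highest-scoring documents that are not labeled relevant for the query. The sketch below assumes you already have similarity scores per document; the function name and data layout are illustrative, and mined negatives should be spot-checked because some will be unlabeled positives.

```python
def mine_hard_negatives(scores, positive_ids, num_negatives=2):
    """Return the ids of the highest-scoring documents that are not labeled positive.

    scores:       {doc_id: similarity of that document to the query}
    positive_ids: set of doc ids labeled relevant for the query
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:num_negatives]


# Toy example: "a" is the labeled positive, "b" and "d" look relevant but are not
similarities = {"a": 0.90, "b": 0.85, "c": 0.40, "d": 0.80}
print(mine_hard_negatives(similarities, {"a"}))  # ['b', 'd']
```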

Summary

Embedding models are the first layer of quality in your RAG system. Everything downstream - retrieval recall, reranking, generation quality - is bounded by embedding quality. The highest-leverage decisions: (1) evaluate on your own domain data, not just MTEB; (2) consider the bi-encoder + cross-encoder two-stage architecture for precision-critical applications; (3) use Matryoshka-enabled models (text-embedding-3, BGE-M3) for flexible dimension trade-offs; (4) cache embeddings aggressively to control API cost at scale; (5) only fine-tune when general models genuinely fall short on domain evaluation. The next lesson covers where these embeddings live: vector databases and the indexing structures that make billion-scale retrieval fast.

