MTEB benchmark deep dive, nDCG@10, Recall@K, MRR, MAP, building domain-specific evaluation sets, running MTEB locally, and avoiding the contamination problem.

Evaluating Embedding Models

Reading time: 23 min | Relevance: AI Engineer, ML Engineer, Research Engineer

The Model That Scored Well on Everything Except What Mattered

An e-commerce team needs an embedding model for product search. They look at MTEB. The top model scores 64.2 on the MTEB leaderboard. They deploy it. Product search quality improves by a small amount over keyword search. But their key metric - conversion rate on search results - doesn't improve meaningfully.

Two months later, a team member builds a simple evaluation set: 500 real user queries from their search logs with human-labeled relevant products from their catalog. They run all the top MTEB models on this set. The MTEB leaderboard winner gets Recall@10 = 0.61. The third-ranked MTEB model gets Recall@10 = 0.78. On their actual task, the third-ranked model is dramatically better. They switch, and conversion rate improves 12%.

The lesson: MTEB measures general embedding quality. Your task is not general. The MTEB leaderboard is a starting point for model selection, not the final word. This lesson teaches you how to read evaluation metrics correctly, run MTEB yourself on custom tasks, and build a domain-specific evaluation set that measures what actually matters for your application.

Why Evaluation Is Hard for Embeddings

Evaluating embedding models is harder than evaluating discriminative models because:

The output is not directly meaningful: An embedding vector doesn't have a human-interpretable value. You can only evaluate it in the context of a downstream task (retrieval, clustering, classification).

Task diversity: Embeddings are used in many different tasks - retrieval, semantic similarity, clustering, classification. An embedding model that's excellent for retrieval may be mediocre for clustering.

Query-document asymmetry: Retrieval models must place semantically similar queries and documents near each other, even though they're written in different styles. This is hard to evaluate without labeled query-document pairs.

Long-tail failures: Average metrics can look good while the model fails systematically on specific query types or domains. You need enough examples from all important subsets to catch these failures.

MTEB: The Standard Evaluation Framework

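MTEB (Massive Text Embedding Benchmark, Muennighoff et al. 2022) evaluates embedding models across 8 task types. The English benchmark behind the original leaderboard covers 56 datasets; the full paper spans 58 datasets and 112 languages. It's the standard comparison tool in the field.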

The 8 task types

Retrieval (most important for RAG and search): Given a query, retrieve the most relevant documents from a corpus. Datasets include MSMARCO (web search), HotpotQA (multi-hop Q&A), NFCorpus (medical retrieval), ArguAna (argument retrieval), and more. Primary metric: nDCG@10.

Clustering: Group texts by topic without labels. Datasets span ArXiv paper clusters to Reddit thread clusters to news article topics. Primary metric: V-measure.

Classification: Classify texts into predefined categories. Datasets include sentiment analysis, topic classification, and emotion recognition. Primary metric: Accuracy or F1.

STS (Semantic Textual Similarity): Rate how similar two sentences are on a continuous scale. Datasets like STS-B, SICK-R. Primary metric: Spearman correlation with human ratings.

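Reranking: Given a query and a list of candidate documents, reorder them by relevance. Primary metric: MAP (Mean Average Precision).

Pair classification: Decide whether two texts are related, for example duplicates or paraphrases. Primary metric: average precision.

Bitext mining: Match sentences with their translations in another language. Primary metric: F1.

Summarization: Score machine-written summaries against human-written references. Primary metric: Spearman correlation with human ratings.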

nDCG@K: The Key Retrieval Metric

nDCG@K (Normalized Discounted Cumulative Gain at K) is the primary metric for retrieval evaluation. Understanding it is essential for reading any retrieval evaluation report.

Motivation

Simple metrics like Recall@K ("did we retrieve the relevant document in the top K results?") treat all positions equally - retrieval at rank 1 is as good as retrieval at rank K. But in practice, users click on rank 1 far more than rank K. nDCG rewards systems that put relevant documents at higher ranks.

DCG (Discounted Cumulative Gain)

For a ranking of K documents, where $\text{rel}(i)$ is the relevance score of the document at rank $i$ :

$\text{DCG}@K = \sum_{i=1}^{K} \frac{\text{rel}(i)}{\log_2(i + 1)}$

The denominator $\log_2(i + 1)$ is the "discount" - rank 1 has discount $\log_2(2) = 1$ (no discount), rank 7 has discount $\log_2(8) = 3$ (much larger discount).

For binary relevance (relevant/not-relevant): $\text{rel}(i) \in \{0, 1\}$ .

IDCG and nDCG

IDCG (Ideal DCG) is the DCG of the perfect ranking - all relevant documents at the top positions.

$\text{nDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}$

nDCG is bounded in $[0, 1]$ . A score of 1.0 means all relevant documents are at the top of the ranking. A score of 0 means no relevant documents appear in the top K results.
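
As a worked example, take a top-10 ranking with relevant documents at ranks 1, 4, and 8, and assume (for illustration) that these three are the only relevant documents for the query. Only the relevant ranks contribute to DCG, and the ideal ranking places the same three documents at ranks 1, 2, and 3:

$\text{DCG}@10 = \frac{1}{\log_2 2} + \frac{1}{\log_2 5} + \frac{1}{\log_2 9} \approx 1 + 0.431 + 0.315 = 1.746$

$\text{IDCG}@10 = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} \approx 1 + 0.631 + 0.5 = 2.131$

$\text{nDCG}@10 = \frac{\text{DCG}@10}{\text{IDCG}@10} \approx \frac{1.746}{2.131} \approx 0.819$

If the query had further relevant documents that were not retrieved, IDCG would be larger and nDCG correspondingly lower.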

import numpy as np

def dcg_at_k(relevances: list[float], k: int) -> float:
    """
    Compute DCG@K given relevance scores for a ranked list.
    relevances[i] = relevance score of document at rank i+1 (0-indexed).
    """
    k = min(k, len(relevances))
    relevances = np.array(relevances[:k])
    ranks = np.arange(1, k + 1)  # 1-indexed ranks
    discounts = np.log2(ranks + 1)
    return float(np.sum(relevances / discounts))

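def ndcg_at_k(
    relevances: list[float],
    k: int,
    ideal_relevances: list[float] | None = None,
) -> float:
    """
    Compute nDCG@K. relevances are the relevance scores in retrieval order.
    ideal_relevances: scores of ALL relevant documents for the query. If omitted,
    the ideal ranking is built from the retrieved list only, which overstates
    nDCG whenever relevant documents were not retrieved.
    """
    actual_dcg = dcg_at_k(relevances, k)
    pool = relevances if ideal_relevances is None else ideal_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return actual_dcg / ideal_dcg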

# Example:
# Query: "heart attack symptoms"
# Top-10 retrieved documents, relevance scores (1=relevant, 0=not):
retrieved_relevances = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

ndcg = ndcg_at_k(retrieved_relevances, k=10)
print(f"nDCG@10 = {ndcg:.4f}")  # Lower than 1.0 because relevant docs not at top


# Full evaluation: mean nDCG@10 across all queries
def evaluate_retrieval(
    queries: list[str],
    relevant_doc_ids: list[set[str]],  # Set of relevant doc IDs for each query
    retrieved_doc_ids: list[list[str]],  # Retrieved doc IDs (ranked) for each query
    k: int = 10,
) -> dict:
    """
    Compute retrieval evaluation metrics across all queries.
    """
    ndcg_scores = []
    recall_scores = []
    mrr_scores = []

    for q_relevant, q_retrieved in zip(relevant_doc_ids, retrieved_doc_ids):
        retrieved_k = q_retrieved[:k]

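        # nDCG@K: binary relevance (1 if retrieved doc is relevant)
        relevances = [1 if doc_id in q_relevant else 0 for doc_id in retrieved_k]
        # Ideal DCG places every relevant doc (up to k) at the top, so relevant
        # docs the model missed lower the score instead of being ignored
        ideal_dcg = dcg_at_k([1] * min(len(q_relevant), k), k)
        ndcg_scores.append(dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0)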

        # Recall@K: fraction of relevant docs retrieved in top K
        n_relevant_retrieved = sum(1 for doc_id in retrieved_k if doc_id in q_relevant)
        recall_scores.append(n_relevant_retrieved / max(1, len(q_relevant)))

        # MRR: 1/rank_of_first_relevant_document
        mrr = 0.0
        for rank, doc_id in enumerate(q_retrieved, 1):
            if doc_id in q_relevant:
                mrr = 1.0 / rank
                break
        mrr_scores.append(mrr)

    return {
        f"nDCG@{k}": np.mean(ndcg_scores),
        f"Recall@{k}": np.mean(recall_scores),
        "MRR": np.mean(mrr_scores),
        "n_queries": len(queries),
    }
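
The same idea extends to graded relevance labels. The following is a minimal sketch, assuming judgments are stored as a dict from document ID to grade; the function name `ndcg_with_judgments` and the document IDs are illustrative and not part of any library. It uses the linear gain from the DCG formula above and builds the ideal ranking from all judged documents, not only the retrieved ones.

```python
import math


def ndcg_with_judgments(ranked_doc_ids, judgments, k=10):
    """nDCG@k with graded relevance.

    judgments maps doc_id -> grade (e.g. 0-3); unjudged documents count as 0.
    The ideal ranking is built from ALL judged documents for the query.
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical judgments for one query
judgments = {"d1": 3, "d2": 2, "d3": 1}

perfect = ndcg_with_judgments(["d1", "d2", "d3", "d9"], judgments)
partial = ndcg_with_judgments(["d1", "d9", "d8"], judgments)  # misses d2 and d3

print(f"perfect ranking: {perfect:.3f}")
print(f"partial ranking: {partial:.3f}")
```

The second ranking puts the best document first but never retrieves the other two, so its score falls well below 1.0 even though nothing irrelevant outranks a retrieved relevant document.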

Recall@K, MRR, and MAP

Recall@K

$\text{Recall}@K = \frac{|\text{relevant} \cap \text{retrieved top-}K|}{|\text{relevant}|}$

Recall@K answers: "What fraction of the relevant documents did we retrieve in the top K results?" For RAG applications with a single relevant document per query, Recall@1 (did we get the right document at rank 1?) is particularly important.

Common values for RAG evaluation: Recall@1, Recall@5, Recall@10.
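
A short sketch of reporting Recall at several cutoffs for one query. The helper name `recall_at_k` and the document IDs are made up for illustration.

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


ranked = ["d7", "d2", "d9", "d4", "d1"]  # model output, best first
relevant = {"d2", "d1"}                  # labeled relevant documents

recall_by_k = {k: recall_at_k(ranked, relevant, k) for k in (1, 5, 10)}
print(recall_by_k)
```

Reporting the whole sweep shows where relevant documents start to appear: here nothing relevant is at rank 1, but both relevant documents are found by rank 5.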

MRR (Mean Reciprocal Rank)

$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$

where $\text{rank}_q$ is the rank of the first relevant document for query $q$ .

MRR focuses on where the first relevant document appears. If the first relevant document is at rank 1, MRR contribution is 1.0. If it's at rank 5, contribution is 0.2. MRR is most appropriate when there's only one relevant document per query (or you only care about the most relevant document).
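
The formula leaves one case open: a query where no relevant document is retrieved. A common convention, assumed in this sketch, is to count such a query as 0. The function name and the tiny example runs are illustrative.

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# (ranked results, relevant set) for three hypothetical queries
runs = [
    (["a", "b", "c"], {"a"}),  # first relevant at rank 1
    (["a", "b", "c"], {"c"}),  # first relevant at rank 3
    (["a", "b", "c"], {"z"}),  # relevant document never retrieved
]

mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
print(f"MRR = {mrr:.4f}")
```

Dropping the failed query from the average instead of scoring it 0 would raise MRR and hide retrieval misses, so it is worth stating which convention a report uses.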

MAP (Mean Average Precision)

MAP is used when each query has multiple relevant documents, and you want to reward systems that retrieve them all, not just the first one.

$\text{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{K} P@k \cdot \text{rel}(k)$

where $R_q$ is the set of relevant documents, $P@k$ is precision at rank $k$ , and $\text{rel}(k)$ is 1 if the document at rank $k$ is relevant.

MAP = mean of AP across all queries.
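
A minimal sketch of AP and MAP following the formula above, dividing by the total number of relevant documents $|R_q|$. The function names and example data are illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids, k=None):
    """AP for one query: mean of P@rank over the ranks holding a relevant doc."""
    if not relevant_ids:
        return 0.0
    ranked = ranked_doc_ids if k is None else ranked_doc_ids[:k]
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # P@rank, counted only at relevant ranks
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, k=None):
    """MAP: mean of AP over (ranked_doc_ids, relevant_ids) pairs."""
    return sum(average_precision(ranked, rel, k) for ranked, rel in runs) / len(runs)


runs = [
    (["d1", "d5", "d2", "d9"], {"d1", "d2"}),  # relevant at ranks 1 and 3
    (["d7", "d3"], {"d3", "d4"}),              # one of two relevant docs found, at rank 2
]
ap_first = average_precision(*runs[0])
ap_second = average_precision(*runs[1])
map_score = mean_average_precision(runs)
print(f"AP = {ap_first:.4f}, {ap_second:.4f}; MAP = {map_score:.4f}")
```

Because the denominator is the number of relevant documents and not the number retrieved, the second query is penalized for the relevant document it never returns.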

Which metric to use?

| Use Case | Recommended Metric |
| --- | --- |
| Single relevant doc per query (Q&A, RAG) | MRR and Recall@K |
| Multiple relevant docs (general retrieval) | nDCG@10 and MAP |
| MTEB-standard comparison | nDCG@10 |
| User-focused (top result quality) | MRR or Recall@1 |
| Dataset with relevance grades (0, 1, 2) | nDCG@K |

Running MTEB Locally

You can evaluate any embedding model on MTEB using the mteb Python package:

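import numpy as np  # used by the custom wrapper further down
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Note: this uses the classic MTEB(tasks=[...]) interface. The mteb API has
# changed across releases, so check the documentation for your installed version.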

# Basic evaluation on a subset of tasks
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Retrieval tasks most relevant to RAG applications
retrieval_tasks = [
    "NFCorpus",      # Medical retrieval
    "MSMARCO",       # Web search
    "HotpotQA",      # Multi-hop reasoning
    "FiQA2018",      # Financial Q&A
    "ArguAna",       # Argument retrieval
]

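evaluation = MTEB(tasks=retrieval_tasks)
results = evaluation.run(
    model,
    output_folder="./mteb_results/bge-large",
    # No eval_splits override: each task then uses its own default split
    # (most use "test"; MSMARCO is scored on "dev" in MTEB).
    overwrite_results=False,
)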

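# Results are saved as JSON files
# Load and display
import json
from pathlib import Path

results_dir = Path("./mteb_results/bge-large")
# rglob: some mteb versions nest results in model/revision subfolders
for result_file in sorted(results_dir.rglob("*.json")):
    with open(result_file) as f:
        result = json.load(f)
    # This lookup matches the flat {"test": {"ndcg_at_10": ...}} layout.
    # The JSON schema differs between mteb versions, so inspect one file
    # and adjust the keys if nothing prints.
    split_scores = result.get("test") or result.get("dev")
    if not isinstance(split_scores, dict) or "ndcg_at_10" not in split_scores:
        continue  # not a score file in this layout (e.g. model metadata)
    print(f"{result_file.stem}: nDCG@10 = {split_scores['ndcg_at_10']:.4f}")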


# Custom model wrapper if your model isn't a SentenceTransformer
class CustomModelWrapper:
    """Wrap any model to work with MTEB."""

    def __init__(self, embed_func):
        self.embed_func = embed_func

    def encode(
        self,
        sentences: list[str],
        batch_size: int = 32,
        show_progress_bar: bool = False,
        **kwargs
    ) -> np.ndarray:
        all_embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            embeddings = self.embed_func(batch)
            all_embeddings.append(np.array(embeddings))
        return np.vstack(all_embeddings)


# Use with OpenAI embeddings
def openai_embed(texts: list[str]) -> list[list[float]]:
    from openai import OpenAI
    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

custom_wrapper = CustomModelWrapper(openai_embed)
evaluation = MTEB(tasks=["NFCorpus"])
results = evaluation.run(custom_wrapper, output_folder="./mteb_results/openai-3-small")

Building Your Own Domain Evaluation Set

MTEB benchmarks are excellent for model comparison but may not reflect your domain performance. Build a domain-specific evaluation set:

Step 1: Collect query-document pairs

Option A: From user logs. If you have existing search logs, use query-click data as a weak relevance signal. Clicked documents are likely relevant; unclicked documents shown in the same session are candidate negatives, though a missing click is weaker evidence than a click.

Option B: Human annotation. Have domain experts annotate (query, document) pairs with relevance scores (0=irrelevant, 1=partial, 2=relevant, 3=highly relevant). Even 200 annotated pairs significantly improve evaluation quality.

Option C: Synthetic with LLM + human validation. Generate query-document pairs synthetically (as in Lesson 04) and have a human expert validate 20-30% of them. Use the human agreement rate to calibrate confidence in the synthetic pairs.
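
For Option A, a minimal sketch of turning click logs into relevance labels. It assumes the log is available as (query, clicked document ID) pairs; the function name `qrels_from_clicks`, the `min_clicks` threshold, and the SKU values are all hypothetical. Requiring repeated clicks is one simple way to filter accidental clicks, not the only one.

```python
from collections import Counter, defaultdict


def qrels_from_clicks(click_log, min_clicks=2):
    """Build {query: set_of_relevant_doc_ids} from (query, clicked_doc_id) pairs.

    A document counts as relevant for a query only if that pair was clicked
    at least min_clicks times, which drops one-off stray clicks.
    """
    pair_counts = Counter(click_log)
    qrels = defaultdict(set)
    for (query, doc_id), n_clicks in pair_counts.items():
        if n_clicks >= min_clicks:
            qrels[query].add(doc_id)
    return dict(qrels)


click_log = [
    ("red running shoes", "sku-101"),
    ("red running shoes", "sku-101"),
    ("red running shoes", "sku-555"),  # single click, filtered out
    ("usb-c charger", "sku-202"),
    ("usb-c charger", "sku-202"),
]

qrels = qrels_from_clicks(click_log)
print(qrels)
```

The resulting dict has the same shape as a mapping from query to relevant document IDs, so it can feed an evaluation loop once queries are assigned IDs. Click data is biased toward whatever the old system ranked highly, so a sample of these labels is still worth checking by hand.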

Step 2: Create a corpus

Your evaluation corpus should include:

All relevant documents for your queries
Many irrelevant documents (the "distractors")
Corpus size should be at least 10× the number of relevant documents

Small corpora are easy (any model retrieves the relevant document from 10 options). Large corpora are hard (finding the relevant document among 10,000 requires a good model). Aim for at least 1,000-10,000 documents in your evaluation corpus.
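
One way to assemble such a corpus is to keep every relevant document and sample distractors from the rest of your collection. This is a sketch under stated assumptions: `build_eval_corpus` and `distractor_ratio` are illustrative names, and uniform random sampling is the simplest choice (sampling hard negatives from the same product category or topic would make the set harder).

```python
import random


def build_eval_corpus(all_docs, relevant_docs, distractor_ratio=10, seed=0):
    """Return {doc_id: text} with all relevant docs plus sampled distractors.

    all_docs: {doc_id: text} for the full collection.
    relevant_docs: {qid: set_of_relevant_doc_ids}.
    """
    relevant_ids = set().union(*relevant_docs.values())
    missing = relevant_ids - set(all_docs)
    if missing:
        raise KeyError(f"Relevant docs missing from collection: {sorted(missing)}")

    # Sort before sampling so the result is reproducible for a given seed
    candidates = sorted(doc_id for doc_id in all_docs if doc_id not in relevant_ids)
    n_distractors = min(len(candidates), distractor_ratio * len(relevant_ids))
    distractors = random.Random(seed).sample(candidates, n_distractors)

    keep = relevant_ids | set(distractors)
    return {doc_id: text for doc_id, text in all_docs.items() if doc_id in keep}


# Toy collection of 100 documents with two distinct relevant documents
all_docs = {f"doc-{i}": f"text of document {i}" for i in range(100)}
relevant_docs = {"q1": {"doc-3"}, "q2": {"doc-3", "doc-40"}}

corpus = build_eval_corpus(all_docs, relevant_docs)
print(len(corpus))
```

Fixing the seed keeps the corpus identical across model comparisons, which matters because a different distractor sample changes the difficulty of the task.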

Step 3: Implement your evaluation

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
from dataclasses import dataclass
from typing import Optional

@dataclass
class DomainEvalDataset:
    queries: dict[str, str]           # {qid: query_text}
    corpus: dict[str, str]            # {docid: doc_text}
    relevant_docs: dict[str, set[str]]  # {qid: set_of_relevant_docids}
    name: str = "domain-eval"


def run_domain_evaluation(
    model: SentenceTransformer,
    dataset: DomainEvalDataset,
    batch_size: int = 256,
    k: int = 10,
    prefix_query: Optional[str] = None,   # E.g., "query: " for E5
    prefix_doc: Optional[str] = None,     # E.g., "passage: " for E5
) -> dict:
    """
    Evaluate an embedding model on a domain-specific evaluation set.
    """
    queries = list(dataset.queries.items())  # [(qid, text)]
    corpus = list(dataset.corpus.items())     # [(docid, text)]

    # Apply prefixes if required by the model
    query_texts = [
        f"{prefix_query}{text}" if prefix_query else text
        for _, text in queries
    ]
    doc_texts = [
        f"{prefix_doc}{text}" if prefix_doc else text
        for _, text in corpus
    ]

    print(f"Encoding {len(queries)} queries...")
    query_embs = model.encode(
        query_texts, normalize_embeddings=True,
        batch_size=batch_size, show_progress_bar=True
    )

    print(f"Encoding {len(corpus)} corpus documents...")
    doc_embs = model.encode(
        doc_texts, normalize_embeddings=True,
        batch_size=batch_size, show_progress_bar=True
    )

    # Build FAISS index
    dim = doc_embs.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(doc_embs.astype(np.float32))

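    # Search (cap at corpus size: FAISS pads missing neighbours with index -1,
    # which would silently map to the last document in docid_list)
    top_k = min(k, len(corpus))
    sims, indices = index.search(query_embs.astype(np.float32), top_k)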

    # Compute metrics
    docid_list = [docid for docid, _ in corpus]
    ndcg_scores, recall_scores, mrr_scores = [], [], []

    for i, (qid, _) in enumerate(queries):
        relevant = dataset.relevant_docs.get(qid, set())
        retrieved = [docid_list[idx] for idx in indices[i]]

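        # Relevance binary: 1 if retrieved doc is relevant
        relevances = [1 if doc_id in relevant else 0 for doc_id in retrieved]

        # Ideal DCG places every relevant doc (up to k) at the top, so relevant
        # docs the model missed lower the score instead of being ignored
        ideal_dcg = dcg_at_k([1] * min(len(relevant), k), k)
        ndcg_scores.append(dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0)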
        mrr = next(
            (1.0 / (rank + 1) for rank, doc_id in enumerate(retrieved)
             if doc_id in relevant),
            0.0
        )
        mrr_scores.append(mrr)

        n_retrieved_relevant = sum(1 for doc_id in retrieved if doc_id in relevant)
        recall_scores.append(n_retrieved_relevant / max(1, len(relevant)))

    metrics = {
        f"nDCG@{k}": float(np.mean(ndcg_scores)),
        f"Recall@{k}": float(np.mean(recall_scores)),
        "MRR": float(np.mean(mrr_scores)),
        "n_queries": len(queries),
        "corpus_size": len(corpus),
        "dataset": dataset.name,
    }

    return metrics


def compare_models(
    model_names: list[str],
    dataset: DomainEvalDataset,
    model_configs: dict = None,  # {model_name: {"prefix_query": ..., "prefix_doc": ...}}
) -> dict:
    """Compare multiple models on the same domain evaluation set."""
    results = {}
    for model_name in model_names:
        print(f"\nEvaluating {model_name}...")
        model = SentenceTransformer(model_name)
        config = (model_configs or {}).get(model_name, {})
        metrics = run_domain_evaluation(model, dataset, **config)
        results[model_name] = metrics
        k = config.get("k", 10)  # run_domain_evaluation's default cutoff is 10
        print(f"  nDCG@{k}: {metrics[f'nDCG@{k}']:.4f}, "
              f"MRR: {metrics['MRR']:.4f}, "
              f"Recall@{k}: {metrics[f'Recall@{k}']:.4f}")

    return results

Step 4: Interpret results

Key patterns to look for:

High nDCG@10, low MRR: The model retrieves relevant documents in the top-10 but not necessarily at rank 1. For user-facing search, this is often acceptable. For single-step RAG, improve MRR.

High Recall@10, low nDCG@10: The model finds the relevant documents but places them low within the top 10. Consider adding a reranker (cross-encoder) to improve ranking precision.

Low Recall@10: The model simply doesn't find relevant documents. Options: better embedding model, more aggressive domain fine-tuning, hybrid retrieval (BM25 + embeddings).

The Contamination Problem

MTEB benchmarks are public. Examples from MTEB's test sets may have leaked into the training data of recent models - particularly large closed-source models trained on web-scraped data.

This matters because:

Models that have "seen" MTEB test examples perform better on MTEB not because they're better embedders, but because they memorized answers
MTEB rankings for very large models (especially via API) may be inflated by contamination
Open-source models with documented training data are more reliable for benchmark comparison

Detecting contamination

Checking for contamination in your model isn't always possible (especially for API models). Signs of potential contamination:

Model performs significantly better on MTEB than on your domain evaluation set
Model performance on newer MTEB datasets (added after the model's training cutoff) is much lower than on older datasets
The model is trained on web-scraped data without explicit deduplication against MTEB test sets
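
The second sign in this list can be turned into a rough check: compare a model's average score on datasets released before its training cutoff with its average on datasets released afterwards. This is a sketch only. The function name, dataset names, years, and scores below are invented for illustration, and a gap is suggestive rather than proof, since newer datasets may simply be harder.

```python
from statistics import mean


def contamination_gap(scores, release_year, training_cutoff_year):
    """Mean score on datasets released up to the cutoff minus mean on later ones.

    scores: {dataset_name: metric value, e.g. nDCG@10}
    release_year: {dataset_name: year the dataset became public}
    Returns None when either group is empty.
    """
    older = [s for name, s in scores.items() if release_year[name] <= training_cutoff_year]
    newer = [s for name, s in scores.items() if release_year[name] > training_cutoff_year]
    if not older or not newer:
        return None
    return mean(older) - mean(newer)


# Invented numbers for a hypothetical model with a 2022 training cutoff
scores = {"old_a": 0.62, "old_b": 0.58, "new_a": 0.41, "new_b": 0.45}
release_year = {"old_a": 2019, "old_b": 2021, "new_a": 2023, "new_b": 2024}

gap = contamination_gap(scores, release_year, training_cutoff_year=2022)
print(f"older minus newer: {gap:.2f}")
```

Running the same comparison for several models helps separate the two explanations: if every model drops by a similar amount on the newer datasets, difficulty is the likelier cause; if one model drops far more than the rest, contamination becomes more plausible.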

For reliable comparison, use:

Holdout datasets that are not part of public benchmarks
Your own domain evaluation sets (created after the model's training cutoff)
Multiple evaluation methodologies (MTEB + domain + human evaluation)

Online vs Offline Evaluation

Offline evaluation (what we've discussed)

Evaluate on a static held-out dataset with labeled relevance. Fast, reproducible, cheap. Limitations: static dataset may not reflect production distribution, and labeled relevance may not match actual user utility.

Online evaluation (A/B testing)

Deploy two versions of your embedding system and measure real user behavior.

Metrics for online evaluation of retrieval:

Click-through rate (CTR): fraction of queries where user clicks a result
Click position distribution: ideally, users click results at rank 1-2 most often
Dwell time: how long users spend with retrieved content (proxy for quality)
No-click rate: fraction of queries with no clicks (user didn't find what they wanted)
Conversion rate: fraction of searches leading to downstream actions (purchases, sign-ups)

# Tracking retrieval quality metrics in production

from dataclasses import dataclass
from datetime import datetime

@dataclass
class SearchEvent:
    timestamp: datetime
    query: str
    session_id: str
    model_version: str  # Which embedding model was used
    retrieved_doc_ids: list[str]
    clicked_doc_id: str | None  # Which result the user clicked, if any
    click_rank: int | None  # Rank of the clicked result

def compute_online_metrics(events: list[SearchEvent]) -> dict:
    """Compute online retrieval quality metrics from production events."""
    if not events:
        return {}

    ctr = sum(1 for e in events if e.clicked_doc_id is not None) / len(events)
    mrr = 0.0
    for event in events:
        if event.click_rank is not None:
            mrr += 1.0 / event.click_rank
    mrr /= len(events)

    # Per-model breakdown (for A/B test comparison)
    by_model = {}
    for event in events:
        model = event.model_version
        if model not in by_model:
            by_model[model] = []
        by_model[model].append(event)

    model_metrics = {
        model: {
            "ctr": sum(1 for e in m_events if e.clicked_doc_id) / len(m_events),
            "mrr": sum(1/e.click_rank for e in m_events if e.click_rank) / len(m_events),
            "n_queries": len(m_events),
        }
        for model, m_events in by_model.items()
    }

    return {"overall": {"ctr": ctr, "mrr": mrr}, "by_model": model_metrics}

Common Mistakes

:::danger Reporting only the MTEB average
The MTEB average includes clustering, classification, STS, and other tasks that may be irrelevant to your use case. For RAG and search applications, report Retrieval Average specifically. A model with high MTEB average driven by strong clustering performance may have mediocre retrieval - which is what actually matters for your system.
:::

:::danger Not holding out evaluation data from fine-tuning
If you create an evaluation set and then use it (or similar data) for fine-tuning, your evaluation is contaminated. Keep a strictly held-out test set that is never used for training, validation, or hyperparameter selection. Only look at the test set once you're done with model development.
:::

:::warning Using a corpus that's too small for evaluation
A corpus of 100 documents makes retrieval trivially easy - any model achieves high Recall@10. A corpus of 10,000 documents is more realistic. Use at least 1,000 documents in your evaluation corpus, and ensure the ratio of relevant to irrelevant documents matches your production setting.
:::

:::warning Conflating semantic similarity and retrieval
STS metrics (Spearman correlation on similarity scores) measure how well the model ranks pairs by similarity. Retrieval metrics (nDCG@10, Recall@K) measure how well the model finds relevant documents from a corpus. These are related but different tasks. A model with high STS scores might have mediocre retrieval performance if it doesn't generalize from pair similarity to corpus-scale retrieval.
:::

:::tip Evaluate on multiple query types separately
Different query types (keyword queries, natural language questions, technical queries) often have different performance profiles. Evaluate your model separately on each type. This reveals model weaknesses that an average metric hides, and lets you target fine-tuning or retrieval improvements where they matter most.
:::
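
A per-type breakdown can be sketched as follows, assuming you already have one score per query (for example nDCG@10) and a label for each query. The function name, the labels, and the scores are illustrative.

```python
from collections import defaultdict
from statistics import mean


def metrics_by_query_type(per_query_scores, query_types):
    """Group per-query scores by query type and report mean and count per group.

    per_query_scores: {qid: score}
    query_types: {qid: label}; queries without a label go to "unlabeled".
    """
    buckets = defaultdict(list)
    for qid, score in per_query_scores.items():
        buckets[query_types.get(qid, "unlabeled")].append(score)
    return {
        label: {"mean": mean(values), "n_queries": len(values)}
        for label, values in sorted(buckets.items())
    }


per_query_scores = {"q1": 0.9, "q2": 0.8, "q3": 0.2, "q4": 0.4}
query_types = {"q1": "keyword", "q2": "keyword", "q3": "question", "q4": "question"}

breakdown = metrics_by_query_type(per_query_scores, query_types)
for label, stats in breakdown.items():
    print(f"{label}: mean={stats['mean']:.2f} (n={stats['n_queries']})")
```

In this toy data the overall mean is 0.575, which hides that natural language questions score far below keyword queries. Reporting the query count next to each mean also flags groups that are too small to trust.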

Interview Q&A

Q1: What is nDCG@10 and why is it the standard retrieval metric?

nDCG@10 (Normalized Discounted Cumulative Gain at 10) measures retrieval quality while accounting for the rank position of relevant documents. DCG computes cumulative relevance with a logarithmic discount by rank position - rank 1 has no discount, rank 7 is discounted by $\log_2(8)=3$ . IDCG is the DCG of the ideal ranking (all relevant documents at the top). nDCG = DCG/IDCG, normalized to $[0,1]$ .

It's the standard because it rewards systems that put relevant documents at higher ranks (not just anywhere in the top 10), handles multiple relevant documents naturally, and supports graded relevance labels, so a highly relevant document counts for more than a partially relevant one. Simple metrics like Recall@K treat rank 1 and rank 10 equivalently, which doesn't match user behavior.

Q2: What's the difference between Recall@K and MRR?

Recall@K asks: "What fraction of all relevant documents did we retrieve in the top K?" It measures coverage - how many relevant documents we found. Good when there are multiple relevant documents per query.

MRR asks: "What is the average reciprocal rank of the first relevant document?" It measures how quickly we find the first relevant result. A result at rank 1 contributes 1.0 to MRR; rank 5 contributes 0.2. Good for applications where finding the best single answer is the goal (single-document Q&A, FAQ retrieval).

For RAG: both matter. Recall@K ensures the relevant context is retrieved. MRR (or Recall@1) ensures the relevant context is at a high rank where the LLM will actually use it.

Q3: How do you build a domain-specific evaluation set from scratch?

Three steps. First, collect query-document pairs from your domain: use search logs (query-click pairs as weak relevance signal), human annotation (most reliable), or synthetic generation (LLM-generated queries per document, filtered for quality). Aim for 200-500 query-document pairs minimum.

Second, build a realistic corpus: include all relevant documents plus at least 10× as many distractors from your domain (not from general web text). Too-small corpora make retrieval trivially easy.

Third, implement evaluation: use the standard metrics (nDCG@10, Recall@K, MRR) and compare your baseline model to alternatives. Keep a strict test set that's never used for fine-tuning or hyperparameter selection.

Q4: What is the MTEB contamination problem?

MTEB test sets are public data. Large models trained on web-scraped data may have seen MTEB test examples in their training data, inflating their benchmark scores - they "memorize" answers rather than generalizing retrieval quality. This is particularly concerning for large closed-source API models where training data is not disclosed.

Mitigations: use holdout datasets not in public benchmarks, prefer models with documented training data that explicitly excludes MTEB test sets, use your own domain evaluation set (created after the model's training cutoff), and compare models on multiple evaluation methodologies rather than relying solely on MTEB.

Q5: When should you run offline evaluation vs online A/B testing for embedding models?

Offline evaluation is fast, cheap, and reproducible - use it during development to compare model candidates and tune hyperparameters. It catches systematic quality differences but may not reflect user behavior perfectly (labeled relevance ≠ user utility).

Online A/B testing measures real user behavior - click-through rate, dwell time, conversion rate - which is the ultimate signal for user-facing applications. Use it after you've selected a candidate via offline evaluation to confirm real-world improvement before full deployment. Never launch a new embedding model based only on offline metrics; the deployment infrastructure and real query distribution can change results. A/B tests often need on the order of 1,000-10,000 queries per condition to detect meaningful differences; the exact number depends on the baseline rate and on the size of the effect you want to detect.
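
To see where such numbers come from, here is a sketch of the standard sample-size approximation for comparing two proportions (for example click-through rates) with a two-sided test. The function name and the example rates are assumptions for illustration; a real experiment plan should also account for multiple queries per user and for the number of metrics being tested.

```python
import math
from statistics import NormalDist


def queries_per_arm(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate queries needed per condition to detect a change in a rate.

    Uses the normal approximation for a two-sided, two-proportion test.
    """
    effect = expected_rate - baseline_rate
    if effect == 0:
        raise ValueError("baseline_rate and expected_rate must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


small_lift = queries_per_arm(0.30, 0.33)  # CTR moves from 30% to 33%
large_lift = queries_per_arm(0.30, 0.36)  # CTR moves from 30% to 36%
print(f"30% -> 33%: {small_lift} queries per arm")
print(f"30% -> 36%: {large_lift} queries per arm")
```

Halving the detectable effect roughly quadruples the required traffic, which is why small embedding-model improvements can take far longer to confirm online than offline.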

Summary

Embedding evaluation requires multiple complementary approaches:

Standard metrics:

nDCG@K: Primary retrieval metric - rewards ranking relevant documents at higher positions
Recall@K: Did we retrieve relevant documents at all?
MRR: How quickly do we find the first relevant result?
MAP: Comprehensive for multiple relevant documents per query

MTEB: The standard benchmark covering 56 datasets and 8 task types. Use Retrieval Average for RAG applications. Be aware of contamination for large models.

Domain evaluation: Build your own domain-specific evaluation set for any specialized application. This is the most reliable predictor of production performance.

Online evaluation: A/B test in production to confirm improvements with real user behavior. Offline metrics are necessary but not sufficient for deployment decisions.

The key principle: always evaluate on data that matches your actual deployment distribution, not just on public benchmarks.
