Fine-Tuning Embedding Models for Your Domain

Reading time: 28 min | Relevance: AI Engineer, ML Engineer, Research Engineer

When the General Model Fails

You're building a RAG system for a biomedical company. You index 50,000 clinical trial documents. You use BGE-large - top of the MTEB leaderboard, state-of-the-art. You query: "adverse events in phase 3 oncology studies." You retrieve... studies about event planning software. The query "adverse events" retrieves documents about corporate events because "adverse" is rare in general text and the model associates "events" with scheduled gatherings.

You switch to voyage-3-finance. Better, but still wrong - clinical trial language has its own vocabulary that neither financial nor general models understand. "SAE" means serious adverse event, not a financial instrument. "Endpoint" is a clinical outcome, not a network endpoint. "Arms" are trial groups, not body parts.

This is the domain gap problem. General embedding models are trained on general text. When your domain uses specialized terminology, abbreviations, and conceptual frameworks that rarely appear in general pre-training data, general models fail. The solution is fine-tuning - training the embedding model on examples from your domain.

Fine-tuned domain-specific embedding models often outperform general models on domain-specific retrieval, with gains of 10-30 percentage points in metrics such as Recall@k when the domain gap is large. The barrier is data: you need (query, positive document, negative documents) triplets to train with contrastive learning. This lesson covers how to get that data - including when you don't have much labeled data - and how to run the fine-tuning.

Why General Embeddings Underperform on Specialized Text

Vocabulary mismatch

General embedding models learn representations based on word co-occurrence in general text. Specialized vocabulary - clinical terms, legal jargon, programming idioms, financial terminology - appears rarely in general pre-training data. The model has weak or no representations for these terms.

Conceptual framework mismatch

In general text, "model" most commonly refers to a physical or fashion model. In ML discourse, it refers to a trained neural network. The embedding model may not have learned to associate ML-context "model" with training, inference, and evaluation.

Retrieval asymmetry mismatch

In domain Q&A, questions are often short ("What is the standard of care for sepsis?") while answers are long paragraphs from clinical guidelines. General models may not have been trained with this specific asymmetry in mind for your domain.

Quantifying the gap

Benchmark your target domain with a small sample of 100-200 query-document pairs with relevance labels:

from sentence_transformers import SentenceTransformer
import numpy as np

def evaluate_model_on_domain(
    model_name: str,
    queries: list[str],
    relevant_docs: list[str],
    corpus: list[str],  # All documents including relevant and irrelevant
    relevant_indices: list[int],  # Index in corpus for each query's relevant doc
) -> dict:
    """
    Evaluate an embedding model on domain-specific retrieval.
    Computes Recall@1, Recall@5, MRR.
    """
    model = SentenceTransformer(model_name)

    query_embeddings = model.encode(queries, normalize_embeddings=True, show_progress_bar=True)
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True, show_progress_bar=True)

    # Compute similarities
    similarities = query_embeddings @ corpus_embeddings.T  # (n_queries, n_corpus)

    recall_at_1 = 0
    recall_at_5 = 0
    mrr = 0

    for i, relevant_idx in enumerate(relevant_indices):
        sims = similarities[i]
        ranked = np.argsort(-sims)

        rank = np.where(ranked == relevant_idx)[0][0] + 1  # 1-indexed rank

        if rank == 1:
            recall_at_1 += 1
        if rank <= 5:
            recall_at_5 += 1
        mrr += 1.0 / rank

    n = len(queries)
    return {
        "model": model_name,
        "recall@1": recall_at_1 / n,
        "recall@5": recall_at_5 / n,
        "mrr": mrr / n,
    }
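
To make the three metrics concrete, here is a minimal sketch that computes them directly from ranks. The helper name `retrieval_metrics` and the four ranks are illustrative assumptions, not part of the lesson's pipeline; each rank is the 1-indexed position at which a query's relevant document was retrieved.

```python
def retrieval_metrics(ranks: list[int], k: int = 5) -> dict[str, float]:
    """Recall@1, Recall@k and MRR from the 1-indexed rank of each query's relevant doc."""
    n = len(ranks)
    return {
        "recall@1": sum(r == 1 for r in ranks) / n,       # fraction ranked first
        f"recall@{k}": sum(r <= k for r in ranks) / n,    # fraction ranked in the top k
        "mrr": sum(1.0 / r for r in ranks) / n,           # mean of reciprocal ranks
    }


# Hypothetical ranks for four queries: relevant doc returned 1st, 3rd, 8th and 2nd.
metrics = retrieval_metrics([1, 3, 8, 2])
print(metrics)
```

With these ranks, one query in four hits at rank 1 and three in four land in the top 5, while MRR (roughly 0.49) rewards the near misses at ranks 2 and 3 that Recall@1 ignores.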

Contrastive Fine-Tuning: The Core Approach

The standard fine-tuning approach uses labeled (query, positive, negative) triplets with a contrastive loss.

Data format

# Training example format for contrastive fine-tuning
training_example = {
    "query": "What are the inclusion criteria for phase 3 oncology trials?",
    "positive": "Phase 3 oncology trials typically require ECOG performance status 0-2, "
                "adequate organ function, and no prior systemic treatment...",
    "negatives": [
        "Phase 1 trials focus on dose escalation and safety profiles...",
        "Oncology conferences provide networking opportunities for...",
    ]
}

The MultipleNegativesRankingLoss

The most commonly used loss for embedding fine-tuning is MultipleNegativesRankingLoss from the Sentence Transformers library. It treats all non-matching examples in the batch as negatives - essentially InfoNCE loss:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import InformationRetrievalEvaluator

def fine_tune_embedding_model(
    base_model: str,
    train_examples: list[dict],  # [{"query": str, "positive": str}]; other keys are ignored
    val_queries: dict[str, str],     # {qid: query_text}
    val_corpus: dict[str, str],      # {docid: doc_text}
    val_relevant: dict[str, set[str]],  # {qid: set of relevant docids}
    output_dir: str = "./fine-tuned-embedding",
    epochs: int = 1,
    batch_size: int = 64,
    warmup_steps: int = 100,
    learning_rate: float = 2e-5,
):
    """
    Fine-tune an embedding model using contrastive learning.
    """
    model = SentenceTransformer(base_model)

    # Convert to Sentence Transformers InputExample format
    # For MultipleNegativesRankingLoss: (anchor, positive) pairs
    # The loss uses all other positives in the batch as negatives
    input_examples = []
    for ex in train_examples:
        # Basic: just anchor-positive pairs (in-batch negatives are free)
        input_examples.append(
            InputExample(texts=[ex["query"], ex["positive"]])
        )

    train_dataloader = DataLoader(
        input_examples, shuffle=True, batch_size=batch_size
    )

    # MultipleNegativesRankingLoss = InfoNCE with in-batch negatives
    train_loss = losses.MultipleNegativesRankingLoss(model=model)

    # Evaluation on domain-specific retrieval benchmark
    evaluator = InformationRetrievalEvaluator(
        queries=val_queries,
        corpus=val_corpus,
        relevant_docs=val_relevant,
        name="domain-retrieval",
        # Rely on the evaluator's default score functions, which include cosine similarity
    )

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=warmup_steps,
        optimizer_params={"lr": learning_rate},
        output_path=output_dir,
        save_best_model=True,
        show_progress_bar=True,
    )

    return model
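
To see what the in-batch loss computes, the following sketch evaluates it by hand on a toy similarity matrix. It assumes cosine similarities have already been computed, that row i is query i and column i is its own positive, and that similarities are multiplied by a scale of 20 before the softmax (which, to my knowledge, matches the library default). The function name `in_batch_infonce` and both matrices are illustrative.

```python
import math


def in_batch_infonce(sim: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy over rows, where the correct 'class' for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over all candidates in the batch
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true positive (the diagonal entry)
        total += log_z - logits[i]
    return total / len(sim)


# Each query is clearly closest to its own positive (large diagonal).
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
# Each query is about as close to the other positives as to its own.
confused = [[0.5, 0.5, 0.4], [0.5, 0.5, 0.5], [0.4, 0.5, 0.5]]

loss_separated = in_batch_infonce(well_separated)
loss_confused = in_batch_infonce(confused)
print(f"well separated: {loss_separated:.4f}, confused: {loss_confused:.4f}")
```

The loss is near zero when the diagonal dominates and close to log(batch size) when the model cannot tell positives apart, which is why larger batches give a harder classification problem.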

Hard Negative Mining: Why It Matters

The quality of negatives is as important as the quality of positives. Random negatives are too easy - the model quickly learns to separate "medical query" from "sports article." Hard negatives - documents that are semantically similar to the query but not the correct answer - force the model to learn finer distinctions.

Example of easy vs hard negatives

Query: "What is the mechanism of action of metformin in Type 2 diabetes?"

Easy negative (random from corpus): "The 2023 UEFA Champions League final was held in Istanbul..." → The model easily learns to separate this from the query.

Hard negative: "Metformin is a biguanide drug used as first-line treatment for Type 2 diabetes, typically started at 500 mg once or twice daily with meals and titrated gradually to limit gastrointestinal side effects." → This is about metformin and diabetes, but it covers dosing rather than mechanism of action. The model must understand subtle differences to rank the positive higher.

Without hard negatives, the model plateaus quickly. With hard negatives, the model is forced to learn the specific distinctions your task requires.
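
A small numeric sketch shows why easy negatives stop teaching the model anything. It uses a hinge-style triplet loss on cosine similarities; the margin of 0.2 and the similarity values are made-up numbers for illustration.

```python
def triplet_loss(sim_pos: float, sim_neg: float, margin: float = 0.2) -> float:
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, sim_neg - sim_pos + margin)


sim_positive = 0.72  # query vs. the correct passage (hypothetical)

# Easy negative: a sports article, far from the query.
easy_loss = triplet_loss(sim_positive, sim_neg=0.05)

# Hard negative: a metformin dosing passage, nearly as close as the positive.
hard_loss = triplet_loss(sim_positive, sim_neg=0.68)

print(f"easy negative loss: {easy_loss:.2f}")
print(f"hard negative loss: {hard_loss:.2f}")
```

The easy negative already satisfies the margin, so its loss and gradient are exactly zero. Only the hard negative (loss of about 0.16 here) produces an update.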

Mining hard negatives

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

def mine_hard_negatives(
    queries: list[str],
    positives: list[str],  # positives[i] is the positive for queries[i]
    corpus: list[str],
    model: SentenceTransformer,
    n_hard_negatives: int = 5,
    positive_similarity_threshold: float = 0.75,
) -> list[dict]:
    """
    Mine hard negatives: documents close to the query but not the positive.

    Strategy:
    1. Embed all queries and corpus documents
    2. For each query, find the top-k nearest documents (k=100, by cosine similarity)
    3. Exclude the positive and documents whose similarity to the query exceeds the threshold
    4. Use the remaining top-n as hard negatives
    """
    print("Embedding queries...")
    query_embeddings = model.encode(queries, normalize_embeddings=True, batch_size=256)

    print("Embedding corpus...")
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True, batch_size=256)

    # Build FAISS index for fast nearest neighbor search
    dim = corpus_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)  # Inner product (= cosine for normalized vectors)
    index.add(corpus_embeddings.astype(np.float32))

    # For each query, find nearest documents
    k = 100  # Retrieve more than we need, then filter
    similarities, doc_indices = index.search(query_embeddings.astype(np.float32), k)

    # Build positive lookup
    positive_embeddings = model.encode(positives, normalize_embeddings=True, batch_size=256)
    positive_sims = (query_embeddings * positive_embeddings).sum(axis=1)  # Cosine sim to positive

    training_examples = []
    for i, (query, positive, pos_sim) in enumerate(zip(queries, positives, positive_sims)):
        hard_negs = []
        for j, (sim, doc_idx) in enumerate(zip(similarities[i], doc_indices[i])):
            doc = corpus[doc_idx]
            if doc == positive:
                continue  # Skip the positive
            if sim > positive_similarity_threshold:
                # Very close to the query - likely an unlabeled positive (false negative)
                continue
            hard_negs.append((sim, doc))
            if len(hard_negs) >= n_hard_negatives:
                break

        if hard_negs:
            training_examples.append({
                "query": query,
                "positive": positive,
                "negatives": [doc for _, doc in hard_negs],
            })

    print(f"Mined hard negatives for {len(training_examples)}/{len(queries)} queries")
    return training_examples


# Using hard negatives with TripletLoss
from sentence_transformers import losses, InputExample

def create_triplet_dataset(training_examples_with_negatives: list[dict]) -> list[InputExample]:
    """Convert hard-negative mined examples to triplet format."""
    triplets = []
    for ex in training_examples_with_negatives:
        for neg in ex["negatives"]:
            triplets.append(InputExample(
                texts=[ex["query"], ex["positive"], neg]
            ))
    return triplets

Synthetic Data Generation with LLMs

If you don't have labeled query-document pairs, LLMs can generate synthetic queries for your documents - a powerful approach when you have documents but no queries.

Query generation

import anthropic

client = anthropic.Anthropic()

def generate_queries_for_passage(
    passage: str,
    n_queries: int = 5,
    query_type: str = "information-seeking question"
) -> list[str]:
    """
    Generate synthetic queries for a document passage using Claude.
    These (passage, query) pairs form the basis of training data.
    """
    prompt = f"""Below is a passage from a document. Generate {n_queries} diverse
{query_type}s that this passage would be a good answer to.

Requirements:
- Each question should be something a real user would ask
- Questions should be answerable (at least partially) by the passage
- Questions should vary in phrasing and specificity
- Do not use the exact wording from the passage

Passage:
{passage}

Generate exactly {n_queries} questions, one per line, numbered 1-{n_queries}:"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Fast, cheap for bulk generation
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse numbered list
    lines = response.content[0].text.strip().split("\n")
    queries = []
    for line in lines:
        # Remove numbering (1., 2., etc.)
        query = line.strip()
        for prefix in ["1. ", "2. ", "3. ", "4. ", "5. ", "6. ", "7. ", "8. ", "9. ", "10. "]:
            if query.startswith(prefix):
                query = query[len(prefix):]
        if query and len(query) > 10:  # Filter very short queries
            queries.append(query)

    return queries[:n_queries]


def generate_synthetic_training_data(
    passages: list[str],
    n_queries_per_passage: int = 3,
    batch_size: int = 10,
) -> list[dict]:
    """
    Generate synthetic training data for all passages.
    Each passage gets n_queries_per_passage synthetic queries.
    """
    training_data = []

    for i in range(0, len(passages), batch_size):
        batch = passages[i:i + batch_size]
        for passage in batch:
            queries = generate_queries_for_passage(passage, n_queries_per_passage)
            for query in queries:
                training_data.append({
                    "query": query,
                    "positive": passage,
                })

        if i % 100 == 0:
            print(f"Generated queries for {i}/{len(passages)} passages")

    return training_data
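
The prefix-stripping loop in `generate_queries_for_passage` only handles `1. ` through `10. ` and keeps any preamble line the LLM adds. A regex-based parser is one possible alternative. The sketch below is an assumption about how you might write it, not the lesson's implementation: `parse_numbered_queries` is a hypothetical helper and the sample text is invented.

```python
import re

# Matches leading numbering such as "1. ", "2) " or "12. "
_NUMBERING = re.compile(r"^\s*\d+[.)]\s+")


def parse_numbered_queries(text: str, min_chars: int = 10) -> list[str]:
    """Keep only numbered lines, strip the numbering, drop very short queries."""
    queries = []
    for line in text.splitlines():
        match = _NUMBERING.match(line)
        if match is None:
            continue  # Preamble, blank lines and other chatter are skipped
        query = line[match.end():].strip()
        if len(query) > min_chars:
            queries.append(query)
    return queries


sample_response = """Here are the questions:
1. What ECOG status is required for enrollment?
2) Which prior treatments exclude a patient?
3. Why?
12. How is adequate organ function defined?"""

parsed = parse_numbered_queries(sample_response)
print(parsed)
```

Requiring the numbering to be present trades a little recall for cleaner data: an unnumbered but valid question would be dropped, which is usually acceptable for bulk generation.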

Quality filtering synthetic data

Not all LLM-generated queries are good training examples. Filter for quality:

def filter_synthetic_data(
    training_data: list[dict],
    filter_model: SentenceTransformer,
    min_query_passage_similarity: float = 0.3,
    max_query_length: int = 200,
    min_query_length: int = 10,
) -> list[dict]:
    """
    Filter synthetic training data for quality.
    Remove queries that are too similar to the passage (memorization)
    or too dissimilar (off-topic generation).
    """
    queries = [ex["query"] for ex in training_data]
    passages = [ex["positive"] for ex in training_data]

    q_embs = filter_model.encode(queries, normalize_embeddings=True, batch_size=512)
    p_embs = filter_model.encode(passages, normalize_embeddings=True, batch_size=512)

    similarities = (q_embs * p_embs).sum(axis=1)

    filtered = []
    for ex, sim in zip(training_data, similarities):
        query = ex["query"]
        if sim < min_query_passage_similarity:
            continue  # Query is off-topic
        if len(query) < min_query_length or len(query) > max_query_length:
            continue  # Query is too short or too long
        if sim > 0.9:
            continue  # Query is basically copying the passage
        filtered.append(ex)

    print(f"Filtered: {len(filtered)}/{len(training_data)} examples kept")
    return filtered

GPL: Generative Pseudo Labeling

GPL (Generative Pseudo Labeling, Wang et al. 2021) is a technique for domain adaptation when you have no labeled data at all. It generates synthetic queries for your documents and uses a cross-encoder to produce pseudo-relevance labels for them:

Generate queries from your domain documents (as above)
Retrieve candidate documents for each generated query using a general embedding model
Score candidate documents using a cross-encoder that estimates relevance
Use the cross-encoder score margins (positive score minus negative score) as labels to train the embedding model with MarginMSE loss

import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

def gpl_training_data_generation(
    corpus: list[str],
    bi_encoder_model: str = "BAAI/bge-large-en-v1.5",
    cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    n_queries_per_doc: int = 3,
    n_negatives: int = 5,
) -> list[dict]:
    """
    Generate GPL training data:
    1. Generate queries from documents
    2. Retrieve candidates using bi-encoder
    3. Score with cross-encoder
    4. Create margin-based training triplets
    """
    bi_encoder = SentenceTransformer(bi_encoder_model)
    cross_encoder = CrossEncoder(cross_encoder_model)

    # Step 1: Generate synthetic queries
    print("Generating synthetic queries...")
    synthetic_data = []
    for doc in corpus[:100]:  # Limit for demonstration
        queries = generate_queries_for_passage(doc, n_queries=n_queries_per_doc)
        for q in queries:
            synthetic_data.append({"query": q, "positive": doc})

    # Step 2: Encode corpus with bi-encoder
    print("Encoding corpus...")
    corpus_embeddings = bi_encoder.encode(
        corpus, normalize_embeddings=True, batch_size=256, show_progress_bar=True
    )

    dim = corpus_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(corpus_embeddings.astype(np.float32))

    # Step 3 & 4: For each query, find candidates and score with cross-encoder
    gpl_examples = []
    for example in synthetic_data:
        query = example["query"]
        query_emb = bi_encoder.encode([query], normalize_embeddings=True)

        # Retrieve top candidates (not including the positive)
        _, candidate_indices = index.search(query_emb.astype(np.float32), n_negatives + 5)
        candidate_docs = [corpus[idx] for idx in candidate_indices[0]
                         if corpus[idx] != example["positive"]][:n_negatives]

        if not candidate_docs:
            continue

        # Score positive + candidates with cross-encoder
        positive_score = cross_encoder.predict([[query, example["positive"]]])[0]
        negative_scores = cross_encoder.predict([[query, doc] for doc in candidate_docs])

        # Use the candidate with highest cross-encoder score as "hard negative"
        best_negative_idx = np.argmax(negative_scores)
        best_negative = candidate_docs[best_negative_idx]
        best_negative_score = negative_scores[best_negative_idx]

        # Only create a training example if positive score > negative score
        if positive_score > best_negative_score:
            gpl_examples.append({
                "query": query,
                "positive": example["positive"],
                "negative": best_negative,
                "margin": positive_score - best_negative_score,
            })

    return gpl_examples
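
The `margin` stored in each example is the training target. GPL trains the bi-encoder so that its own score gap between positive and negative matches the cross-encoder's gap, using a mean-squared-error objective (MarginMSE). The sketch below computes that loss by hand. The `student_pos` and `student_neg` keys and all numbers are hypothetical; in a real run they would be the bi-encoder's dot-product scores.

```python
def margin_mse_loss(batch: list[dict]) -> float:
    """Mean squared error between the student's score gap and the teacher's margin."""
    errors = [
        ((ex["student_pos"] - ex["student_neg"]) - ex["margin"]) ** 2
        for ex in batch
    ]
    return sum(errors) / len(errors)


batch = [
    # Student gap 4.5 vs. teacher margin 4.0: already close, small error
    {"student_pos": 7.5, "student_neg": 3.0, "margin": 4.0},
    # Student gap 0.5 vs. teacher margin 2.5: the student barely separates them
    {"student_pos": 5.0, "student_neg": 4.5, "margin": 2.5},
]

loss = margin_mse_loss(batch)
print(f"MarginMSE loss: {loss:.3f}")
```

Because the target is a continuous margin rather than a hard label, a negative that the cross-encoder scores almost as high as the positive only asks for a small gap, which makes the method tolerant of noisy synthetic queries and false negatives. Sentence Transformers ships a `MarginMSELoss` class for this objective; check its documentation for the expected input format before wiring it into training.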

TSDAE: Unsupervised Fine-Tuning

TSDAE (Transformer-based Sequential Denoising Auto-Encoder, Wang et al. 2021) is an unsupervised fine-tuning method that doesn't require any labeled pairs. It works by:

Corrupting input sentences by deleting 60% of tokens randomly
Training the model to reconstruct the original sentence from the embedding of the corrupted sentence
The bottleneck embedding must encode enough information to enable reconstruction → learns good sentence representations

TSDAE works with domain text alone - no labels needed. It's particularly useful as a first step before supervised fine-tuning.

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from torch.utils.data import DataLoader

def tsdae_unsupervised_training(
    base_model: str,
    domain_sentences: list[str],
    output_path: str = "./tsdae-model",
    epochs: int = 1,
    batch_size: int = 8,  # Small batch - TSDAE is memory intensive
    deletion_ratio: float = 0.6,
):
    """
    Unsupervised domain adaptation using TSDAE.
    Only requires domain text - no labels needed.
    """
    model = SentenceTransformer(base_model)

    # TSDAE corrupts sentences by random token deletion
    train_dataset = DenoisingAutoEncoderDataset(
        domain_sentences,
        noise_fn=lambda t: DenoisingAutoEncoderDataset.delete(t, deletion_ratio)
    )

    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # TSDAE loss: encoder-decoder with denoising objective
    train_loss = losses.DenoisingAutoEncoderLoss(
        model,
        decoder_name_or_path=base_model,
        tie_encoder_decoder=True,
    )

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        weight_decay=0,
        scheduler="constantlr",
        optimizer_params={"lr": 3e-5},
        show_progress_bar=True,
        output_path=output_path,
    )

    return model
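
To get a feel for how aggressive a 60% deletion ratio is, the sketch below applies the same kind of noise to one sentence. It is a simplified stand-in, assuming whitespace tokenization and a seeded random generator for repeatability; the library's own noise function tokenizes differently, and `delete_tokens` is a hypothetical name.

```python
import random


def delete_tokens(sentence: str, deletion_ratio: float = 0.6, seed: int = 0) -> str:
    """Drop each whitespace token independently with probability `deletion_ratio`."""
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return sentence
    kept = [tok for tok in tokens if rng.random() >= deletion_ratio]
    if not kept:
        # Never return an empty input: keep one random token
        kept = [rng.choice(tokens)]
    return " ".join(kept)


sentence = "Serious adverse events were reported in both arms of the phase 3 trial"
noisy = delete_tokens(sentence, deletion_ratio=0.6, seed=0)
print(noisy)
```

The surviving tokens keep their original order, so the decoder has to infer the missing words from a single sentence vector rather than from local context.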

Full Worked Example: Biomedical Domain Fine-Tuning

Here's a complete pipeline for fine-tuning BGE on medical Q&A:

from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader
from datasets import load_dataset
import numpy as np

def finetune_bge_medical(
    output_dir: str = "./bge-medical",
    epochs: int = 2,
):
    """
    Fine-tune BGE-large on medical Q&A for better clinical retrieval.
    Uses PubMedQA as the training data source.
    """
    # Step 1: Load base model
    base_model = "BAAI/bge-large-en-v1.5"
    model = SentenceTransformer(base_model)

    # Step 2: Load medical training data
    # Using PubMedQA (publicly available medical Q&A dataset)
    # In practice, you'd use your own domain data here
    dataset = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

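    # Convert to (query, positive, negative) format.
    # In PubMedQA, item["context"]["contexts"] holds the abstract sections and
    # item["context"]["labels"] holds section names (e.g. BACKGROUND, METHODS),
    # not relevance labels; the yes/no/maybe answer is in item["final_decision"].
    # So each question's own abstract is its positive, and a randomly chosen
    # abstract from a different question serves as an (easy) negative.
    questions = [item["question"] for item in dataset]
    abstracts = [" ".join(item["context"]["contexts"]) for item in dataset]

    rng = np.random.default_rng(42)
    train_examples = []
    for i, (question, positive) in enumerate(zip(questions, abstracts)):
        j = int(rng.integers(len(abstracts) - 1))
        if j >= i:
            j += 1  # Shift past index i so a question never gets its own abstract
        train_examples.append(InputExample(
            texts=[question, positive, abstracts[j]]
        ))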

    print(f"Training examples: {len(train_examples)}")

    # Step 3: Mine hard negatives using the base model
    # (Simplified here - in practice, use the mine_hard_negatives function above)

    # Step 4: Hold out examples for evaluation BEFORE building the dataloader,
    # so the evaluation queries never appear in the training data
    eval_size = min(200, len(train_examples) // 5)
    eval_examples = train_examples[-eval_size:]
    train_examples = train_examples[:-eval_size]

    val_queries = {str(i): ex.texts[0] for i, ex in enumerate(eval_examples)}
    val_corpus = {str(i): ex.texts[1] for i, ex in enumerate(eval_examples)}
    val_relevant = {str(i): {str(i)} for i in range(len(eval_examples))}

    # Step 5: Set up training
    # BGE v1.5 recommends prefixing *queries* (not passages) with
    # "Represent this sentence for searching relevant passages: " for retrieval.
    # This example omits the prefix for simplicity; if you use it at inference,
    # add it to the training queries as well so training and inference match.
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

    # Use TripletLoss since we have explicit positives and negatives.
    # Cosine distance matches the cosine-similarity retrieval used elsewhere;
    # the library default is Euclidean distance.
    train_loss = losses.TripletLoss(
        model=model,
        distance_metric=losses.TripletDistanceMetric.COSINE,
        triplet_margin=0.5,  # Required gap between negative and positive cosine distance
    )

    evaluator = InformationRetrievalEvaluator(
        queries=val_queries,
        corpus=val_corpus,
        relevant_docs=val_relevant,
        name="medical-retrieval",
    )

    # Step 6: Run training
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=len(train_dataloader) // 10,
        output_path=output_dir,
        save_best_model=True,
        show_progress_bar=True,
        evaluation_steps=500,
    )

    # Step 7: Compare base vs fine-tuned model on the held-out retrieval set.
    # BGE models end with a normalization layer, so embedding norms are always
    # 1.0 and say nothing about quality; compare retrieval metrics instead.
    print("\nBase model vs fine-tuned model comparison:")
    base = SentenceTransformer(base_model)
    finetuned = SentenceTransformer(output_dir)

    # The evaluator returns a single score or a dict of metrics, depending on
    # the installed sentence-transformers version; printing works for both.
    print("Base:", evaluator(base))
    print("Fine-tuned:", evaluator(finetuned))

    return finetuned

Production Engineering Notes

When to fine-tune vs use a general model

| Scenario | Recommendation |
| --- | --- |
| General web text, English | BGE-large or E5-large, no fine-tuning |
| Specialized domain, have 10k+ labeled pairs | Fine-tune on labeled data |
| Specialized domain, have text but no labels | TSDAE pre-training + GPL |
| Specialized domain, limited labeled data (< 1k) | GPL with synthetic queries |
| Multilingual domain | BGE-M3, consider fine-tuning |

Continuous fine-tuning

Domain language evolves. Medical terminology changes, new products release, regulations update. Build a continuous fine-tuning pipeline:

Monitor retrieval quality metrics in production (click-through rate on retrieved documents, user satisfaction signals)
When quality degrades, collect new labeled pairs from production (use user feedback, query-click data, or human annotation of hard cases)
Fine-tune from the current deployed model (not the original base model) to preserve domain knowledge
Evaluate on held-out benchmark before deploying updated model
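
The final step can be expressed as a simple promotion gate. The sketch below is one way to encode it, assuming you already have held-out metrics for both models as dictionaries. The function name `should_deploy` and the thresholds (`min_gain`, `max_regression`) are placeholders to tune for your system, not recommended values.

```python
def should_deploy(
    deployed: dict[str, float],
    candidate: dict[str, float],
    primary: str = "recall@5",
    min_gain: float = 0.01,
    max_regression: float = 0.02,
) -> bool:
    """Promote the candidate only if the primary metric improves enough
    and no tracked metric regresses by more than `max_regression`."""
    if candidate[primary] - deployed[primary] < min_gain:
        return False
    return all(candidate[m] >= deployed[m] - max_regression for m in deployed)


deployed_metrics = {"recall@5": 0.80, "mrr": 0.60}

# Clear gain on the primary metric, negligible MRR drop: promote.
good_candidate = {"recall@5": 0.84, "mrr": 0.59}
# Gain on the primary metric is below the threshold: keep the current model.
flat_candidate = {"recall@5": 0.805, "mrr": 0.62}

print(should_deploy(deployed_metrics, good_candidate))
print(should_deploy(deployed_metrics, flat_candidate))
```

Because each round starts from the deployed model, it is worth keeping the benchmark fixed across rounds so that small regressions do not accumulate unnoticed.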

Common Mistakes

:::danger Training with only in-batch negatives on small datasets

With small datasets and small batch sizes, in-batch negatives may all be obviously irrelevant to each query. The model learns to separate domains rather than to make fine-grained distinctions within the domain. Use hard negative mining when your dataset is under 10k examples, or use large batch sizes (128+) to increase the chance of challenging in-batch negatives.

:::

:::danger Not filtering synthetic data quality

LLMs generate queries that are sometimes off-topic, too generic, or essentially paraphrase the passage. Training on these degrades model quality. Always filter synthetic data using a similarity threshold - reject queries where the embedding similarity to their passage is too low (off-topic) or too high (paraphrase).

:::

:::warning Fine-tuning for too many epochs

Embedding models overfit quickly on small domain datasets. With fewer than 10k training examples, 1-3 epochs is typically optimal. More epochs decrease performance on general tasks without improving domain performance significantly. Monitor your evaluation metric at each epoch and stop early when it plateaus.

:::
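
Stopping when the metric plateaus can be made mechanical with a patience rule. The sketch below assumes you record one validation score per epoch (higher is better) and keep the checkpoint from the best epoch. The function name, the `patience` value and the scores are illustrative.

```python
def best_epoch_with_patience(
    val_scores: list[float], patience: int = 2, min_delta: float = 0.0
) -> int:
    """Return the 1-indexed epoch to keep, stopping after `patience`
    consecutive epochs without an improvement larger than `min_delta`."""
    best_epoch, best_score, waited = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score + min_delta:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break  # Plateau reached: later epochs are never run
    return best_epoch


# Hypothetical Recall@5 on the validation set after each epoch.
scores = [0.61, 0.68, 0.70, 0.69, 0.69, 0.74]
keep = best_epoch_with_patience(scores, patience=2)
print(f"keep checkpoint from epoch {keep}")
```

Here training stops after epoch 5 and keeps epoch 3. The later jump at epoch 6 is never seen, which is the usual trade-off of early stopping: a larger `patience` costs more compute but is less likely to stop on a temporary dip.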

:::tip Always evaluate on your domain before and after fine-tuning

The improvement from fine-tuning varies dramatically by domain and dataset quality. Some domain datasets improve retrieval by 30+ percentage points; others show minimal improvement. Always measure the baseline (general model) and compare to the fine-tuned model on your held-out test set. This tells you if fine-tuning was worth the effort.

:::

Interview Q&A

Q1: Why do general embedding models underperform on domain-specific text?

Three reasons. Vocabulary gap: specialized terms (medical abbreviations, legal jargon, technical acronyms) appear rarely in general pre-training data, so the model has weak representations for them. Conceptual framework gap: the same word means different things in different domains - "model" in ML discourse vs general English, "arm" in clinical trials vs anatomy. Asymmetry mismatch: domain Q&A often has short questions and long passage answers that general models weren't trained to handle asymmetrically for the specific domain. The result is embedding models that associate domain-specific queries with irrelevant documents rather than the correct domain-specific passages.

Q2: What is hard negative mining and why is it important?

Hard negatives are training examples that are semantically similar to the query but not actually relevant - the "almost right" documents that force the model to learn fine distinctions. Without hard negatives, training proceeds with random negatives (randomly sampled corpus documents), which are obviously different from the query. The model quickly learns to separate "medical query" from "sports article" but fails to distinguish "metformin mechanism" from "metformin dosage" - because both are about metformin, and random sampling almost never surfaces such near-miss documents as negatives.

Hard negative mining works by using an initial embedding model to retrieve the top-k nearest documents for each query, then using these semantically close but non-relevant documents as negatives. This forces the model to learn the subtle distinctions that matter for domain-specific retrieval.

Q3: What is GPL and when would you use it?

GPL (Generative Pseudo Labeling) is a domain adaptation technique for when you have domain text but no labeled query-document pairs. The pipeline: (1) generate synthetic queries from each document using an LLM, (2) retrieve candidate documents for each synthetic query using a general embedding model, (3) score the synthetic positive (the source document) and retrieved candidates using a cross-encoder that estimates relevance, (4) use cross-encoder scores as pseudo-labels to train the embedding model.

Use GPL when: you have domain documents but no labeled query-document pairs, you can't afford a labeling effort, and a general embedding model performs unacceptably on your domain. GPL typically improves retrieval by 5-15 percentage points versus a general model, with no human labels required.

Q4: How many training examples do you need for effective embedding fine-tuning?

With high-quality labeled pairs: 1,000 examples can produce meaningful improvement; 10,000 is sufficient for most domains; 100,000+ yields robust fine-tuning.

With synthetic data: generate 3-10 queries per document and filter for quality. For a corpus of 10,000 documents, this yields 30,000-100,000 training examples.

With GPL: the volume scales with your corpus size. For a corpus of 10,000 documents with 3 queries per document and 5 hard negatives per query, you'd generate ~150,000 training triplets.

Training beyond 3 epochs rarely helps with small datasets; use early stopping based on validation retrieval quality.

Summary

Fine-tuning embedding models for domain-specific text is one of the highest-ROI investments in a RAG pipeline. The improvement from a well-fine-tuned domain model over a general model can reach 10-30 percentage points in retrieval quality when the domain gap is large, so measure it on your own held-out set.

Key techniques:

Contrastive fine-tuning with (query, positive, negative) triplets: the standard approach when you have labeled pairs
Hard negative mining: significantly improves training signal by using semantically similar but irrelevant documents as negatives
Synthetic data generation: use LLMs to generate queries from your domain passages when you have no labeled data
GPL: full domain adaptation pipeline requiring no labels - generates queries, retrieves candidates, scores with cross-encoder
TSDAE: unsupervised domain adaptation from raw text alone - useful as pre-training before supervised fine-tuning

The standard library is Sentence Transformers. Fine-tune from a strong base model (BGE-large or E5-large) rather than from BERT, as the pre-trained embedding quality matters for the fine-tuning starting point.
