Fine-Tuning Embedding Models for Your Domain
Reading time: 28 min | Relevance: AI Engineer, ML Engineer, Research Engineer
When the General Model Fails
You're building a RAG system for a biomedical company. You index 50,000 clinical trial documents. You use BGE-large - top of the MTEB leaderboard, state-of-the-art. You query: "adverse events in phase 3 oncology studies." You retrieve... studies about event planning software. The query "adverse events" retrieves documents about corporate events because "adverse" is rare in general text and the model associates "events" with scheduled gatherings.
You switch to a domain-tuned model such as voyage-finance-2. Better, but still wrong - clinical trial language has its own vocabulary that neither financial nor general models understand. "SAE" means serious adverse event, not a financial instrument. "Endpoint" is a clinical outcome, not a network endpoint. "Arms" are trial groups, not body parts.
This is the domain gap problem. General embedding models are trained on general text. When your domain uses specialized terminology, abbreviations, and conceptual frameworks that rarely appear in general pre-training data, general models fail. The solution is fine-tuning - training the embedding model on examples from your domain.
Fine-tuned domain-specific embedding models can outperform general models on domain-specific tasks by 10-30 percentage points in retrieval quality, though the gain varies widely with the domain and the quality of the training data. The barrier is data: you need (query, positive document, negative documents) triplets to train with contrastive learning. This lesson covers how to get that data - including when you don't have much labeled data - and how to run the fine-tuning.
Why General Embeddings Underperform on Specialized Text
Vocabulary mismatch
General embedding models learn representations based on word co-occurrence in general text. Specialized vocabulary - clinical terms, legal jargon, programming idioms, financial terminology - appears rarely in general pre-training data. The model has weak or no representations for these terms.
Conceptual framework mismatch
In general text, "model" often refers to a scale model, a fashion model, or a product version. In ML discourse, it refers to a trained neural network. A general embedding model may only weakly associate ML-context "model" with training, inference, and evaluation.
Retrieval asymmetry mismatch
In domain Q&A, questions are often short ("What is the standard of care for sepsis?") while answers are long paragraphs from clinical guidelines. General models may not have been trained on this short-query, long-passage pairing for text from your domain.
Quantifying the gap
Benchmark your target domain with a small sample of 100-200 query-document pairs with relevance labels:
from sentence_transformers import SentenceTransformer
import numpy as np

def evaluate_model_on_domain(
    model_name: str,
    queries: list[str],
    relevant_docs: list[str],
    corpus: list[str],  # All documents including relevant and irrelevant
    relevant_indices: list[int],  # Index in corpus for each query's relevant doc
) -> dict:
    """
    Evaluate an embedding model on domain-specific retrieval.
    Computes Recall@1, Recall@5, MRR.
    """
    model = SentenceTransformer(model_name)
    query_embeddings = model.encode(queries, normalize_embeddings=True, show_progress_bar=True)
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True, show_progress_bar=True)

    # Compute similarities
    similarities = query_embeddings @ corpus_embeddings.T  # (n_queries, n_corpus)

    recall_at_1 = 0
    recall_at_5 = 0
    mrr = 0
    for i, relevant_idx in enumerate(relevant_indices):
        sims = similarities[i]
        ranked = np.argsort(-sims)
        rank = np.where(ranked == relevant_idx)[0][0] + 1  # 1-indexed rank
        if rank == 1:
            recall_at_1 += 1
        if rank <= 5:
            recall_at_5 += 1
        mrr += 1.0 / rank

    n = len(queries)
    return {
        "model": model_name,
        "recall@1": recall_at_1 / n,
        "recall@5": recall_at_5 / n,
        "mrr": mrr / n,
    }
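The metric arithmetic in the function above can be checked without downloading a model. The following is a minimal sketch that assumes a hand-made similarity matrix; the helper name `retrieval_metrics` and the toy numbers are illustrative and not part of the original code.

```python
import numpy as np

def retrieval_metrics(similarities: np.ndarray, relevant_indices: list, k: int = 5) -> dict:
    """Recall@1, Recall@k and MRR, assuming one relevant document per query."""
    ranks = []
    for row, relevant_idx in zip(similarities, relevant_indices):
        order = np.argsort(-row)  # corpus indices, best match first
        ranks.append(int(np.where(order == relevant_idx)[0][0]) + 1)  # 1-indexed rank
    ranks = np.array(ranks)
    return {
        "recall@1": float(np.mean(ranks == 1)),
        f"recall@{k}": float(np.mean(ranks <= k)),
        "mrr": float(np.mean(1.0 / ranks)),
    }

# Toy data (hypothetical): 2 queries scored against 3 documents
toy_similarities = np.array([
    [0.9, 0.1, 0.3],  # query 0: relevant doc 0 is ranked 1st
    [0.2, 0.4, 0.8],  # query 1: relevant doc 1 is ranked 2nd
])
metrics = retrieval_metrics(toy_similarities, relevant_indices=[0, 1], k=2)
print(metrics)
```

Running the same helper on similarity matrices from a general model and a fine-tuned model gives a like-for-like comparison on your labeled sample.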
Contrastive Fine-Tuning: The Core Approach
The standard fine-tuning approach uses labeled (query, positive, negative) triplets with a contrastive loss.
Data format
# Training example format for contrastive fine-tuning
training_example = {
    "query": "What are the inclusion criteria for phase 3 oncology trials?",
    "positive": "Phase 3 oncology trials typically require ECOG performance status 0-2, "
                "adequate organ function, and no prior systemic treatment...",
    "negatives": [
        "Phase 1 trials focus on dose escalation and safety profiles...",
        "Oncology conferences provide networking opportunities for...",
    ]
}
The MultipleNegativesRankingLoss
The most commonly used loss for embedding fine-tuning is MultipleNegativesRankingLoss from the Sentence Transformers library. It treats all non-matching examples in the batch as negatives - essentially InfoNCE loss:
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import InformationRetrievalEvaluator

def fine_tune_embedding_model(
    base_model: str,
    train_examples: list[dict],  # [{"query": str, "positive": str, ...}]; only these two keys are used
    val_queries: dict[str, str],  # {qid: query_text}
    val_corpus: dict[str, str],  # {docid: doc_text}
    val_relevant: dict[str, set[str]],  # {qid: set of relevant docids}
    output_dir: str = "./fine-tuned-embedding",
    epochs: int = 1,
    batch_size: int = 64,
    warmup_steps: int = 100,
    learning_rate: float = 2e-5,
):
    """
    Fine-tune an embedding model using contrastive learning.
    """
    model = SentenceTransformer(base_model)

    # Convert to Sentence Transformers InputExample format
    # For MultipleNegativesRankingLoss: (anchor, positive) pairs
    # The loss uses all other positives in the batch as negatives
    input_examples = []
    for ex in train_examples:
        # Basic: just anchor-positive pairs (in-batch negatives are free)
        input_examples.append(
            InputExample(texts=[ex["query"], ex["positive"]])
        )

    train_dataloader = DataLoader(
        input_examples, shuffle=True, batch_size=batch_size
    )

    # MultipleNegativesRankingLoss = InfoNCE with in-batch negatives
    train_loss = losses.MultipleNegativesRankingLoss(model=model)

    # Evaluation on domain-specific retrieval benchmark
    evaluator = InformationRetrievalEvaluator(
        queries=val_queries,
        corpus=val_corpus,
        relevant_docs=val_relevant,
        name="domain-retrieval",
        # util.cos_sim normalizes its inputs; a raw x @ y.T is only a dot product
        score_functions={"cos_sim": util.cos_sim},
    )

    # model.fit is the legacy training API; newer releases of the library
    # also provide SentenceTransformerTrainer
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=warmup_steps,
        optimizer_params={"lr": learning_rate},
        output_path=output_dir,
        save_best_model=True,
        show_progress_bar=True,
    )
    return model
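To make the in-batch-negatives idea concrete, here is a minimal NumPy sketch of an InfoNCE-style loss over a batch of (query, positive) embeddings. It is an illustration under stated assumptions rather than the library's implementation: the function name `in_batch_infonce` is hypothetical, and the scale of 20 is chosen to mirror what is understood to be the default of `MultipleNegativesRankingLoss`.

```python
import numpy as np

def in_batch_infonce(query_emb: np.ndarray, pos_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities; row i's correct column is i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = scale * (q @ p.T)  # (batch, batch): every other positive acts as a negative
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy embeddings (hypothetical): three orthogonal unit vectors
aligned = np.eye(3)
loss_aligned = in_batch_infonce(aligned, aligned)              # each query matches its own positive
loss_shuffled = in_batch_infonce(aligned, aligned[[1, 2, 0]])  # each query is paired with the wrong passage
print(loss_aligned, loss_shuffled)
```

The loss is close to zero when each query is nearest to its own positive and large when another passage in the batch scores higher, which is why larger batches supply more (and sometimes harder) negatives at no labeling cost.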
Hard Negative Mining: Why It Matters
The quality of negatives is as important as the quality of positives. Random negatives are too easy - the model quickly learns to separate "medical query" from "sports article." Hard negatives - documents that are semantically similar to the query but not the correct answer - force the model to learn finer distinctions.
Example of easy vs hard negatives
Query: "What is the mechanism of action of metformin in Type 2 diabetes?"
Easy negative (random from corpus): "The 2023 UEFA Champions League final was held in Istanbul..." → The model easily learns to separate this from the query.
Hard negative: "Metformin is typically started at a low dose taken with meals and titrated gradually to limit gastrointestinal side effects." → This is about metformin and diabetes, but it covers dosing rather than mechanism. The model must understand that difference to rank the positive higher.
Without hard negatives, the model plateaus quickly. With hard negatives, the model is forced to learn the specific distinctions your task requires.
Mining hard negatives
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

def mine_hard_negatives(
    queries: list[str],
    positives: list[str],  # positives[i] is the positive for queries[i]
    corpus: list[str],
    model: SentenceTransformer,
    n_hard_negatives: int = 5,
    positive_similarity_threshold: float = 0.75,
) -> list[dict]:
    """
    Mine hard negatives: documents close to the query but not the positive.
    Strategy:
    1. Embed all queries and corpus documents
    2. For each query, find the top-100 nearest documents (by cosine similarity)
    3. Exclude the positive and any document whose similarity to the query
       exceeds the threshold (likely an unlabeled positive)
    4. Use the remaining top-n as hard negatives
    """
    print("Embedding queries...")
    query_embeddings = model.encode(queries, normalize_embeddings=True, batch_size=256)
    print("Embedding corpus...")
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True, batch_size=256)

    # Build FAISS index for fast nearest neighbor search
    dim = corpus_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)  # Inner product (= cosine for normalized vectors)
    index.add(corpus_embeddings.astype(np.float32))

    # For each query, find nearest documents
    k = min(100, len(corpus))  # Retrieve more than we need, then filter
    similarities, doc_indices = index.search(query_embeddings.astype(np.float32), k)

    training_examples = []
    for i, (query, positive) in enumerate(zip(queries, positives)):
        hard_negs = []
        for sim, doc_idx in zip(similarities[i], doc_indices[i]):
            doc = corpus[doc_idx]
            if doc == positive:
                continue  # Skip the positive
            if sim > positive_similarity_threshold:
                # Very close to the query - might be a false negative
                continue
            hard_negs.append((sim, doc))
            if len(hard_negs) >= n_hard_negatives:
                break
        if hard_negs:
            training_examples.append({
                "query": query,
                "positive": positive,
                "negatives": [doc for _, doc in hard_negs],
            })

    print(f"Mined hard negatives for {len(training_examples)}/{len(queries)} queries")
    return training_examples

# Using hard negatives with TripletLoss
from sentence_transformers import losses, InputExample

def create_triplet_dataset(training_examples_with_negatives: list[dict]) -> list[InputExample]:
    """Convert hard-negative mined examples to triplet format."""
    triplets = []
    for ex in training_examples_with_negatives:
        for neg in ex["negatives"]:
            triplets.append(InputExample(
                texts=[ex["query"], ex["positive"], neg]
            ))
    return triplets
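Why easy negatives stop teaching the model shows up directly in the triplet objective. The sketch below is a plain NumPy illustration, assuming Euclidean distance and a margin of 5.0; these are believed to match the defaults of the library's `TripletLoss` but should be checked against your installed version. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor: np.ndarray, positive: np.ndarray,
                        negative: np.ndarray, margin: float = 5.0) -> float:
    """mean(max(d(anchor, positive) - d(anchor, negative) + margin, 0))"""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy 2-D embeddings (hypothetical)
anchor = np.array([[0.0, 0.0]])
positive = np.array([[1.0, 0.0]])
easy_negative = np.array([[10.0, 0.0]])  # already far from the query
hard_negative = np.array([[2.0, 0.0]])   # nearly as close as the positive

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

The easy negative already satisfies the margin, so it contributes zero loss and no gradient. Only the hard negative produces a training signal.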
Synthetic Data Generation with LLMs
If you don't have labeled query-document pairs, LLMs can generate synthetic queries for your documents - a powerful approach when you have documents but no queries.
Query generation
import re
import anthropic

client = anthropic.Anthropic()

def generate_queries_for_passage(
    passage: str,
    n_queries: int = 5,
    query_type: str = "information-seeking question"
) -> list[str]:
    """
    Generate synthetic queries for a document passage using Claude.
    These (passage, query) pairs form the basis of training data.
    """
    prompt = f"""Below is a passage from a document. Generate {n_queries} diverse
{query_type}s that this passage would be a good answer to.
Requirements:
- Each question should be something a real user would ask
- Questions should be answerable (at least partially) by the passage
- Questions should vary in phrasing and specificity
- Do not use the exact wording from the passage
Passage:
{passage}
Generate exactly {n_queries} questions, one per line, numbered 1-{n_queries}:"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Fast, cheap for bulk generation
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse numbered list
    lines = response.content[0].text.strip().split("\n")
    queries = []
    for line in lines:
        # Remove leading numbering such as "1. " or "10. "
        query = re.sub(r"^\d+\.\s*", "", line.strip())
        if len(query) > 10:  # Filter empty and very short lines
            queries.append(query)
    return queries[:n_queries]

def generate_synthetic_training_data(
    passages: list[str],
    n_queries_per_passage: int = 3,
    batch_size: int = 10,
) -> list[dict]:
    """
    Generate synthetic training data for all passages.
    Each passage gets n_queries_per_passage synthetic queries.
    """
    training_data = []
    for i in range(0, len(passages), batch_size):
        batch = passages[i:i + batch_size]
        for passage in batch:
            queries = generate_queries_for_passage(passage, n_queries_per_passage)
            for query in queries:
                training_data.append({
                    "query": query,
                    "positive": passage,
                })
        if i % 100 == 0:
            print(f"Generated queries for {i}/{len(passages)} passages")
    return training_data
Quality filtering synthetic data
Not all LLM-generated queries are good training examples. Filter for quality:
def filter_synthetic_data(
    training_data: list[dict],
    filter_model: SentenceTransformer,
    min_query_passage_similarity: float = 0.3,
    max_query_passage_similarity: float = 0.9,
    max_query_length: int = 200,
    min_query_length: int = 10,
) -> list[dict]:
    """
    Filter synthetic training data for quality.
    Remove queries that are too similar to the passage (memorization)
    or too dissimilar (off-topic generation).
    """
    queries = [ex["query"] for ex in training_data]
    passages = [ex["positive"] for ex in training_data]
    q_embs = filter_model.encode(queries, normalize_embeddings=True, batch_size=512)
    p_embs = filter_model.encode(passages, normalize_embeddings=True, batch_size=512)
    similarities = (q_embs * p_embs).sum(axis=1)  # Row-wise cosine similarity

    filtered = []
    for ex, sim in zip(training_data, similarities):
        query = ex["query"]
        if sim < min_query_passage_similarity:
            continue  # Query is off-topic
        if len(query) < min_query_length or len(query) > max_query_length:
            continue  # Query is too short or too long (length in characters)
        if sim > max_query_passage_similarity:
            continue  # Query is basically copying the passage
        filtered.append(ex)

    print(f"Filtered: {len(filtered)}/{len(training_data)} examples kept")
    return filtered
GPL: Generative Pseudo Labeling
GPL (Generative Pseudo Labeling, Wang et al. 2021) is a technique for domain adaptation when you have no labeled data at all. It pairs a query generator with a cross-encoder that supplies pseudo-relevance labels:
- Generate queries from your domain documents (as above)
- Retrieve candidate documents for each generated query using a general embedding model
- Score candidate documents using a cross-encoder that estimates relevance
- Use the cross-encoder score margins as labels to train the embedding model with MarginMSE loss
import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

def gpl_training_data_generation(
    corpus: list[str],
    bi_encoder_model: str = "BAAI/bge-large-en-v1.5",
    cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    n_queries_per_doc: int = 3,
    n_negatives: int = 5,
) -> list[dict]:
    """
    Generate GPL training data:
    1. Generate queries from documents
    2. Retrieve candidates using bi-encoder
    3. Score with cross-encoder
    4. Create margin-based training triplets
    """
    bi_encoder = SentenceTransformer(bi_encoder_model)
    cross_encoder = CrossEncoder(cross_encoder_model)

    # Step 1: Generate synthetic queries
    # (generate_queries_for_passage is defined in the query generation section above)
    print("Generating synthetic queries...")
    synthetic_data = []
    for doc in corpus[:100]:  # Limit for demonstration
        queries = generate_queries_for_passage(doc, n_queries=n_queries_per_doc)
        for q in queries:
            synthetic_data.append({"query": q, "positive": doc})

    # Step 2: Encode corpus with bi-encoder
    print("Encoding corpus...")
    corpus_embeddings = bi_encoder.encode(
        corpus, normalize_embeddings=True, batch_size=256, show_progress_bar=True
    )
    dim = corpus_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(corpus_embeddings.astype(np.float32))

    # Step 3 & 4: For each query, find candidates and score with cross-encoder
    gpl_examples = []
    for example in synthetic_data:
        query = example["query"]
        query_emb = bi_encoder.encode([query], normalize_embeddings=True)

        # Retrieve top candidates (not including the positive)
        _, candidate_indices = index.search(query_emb.astype(np.float32), n_negatives + 5)
        candidate_docs = [corpus[idx] for idx in candidate_indices[0]
                          if corpus[idx] != example["positive"]][:n_negatives]
        if not candidate_docs:
            continue

        # Score positive + candidates with cross-encoder
        positive_score = cross_encoder.predict([[query, example["positive"]]])[0]
        negative_scores = cross_encoder.predict([[query, doc] for doc in candidate_docs])

        # Use the candidate with highest cross-encoder score as "hard negative"
        best_negative_idx = np.argmax(negative_scores)
        best_negative = candidate_docs[best_negative_idx]
        best_negative_score = negative_scores[best_negative_idx]

        # Only create a training example if positive score > negative score
        if positive_score > best_negative_score:
            gpl_examples.append({
                "query": query,
                "positive": example["positive"],
                "negative": best_negative,
                "margin": positive_score - best_negative_score,
            })
    return gpl_examples
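GPL as described by Wang et al. trains the bi-encoder with a MarginMSE objective: the student's score gap between positive and negative is regressed onto the cross-encoder's gap, which is what the `margin` field above stores. The following is a minimal NumPy sketch of that objective. The function name and the toy scores are illustrative assumptions, not output from a real model.

```python
import numpy as np

def margin_mse(student_pos: np.ndarray, student_neg: np.ndarray,
               teacher_margin: np.ndarray) -> float:
    """Mean squared error between student and teacher score margins."""
    student_margin = student_pos - student_neg
    return float(np.mean((student_margin - teacher_margin) ** 2))

# Toy scores (hypothetical): two (query, positive, negative) triplets
student_pos = np.array([0.8, 0.6])     # bi-encoder score(query, positive)
student_neg = np.array([0.3, 0.5])     # bi-encoder score(query, negative)
teacher_margin = np.array([0.5, 4.0])  # cross-encoder score gap for the same triplets

loss = margin_mse(student_pos, student_neg, teacher_margin)
print(loss)
```

The first triplet already matches the teacher's margin and contributes nothing. The second, where the cross-encoder sees a far larger gap than the bi-encoder does, dominates the loss. Because the target is a soft margin rather than a hard label, noisy synthetic queries are penalized less harshly than they would be with a binary positive/negative objective.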
TSDAE: Unsupervised Fine-Tuning
TSDAE (Transformer-based Sequential Denoising Auto-Encoder, Wang et al. 2021) is an unsupervised fine-tuning method that doesn't require any labeled pairs. It works by:
- Corrupting input sentences by deleting 60% of tokens randomly
- Training the model to reconstruct the original sentence from the corrupted embedding
- The bottleneck embedding must encode enough information to enable reconstruction → learns good sentence representations
TSDAE works with domain text alone - no labels needed. It's particularly useful as a first step before supervised fine-tuning.
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from torch.utils.data import DataLoader

def tsdae_unsupervised_training(
    base_model: str,
    domain_sentences: list[str],
    output_path: str = "./tsdae-model",
    epochs: int = 1,
    batch_size: int = 8,  # Small batch - TSDAE is memory intensive
    deletion_ratio: float = 0.6,
):
    """
    Unsupervised domain adaptation using TSDAE.
    Only requires domain text - no labels needed.
    """
    model = SentenceTransformer(base_model)

    # TSDAE corrupts sentences by random token deletion
    train_dataset = DenoisingAutoEncoderDataset(
        domain_sentences,
        noise_fn=lambda t: DenoisingAutoEncoderDataset.delete(t, deletion_ratio)
    )
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # TSDAE loss: encoder-decoder with denoising objective
    train_loss = losses.DenoisingAutoEncoderLoss(
        model,
        decoder_name_or_path=base_model,
        tie_encoder_decoder=True,
    )

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        weight_decay=0,
        scheduler="constantlr",
        optimizer_params={"lr": 3e-5},
        show_progress_bar=True,
        output_path=output_path,
    )
    return model
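The corruption step can be illustrated without the library. This is a simplified sketch that assumes whitespace tokenization, whereas the library's own noise function tokenizes differently. The helper name `delete_tokens` and the example sentence are hypothetical.

```python
import random

def delete_tokens(text: str, deletion_ratio: float = 0.6, seed: int = 0) -> str:
    """Drop each whitespace token with probability deletion_ratio, keeping at least one."""
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return text
    kept = [tok for tok in tokens if rng.random() > deletion_ratio]
    if not kept:
        kept = [rng.choice(tokens)]  # never hand the encoder an empty input
    return " ".join(kept)

sentence = "patients receiving the study drug reported fewer serious adverse events"
noisy = delete_tokens(sentence, deletion_ratio=0.6, seed=0)
print(noisy)
```

The encoder only ever sees the corrupted text, while the decoder must reproduce the full sentence. That is what pushes the pooled embedding to carry the sentence's meaning rather than its surface tokens.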
Full Worked Example: Biomedical Domain Fine-Tuning
Here's a complete pipeline for fine-tuning BGE on medical Q&A:
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from torch.utils.data import DataLoader
from datasets import load_dataset

def finetune_bge_medical(
    output_dir: str = "./bge-medical",
    epochs: int = 2,
):
    """
    Fine-tune BGE-large on medical Q&A for better clinical retrieval.
    Uses PubMedQA as the training data source.
    """
    # Step 1: Load base model
    base_model = "BAAI/bge-large-en-v1.5"
    model = SentenceTransformer(base_model)

    # Step 2: Load medical training data
    # Using PubMedQA (publicly available medical Q&A dataset)
    # In practice, you'd use your own domain data here
    dataset = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

    # Convert to (query, positive) pairs.
    # In PubMedQA, item["context"]["labels"] holds abstract section headings
    # (e.g. BACKGROUND, RESULTS), not relevance labels, so filtering contexts
    # on "yes"/"no" would produce no examples. Instead, pair each question
    # with its own abstract and let the rest of the batch act as negatives.
    examples = []
    for item in dataset:
        question = item["question"]
        abstract = " ".join(item["context"]["contexts"])
        if question and abstract:
            examples.append(InputExample(texts=[question, abstract]))

    # Step 3: Hold out evaluation examples BEFORE building the dataloader,
    # so evaluation queries never appear in training
    eval_size = min(200, len(examples) // 5)
    eval_examples = examples[-eval_size:]
    train_examples = examples[:-eval_size]
    print(f"Training examples: {len(train_examples)}, eval examples: {len(eval_examples)}")

    # Step 4: Mine hard negatives using the base model
    # (Skipped here - in practice, use the mine_hard_negatives function above)

    # Step 5: Set up training
    # BGE v1.5 recommends prefixing short queries (not passages) with
    # "Represent this sentence for searching relevant passages: " at retrieval time.
    # It is left out here for simplicity; if you use it at inference,
    # apply it to the training queries as well.
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
    # Pairs only, so use in-batch negatives rather than TripletLoss
    train_loss = losses.MultipleNegativesRankingLoss(model=model)

    val_queries = {str(i): ex.texts[0] for i, ex in enumerate(eval_examples)}
    val_corpus = {str(i): ex.texts[1] for i, ex in enumerate(eval_examples)}
    val_relevant = {str(i): {str(i)} for i in range(len(eval_examples))}
    evaluator = InformationRetrievalEvaluator(
        queries=val_queries,
        corpus=val_corpus,
        relevant_docs=val_relevant,
        name="medical-retrieval",
    )

    # Step 6: Run training
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=len(train_dataloader) // 10,
        output_path=output_dir,
        save_best_model=True,
        show_progress_bar=True,
        evaluation_steps=500,
    )

    # Step 7: Compare base vs fine-tuned model on the held-out retrieval set
    # (embedding norms say nothing about retrieval quality, so score both models)
    base = SentenceTransformer(base_model)
    finetuned = SentenceTransformer(output_dir)
    print("\nBase model vs fine-tuned model comparison:")
    print("Base:", evaluator(base))
    print("Fine-tuned:", evaluator(finetuned))
    return finetuned
Production Engineering Notes
When to fine-tune vs use a general model
| Scenario | Recommendation |
|---|---|
| General web text, English | BGE-large or E5-large, no fine-tuning |
| Specialized domain, have 10k+ labeled pairs | Fine-tune on labeled data |
| Specialized domain, have text but no labels | TSDAE pre-training + GPL |
| Specialized domain, limited labeled data (< 1k) | GPL with synthetic queries |
| Multilingual domain | BGE-M3, consider fine-tuning |
Continuous fine-tuning
Domain language evolves. Medical terminology changes, new products release, regulations update. Build a continuous fine-tuning pipeline:
- Monitor retrieval quality metrics in production (click-through rate on retrieved documents, user satisfaction signals)
- When quality degrades, collect new labeled pairs from production (use user feedback, query-click data, or human annotation of hard cases)
- Fine-tune from the current deployed model (not the original base model) to preserve domain knowledge
- Evaluate on held-out benchmark before deploying updated model
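The monitoring step can be sketched as a rolling check against a baseline. This is a minimal illustration, not a production design: the class name `RetrievalQualityMonitor`, the choice of a binary "hit" signal (for example, whether the user clicked a retrieved document), and the window and tolerance values are all assumptions.

```python
from collections import deque

class RetrievalQualityMonitor:
    """Tracks a rolling hit rate and flags when it falls below a baseline."""

    def __init__(self, baseline: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # most recent hit/miss signals

    def record(self, hit: bool) -> None:
        self.outcomes.append(1.0 if hit else 0.0)

    def should_retrain(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rolling = sum(self.outcomes) / len(self.outcomes)
        return rolling < self.baseline - self.tolerance

# Toy usage (hypothetical numbers): a tiny window so the behavior is easy to follow
monitor = RetrievalQualityMonitor(baseline=0.8, window=4, tolerance=0.1)
for hit in [True, True]:
    monitor.record(hit)
early = monitor.should_retrain()     # window not full yet
for hit in [False, False]:
    monitor.record(hit)
degraded = monitor.should_retrain()  # rolling rate 0.5 is below 0.8 - 0.1
print(early, degraded)
```

A trigger like this only decides when to start collecting new labeled pairs. The held-out benchmark still decides whether the retrained model ships.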
Common Mistakes
:::danger Training with only in-batch negatives on small datasets
With small datasets and small batch sizes, in-batch negatives may all be obviously irrelevant to each query. The model learns to separate domains rather than learn fine-grained domain distinctions. Use hard negative mining when your dataset is under 10k examples, or use large batch sizes (128+) to increase the chance of challenging in-batch negatives.
:::
:::danger Not filtering synthetic data quality
LLMs generate queries that are sometimes off-topic, too generic, or essentially paraphrase the passage. Training on these degrades model quality. Always filter synthetic data using a similarity threshold - reject queries where the embedding similarity to their passage is too low (off-topic) or too high (paraphrase).
:::
:::warning Fine-tuning for too many epochs
Embedding models overfit quickly on small domain datasets. With fewer than 10k training examples, 1-3 epochs is typically optimal. More epochs decrease performance on general tasks without improving domain performance significantly. Monitor your evaluation metric at each epoch and stop early when it plateaus.
:::
:::tip Always evaluate on your domain before and after fine-tuning
The improvement from fine-tuning varies dramatically by domain and dataset quality. Some domain datasets improve retrieval by 30+ percentage points; others show minimal improvement. Always measure the baseline (general model) and compare to the fine-tuned model on your held-out test set. This tells you if fine-tuning was worth the effort.
:::
Interview Q&A
Q1: Why do general embedding models underperform on domain-specific text?
Three reasons. Vocabulary gap: specialized terms (medical abbreviations, legal jargon, technical acronyms) appear rarely in general pre-training data, so the model has weak representations for them. Conceptual framework gap: the same word means different things in different domains - "model" in ML discourse vs general English, "arm" in clinical trials vs anatomy. Asymmetry mismatch: domain Q&A often pairs short questions with long passage answers, and general models may not have seen that pairing for text from the specific domain. The result is embedding models that associate domain-specific queries with irrelevant documents rather than the correct domain-specific passages.
Q2: What is hard negative mining and why is it important?
Hard negatives are training examples that are semantically similar to the query but not actually relevant - the "almost right" documents that force the model to learn fine distinctions. Without hard negatives, training proceeds with random negatives (randomly sampled corpus documents), which are obviously different from the query. The model quickly learns to separate "medical query" from "sports article" but fails to distinguish "metformin mechanism" from "metformin dosage" - because both are about metformin, and both would be easy negatives if sampled randomly.
Hard negative mining works by using an initial embedding model to retrieve the top-k nearest documents for each query, then using these semantically close but non-relevant documents as negatives. This forces the model to learn the subtle distinctions that matter for domain-specific retrieval.
Q3: What is GPL and when would you use it?
GPL (Generative Pseudo Labeling) is a domain adaptation technique for when you have domain text but no labeled query-document pairs. The pipeline: (1) generate synthetic queries from each document using an LLM, (2) retrieve candidate documents for each synthetic query using a general embedding model, (3) score the synthetic positive (the source document) and retrieved candidates using a cross-encoder that estimates relevance, (4) use cross-encoder scores as pseudo-labels to train the embedding model.
Use GPL when: you have domain documents but no labeled query-document pairs, you can't afford a labeling effort, and a general embedding model performs unacceptably on your domain. GPL typically improves retrieval by 5-15 percentage points versus a general model, with no human labels required.
Q4: How many training examples do you need for effective embedding fine-tuning?
With high-quality labeled pairs: 1,000 examples can produce meaningful improvement; 10,000 is sufficient for most domains; 100,000+ yields robust fine-tuning.
With synthetic data: generate 3-10 queries per document and filter for quality. For a corpus of 10,000 documents, this yields 30,000-100,000 training examples.
With GPL: the volume scales with your corpus size. For a corpus of 10,000 documents with 3 queries per document and 5 hard negatives per query, you'd generate ~150,000 training triplets.
Training beyond 3 epochs rarely helps with small datasets; use early stopping based on validation retrieval quality.
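The volume estimate above is simple multiplication, and a small helper makes it easy to rerun for your own corpus. This is an illustrative sketch: the function name and the `keep_rate` parameter (the fraction of examples that survive quality filtering) are assumptions, not figures from the lesson.

```python
def estimate_triplets(n_docs: int, queries_per_doc: int,
                      negatives_per_query: int, keep_rate: float = 1.0) -> int:
    """Rough count of (query, positive, negative) triplets after filtering."""
    return round(n_docs * queries_per_doc * negatives_per_query * keep_rate)

full = estimate_triplets(10_000, queries_per_doc=3, negatives_per_query=5)
after_filtering = estimate_triplets(10_000, 3, 5, keep_rate=0.7)  # hypothetical 70% kept
print(full, after_filtering)
```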
Summary
Fine-tuning embedding models for domain-specific text is one of the highest-ROI investments in a RAG pipeline. The improvement from a well-fine-tuned domain model over a general model can reach 10-30 percentage points in retrieval quality, but it varies by domain, so measure it on your own held-out set.
Key techniques:
- Contrastive fine-tuning with (query, positive, negative) triplets: the standard approach when you have labeled pairs
- Hard negative mining: significantly improves training signal by using semantically similar but irrelevant documents as negatives
- Synthetic data generation: use LLMs to generate queries from your domain passages when you have no labeled data
- GPL: full domain adaptation pipeline requiring no labels - generates queries, retrieves candidates, scores with cross-encoder
- TSDAE: unsupervised domain adaptation from raw text alone - useful as pre-training before supervised fine-tuning
The standard library is Sentence Transformers. Fine-tune from a strong base model (BGE-large or E5-large) rather than from BERT, as the pre-trained embedding quality matters for the fine-tuning starting point.
