Embedding Models - The Landscape
A comprehensive survey of the embedding model ecosystem - SBERT, contrastive learning, SimCSE, E5, BGE, GTE, OpenAI, Voyage AI, Cohere, and the MTEB leaderboard.
Reducing embedding storage and search costs - float32 to float16, int8, and binary quantization, Hamming distance search, the rescoring trick, and implementation with FAISS and Qdrant.
Build, deploy, and operate production-grade embedding pipelines - caching, incremental indexing, staleness management, vector DB selection, and cost optimization at scale.
MTEB benchmark deep dive - nDCG@10, Recall@K, MRR, MAP, building domain-specific evaluation sets, running MTEB locally, and avoiding the contamination problem.
Contrastive fine-tuning with triplet loss, hard negative mining, in-batch negatives, synthetic data generation, TSDAE, GPL, and a full worked example on domain adaptation.
Nested embeddings where any prefix of dimensions is informative - training MRL, adaptive retrieval, 10x FLOP reduction, and how OpenAI's text-embedding-3 uses MRL internally.
A complete guide to embeddings - models, evaluation (MTEB), fine-tuning, Matryoshka embeddings, quantization, multimodal embeddings, and production pipelines.
CLIP, SigLIP, ImageBind, ColPali, and CLAP - embedding images, text, audio, and documents in shared vector spaces for cross-modal search and zero-shot classification.
text-embedding-3, Matryoshka training, Voyage AI, Cohere Embed, cost analysis, batch processing patterns, and when to choose API vs self-hosted embeddings.
Embeddings, vector databases, similarity search, RAG pipelines, and production vector search in Python with FAISS, Chroma, Pinecone, and pgvector.
The fundamental concept of embeddings - mapping meaning to geometric space, cosine similarity, Word2Vec, the king-queen analogy, and why dense retrieval replaced keyword search.