The fundamental concept of embeddings - mapping meaning to geometric space, cosine similarity, Word2Vec, the king-queen analogy, and why dense retrieval replaced keyword search.

What Are Embeddings and Why They Matter

Reading time: 20 min | Relevance: AI Engineer, ML Engineer, Research Engineer

The Search Engine That Couldn't Find "Heart Attack"

It's 2010. A medical information website receives 50,000 queries per day. One frequent query: "chest tightness, left arm pain, sweating." The website's keyword search returns zero results because no article contains the phrase "chest tightness, left arm pain, sweating." Articles do contain the phrases "myocardial infarction" and "heart attack symptoms." But keyword search doesn't know that these phrases describe the same condition as the query. To keyword search, "heart attack" and "myocardial infarction" are unrelated strings of characters.

The users searching for "chest tightness, left arm pain, sweating" are probably people who think they're having a heart attack. They need information urgently. The search system fails them completely because it matches literal strings, not meaning.

This is the fundamental problem that embeddings solve. Embeddings are a way to represent text (or images, audio, or any other data) as a point in a high-dimensional geometric space, where points close together have similar meanings. A good embedding model would place "heart attack symptoms," "myocardial infarction signs," and "chest tightness left arm pain" all close together in this space - because they mean nearly the same thing - even though they share almost no words.

The shift from keyword-based to embedding-based retrieval is one of the most consequential changes in how AI systems work. Modern RAG pipelines, semantic search engines, recommendation systems, and classification systems are built on this idea. Understanding embeddings deeply - how they're created, what they capture, and what they don't - is fundamental to building effective AI systems.

Why This Exists - The Failure of Sparse Representations

Before embeddings, the dominant way to represent text in machine learning was sparse representations - high-dimensional vectors with mostly zeros.

Bag of words: Represent a document as a vector of word counts. A vocabulary of 100,000 words produces a 100,000-dimensional vector. For most documents, 99.9% of entries are zero. The "Jurassic Park" document vector and the "dinosaur biology" document vector share almost no non-zero entries, even though they're highly topically related.

TF-IDF: An improvement - weight each term by its frequency in the document and by its inverse document frequency, which downweights words that are common across the corpus. Still fundamentally sparse. Still no understanding of synonymy. "Automobile" and "car" are unrelated in TF-IDF space.

BM25: The most widely used sparse retrieval algorithm. Used in Elasticsearch, Lucene, and most traditional search engines. Still sparse. Still fundamentally term-matching, not meaning-based.

The problems with sparse representations:

Vocabulary mismatch: If the query uses different words than the document, similarity score is low even if meaning matches
No synonym handling: "Doctor" and "physician" are different dimensions in the vector
No paraphrase handling: "The cat sat on the mat" and "A feline rested on the carpet" share zero words
High dimensionality: 100k+ dimensional vectors are expensive to store and compare
No transfer: Every domain needs its own vocabulary

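Dense embeddings address all five problems: similar meanings map to nearby vectors regardless of wording, vectors have hundreds of dimensions rather than hundreds of thousands, and a pretrained model carries over to new text without a hand-built vocabulary.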
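The vocabulary mismatch problem is easy to reproduce. The following is a minimal sketch in plain Python (no libraries); the `bag_of_words` helper and the three example sentences are made up for illustration and do not come from any real search engine.

```python
import math
from collections import Counter

def bag_of_words(text: str, vocab: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "physician for heart attack"
doc_a = "doctor treats myocardial infarction"  # same topic, no shared words
doc_b = "attack on the castle heart"           # different topic, shared words

# Shared vocabulary: one vector dimension per distinct word.
vocab = sorted(set(f"{query} {doc_a} {doc_b}".lower().split()))
q = bag_of_words(query, vocab)
sim_a = cosine(q, bag_of_words(doc_a, vocab))
sim_b = cosine(q, bag_of_words(doc_b, vocab))

print(f"relevant doc:   {sim_a:.2f}")
print(f"irrelevant doc: {sim_b:.2f}")
```

The relevant document scores exactly zero because it shares no terms with the query, while the irrelevant one scores higher purely through word overlap.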

Historical Context

2003 - Neural language models (Bengio et al.) first learn continuous word representations as a byproduct of language modeling. The representations are dense and continuous, not sparse.

2013 - Mikolov et al. at Google publish Word2Vec - the first practically efficient algorithm for learning high-quality word embeddings from large corpora. Word2Vec's efficiency and the quality of its embeddings trigger an explosion of interest.

2014 - GloVe (Pennington et al.) takes a different route to the same goal, fitting word vectors to global co-occurrence statistics rather than predicting words from local context windows. Together, Word2Vec and GloVe establish dense word embeddings as a standard tool.

2017-2018 - The Transformer architecture emerges. BERT (Devlin et al. 2018) learns contextual embeddings - the same word gets different embeddings in different contexts. "Bank" in "river bank" and "bank account" have different BERT embeddings.

2019 - Reimers and Gurevych publish Sentence-BERT (SBERT), which modifies BERT to produce high-quality sentence-level embeddings efficient for semantic similarity. This is the inflection point for practical semantic search.

2022-2024 - The embedding model landscape explodes: E5, BGE, GTE, Voyage AI, Cohere Embed v3, OpenAI text-embedding-3. The MTEB benchmark standardizes comparison, and reported scores on retrieval tasks improve markedly within two years.

Word2Vec: The Breakthrough That Made Embeddings Practical

Mikolov et al.'s Word2Vec (2013) was not the first neural word embedding, but it was the first that was both high-quality and computationally efficient. Understanding Word2Vec builds the intuition for all later embedding models.

The core idea

Word2Vec learns embeddings by training a neural network to predict words from their context (CBOW: Continuous Bag of Words) or context from a word (Skip-gram). The intuition: words that appear in similar contexts have similar meanings. "Doctor" and "physician" appear in similar sentences - near words like "hospital," "patient," "treatment." So they should have similar embeddings.

The training objective for Skip-gram:

$\max \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)$

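where $c$ is the context window size, $T$ is the number of words in the training corpus, and the probability is modeled as a softmax over the vocabulary:

$P(w_O | w_I) = \frac{\exp({\mathbf{v}'_{w_O}}^T \mathbf{v}_{w_I})}{\sum_{w=1}^{W} \exp({\mathbf{v}'_w}^T \mathbf{v}_{w_I})}$

Here $\mathbf{v}_w$ is the input (center-word) vector of word $w$, $\mathbf{v}'_w$ is its output (context-word) vector, and $W$ is the vocabulary size. After training, the input embedding vectors $\mathbf{v}_w$ capture semantic relationships.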
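To make the softmax concrete, here is a small sketch that evaluates $P(w_O | w_I)$ for a three-word vocabulary. It keeps separate input and output vector tables, as the original Word2Vec formulation does. The vectors are hand-picked illustrative values, not trained embeddings, and `skipgram_probs` is a hypothetical helper name.

```python
import math

# Toy vocabulary with hand-picked 2-d vectors (illustrative values only).
vocab = ["doctor", "hospital", "banana"]
input_vecs = {   # used when the word is the center word
    "doctor": [1.0, 0.2],
    "hospital": [0.9, 0.1],
    "banana": [-0.8, 1.0],
}
output_vecs = {  # used when the word is a context word
    "doctor": [0.8, 0.0],
    "hospital": [1.0, 0.1],
    "banana": [-1.0, 0.9],
}

def skipgram_probs(center: str) -> dict[str, float]:
    """Softmax over dot products between the center vector and every output vector."""
    v_in = input_vecs[center]
    scores = {
        w: sum(a * b for a, b in zip(output_vecs[w], v_in))
        for w in vocab
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

probs = skipgram_probs("doctor")
for word, p in probs.items():
    print(f"P({word} | doctor) = {p:.3f}")
```

The denominator sums over the whole vocabulary, which is expensive when $W$ is in the hundreds of thousands. Word2Vec avoids computing it exactly by using hierarchical softmax or negative sampling.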

The king-queen analogy

The most famous property of Word2Vec embeddings is linear arithmetic over meaning:

$\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$

This is remarkable. The model was never told that "king" relates to "man" as "queen" relates to "woman." It learned this relationship purely from the co-occurrence statistics of these words in text. The "royalty" concept and the "gender" concept are encoded as separable directions in the embedding space.

import numpy as np

# Simulating the king-queen analogy
# In reality you'd load pretrained Word2Vec embeddings

def demonstrate_word_analogy(word_vectors: dict[str, np.ndarray]):
    """
    Demonstrate the classic Word2Vec analogy:
    king - man + woman ≈ queen
    """
    king = word_vectors["king"]
    man = word_vectors["man"]
    woman = word_vectors["woman"]

    # Arithmetic on embedding vectors
    result_vector = king - man + woman

    # Find the word closest to result (by cosine similarity)
    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Compare to all words
    similarities = {
        word: cosine_similarity(result_vector, vec)
        for word, vec in word_vectors.items()
        if word not in {"king", "man", "woman"}
    }

    most_similar = max(similarities, key=similarities.get)
    print(f"king - man + woman = {most_similar} (similarity: {similarities[most_similar]:.3f})")
    # With good pretrained vectors the top match is typically "queen";
    # the exact similarity score depends on the model and corpus.

# Other discovered analogies in Word2Vec:
# Paris - France + Germany ≈ Berlin  (capital cities)
# walked - walk + run ≈ ran          (verb tense)
# bigger - big + small ≈ smaller     (comparative form)

What the geometry means

The key insight from Word2Vec: semantic relationships can be expressed as vector arithmetic. This means:

Synonyms have similar vectors (high cosine similarity)
Antonyms often differ along a consistent direction, though they can still sit close together because they occur in similar contexts
Category membership corresponds to clusters in the space
Analogical relationships correspond to consistent vector offsets

This geometric interpretation is what makes embeddings so powerful - you can do arithmetic on meaning.

From Words to Sentences to Documents

Word2Vec produces word-level embeddings. For most applications, you need sentence or document embeddings.

The naive approach: averaging

The simplest sentence embedding is the average (or sum) of its word embeddings:

$\mathbf{s} = \frac{1}{|S|} \sum_{w \in S} \mathbf{v}_w$

This is fast and sometimes surprisingly good. But it loses word order entirely. "The dog bit the man" and "The man bit the dog" have identical average embeddings.
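The loss of word order can be checked directly. This sketch uses hypothetical 3-dimensional word vectors (real Word2Vec vectors have 100-300 dimensions) and a hypothetical `average_embedding` helper.

```python
# Hypothetical word vectors, chosen only for illustration.
word_vectors = {
    "the": [0.1, 0.0, 0.1],
    "dog": [0.9, 0.2, 0.0],
    "bit": [0.0, 0.8, 0.3],
    "man": [0.7, 0.1, 0.6],
}

def average_embedding(sentence: str) -> list[float]:
    """Mean of the word vectors, computed one coordinate at a time."""
    vectors = [word_vectors[w] for w in sentence.lower().split()]
    return [sum(coords) / len(vectors) for coords in zip(*vectors)]

a = average_embedding("The dog bit the man")
b = average_embedding("The man bit the dog")

# Compare with a small tolerance because floating-point sums depend on order.
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(same)
```

Both sentences contain the same multiset of words, so their averages coincide even though the meanings are opposite.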

BERT: contextual embeddings

BERT learns contextual embeddings - the representation of a word depends on its surrounding context. The BERT representation of "bank" in "I went to the bank to withdraw money" is different from "bank" in "The river bank was muddy."

For sentence representation with BERT, two approaches are common:

CLS token: Use the representation of the special [CLS] token as the sentence embedding. BERT places this token at the start of every sequence and, during pre-training, it aggregates sentence-level information for Next Sentence Prediction.

Mean pooling: Average the representations of all tokens. Often works better than CLS token for semantic similarity tasks.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

def get_bert_sentence_embedding(text: str, model_name: str = "bert-base-uncased") -> np.ndarray:
    """Get a sentence embedding using BERT with mean pooling."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling: average token embeddings (excluding padding)
    token_embeddings = outputs.last_hidden_state  # (batch, seq, hidden)
    attention_mask = inputs["attention_mask"]       # (batch, seq)

    # Mask out padding tokens, then average
    mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * mask_expanded, 1)
    sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask

    return mean_embeddings[0].numpy()  # Shape: (hidden_size,)

Problem with raw BERT for sentence similarity: Raw BERT (without fine-tuning for similarity) produces embeddings that perform poorly for semantic similarity. The SBERT paper reports that averaged BERT token embeddings and the [CLS] vector often score below averaged GloVe vectors on semantic textual similarity benchmarks. The reason: BERT was trained on MLM (masked language modeling) and NSP (next sentence prediction), which don't directly optimize for embedding similarity.

SBERT: the turning point

Sentence-BERT (Reimers and Gurevych, 2019) fine-tunes BERT with a Siamese network architecture specifically for semantic similarity. Instead of feeding both sentences into BERT together as one concatenated input (a cross-encoder), SBERT encodes each sentence independently through the same network with shared weights, then trains the pooled embeddings with a similarity objective (the paper uses classification, cosine-similarity regression, and triplet losses).

The result: SBERT produces embeddings where cosine similarity directly corresponds to semantic similarity. This made practical semantic search possible.
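One of the objectives in the SBERT paper is a regression on cosine similarity: the cosine of the two pooled embeddings is compared to a human similarity label with squared error. The sketch below computes that loss for single pairs. The embeddings and the label are made up; in real training they come from the shared encoder, and the gradient of this loss flows back into its weights.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_regression_loss(u: list[float], v: list[float], gold_score: float) -> float:
    """Squared error between predicted cosine similarity and the label."""
    return (cosine(u, v) - gold_score) ** 2

# Made-up pooled sentence embeddings (hypothetical values).
emb_a = [0.9, 0.1, 0.3]   # stands in for "A man is playing guitar"
emb_b = [0.8, 0.2, 0.4]   # stands in for "Someone plays a guitar"
emb_c = [-0.5, 0.9, 0.0]  # stands in for "The stock market fell"

# The label says "very similar" (0.9 on a 0-1 scale, an assumed scale).
loss_close = cosine_regression_loss(emb_a, emb_b, gold_score=0.9)
loss_far = cosine_regression_loss(emb_a, emb_c, gold_score=0.9)

print(f"embeddings already close: loss = {loss_close:.4f}")
print(f"embeddings far apart:     loss = {loss_far:.4f}")
```

When a pair labeled as similar has distant embeddings, the loss is large, and minimizing it pulls the two vectors together. This is how cosine similarity comes to track semantic similarity after fine-tuning.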

Cosine Similarity: Why Angle Matters More Than Magnitude

The standard similarity metric for embeddings is cosine similarity, not Euclidean distance:

$\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_i u_i^2} \cdot \sqrt{\sum_i v_i^2}}$

Cosine similarity always lies in $[-1, 1]$. For vectors whose components are all non-negative (such as bag-of-words counts), it lies in $[0, 1]$.

Why not Euclidean distance?

Euclidean distance $\|\mathbf{u} - \mathbf{v}\|_2$ is sensitive to vector magnitude (length). Two vectors pointing in the same direction but with different magnitudes are semantically similar yet far apart in Euclidean terms. This causes problems because:

Long documents tend to have higher-magnitude embeddings than short ones (if you sum rather than average)
Embedding models trained with different normalization schemes produce different magnitude distributions
Retrieval systems using Euclidean distance can systematically prefer short over long documents

Cosine similarity normalizes out the magnitude, measuring only the angle between vectors. Two vectors pointing in exactly the same direction have cosine similarity 1, regardless of their magnitudes.
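A quick numeric check of this claim, using two hand-picked vectors where one is a scaled copy of the other (the values are arbitrary):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 30.0]  # same direction as u, 10x the magnitude

cos_uv = cosine(u, v)
dist_uv = euclidean(u, v)

print(f"cosine similarity:  {cos_uv:.3f}")
print(f"Euclidean distance: {dist_uv:.3f}")
```

Cosine similarity treats the two vectors as identical in direction, while Euclidean distance reports them as far apart.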

L2 normalization trick: If you L2-normalize all embeddings before storing them, inner product (dot product) becomes equivalent to cosine similarity. This matters for efficiency: similarity search libraries such as FAISS, and most vector databases, offer heavily optimized inner product search (FAISS also has GPU indexes), and with pre-normalized vectors each comparison skips the two norm computations that full cosine similarity requires.
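A short derivation shows that nothing is lost by normalizing. For unit-length vectors, $\|\mathbf{u}\| = \|\mathbf{v}\| = 1$, so:

$\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta) = \cos(\theta)$

$\|\mathbf{u} - \mathbf{v}\|_2^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\,\mathbf{u} \cdot \mathbf{v} = 2 - 2\cos(\theta)$

On normalized embeddings, ranking by largest inner product, largest cosine similarity, or smallest Euclidean distance therefore yields the same order.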

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize a vector."""
    norm = np.linalg.norm(v)
    if norm < 1e-10:
        return v
    return v / norm

def efficient_similarity_with_normalized_vectors(
    query: np.ndarray,
    corpus: np.ndarray  # (n_docs, dim)
) -> np.ndarray:
    """
    Fast similarity computation using pre-normalized vectors.
    Once normalized, dot product = cosine similarity.
    This is much faster than computing cosine similarity from scratch.
    """
    q_normalized = l2_normalize(query)
    # Assuming corpus is already normalized:
    similarities = corpus @ q_normalized  # (n_docs,) - vectorized dot product
    return similarities

# Example
query_embedding = np.random.randn(768)
doc_embeddings = np.random.randn(10000, 768)

# Normalize corpus once (at indexing time)
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
doc_embeddings_normalized = doc_embeddings / norms

# Fast similarity at query time
sims = efficient_similarity_with_normalized_vectors(
    query_embedding, doc_embeddings_normalized
)
top_10_indices = np.argsort(sims)[-10:][::-1]

Embedding Dimensions: 768, 1536, 3072

Common embedding dimensions and what they trade off:

| Model | Dimensions | Notes |
| --- | --- | --- |
| Word2Vec (small) | 100–300 | Fast, simple |
| SBERT (BERT-base) | 768 | Standard |
| SBERT (BERT-large) | 1024 | Better quality, slower |
| OpenAI text-embedding-3-small | 1536 | API standard |
| OpenAI text-embedding-3-large | 3072 | Highest quality, Matryoshka-trained |
| E5-large | 1024 | Strong open-source |
| BGE-large | 1024 | Competitive open-source |
| Voyage-3 | 1024 | Strong commercial |

Higher dimensionality:

Can capture more nuanced semantic distinctions
Requires more storage (each float32 costs 4 bytes)
Makes similarity search slower (more operations per comparison)
Can overfit if the training data doesn't support the dimensionality

For most RAG applications, 768–1024 dimensions are usually sufficient. OpenAI's text-embedding-3-large at 3072 dims is among the strongest general-purpose models but has 4× the storage cost of a 768-dim model.
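The storage trade-off is simple arithmetic: vectors × dimensions × bytes per value. The sketch below works it out for a hypothetical corpus of one million chunks stored as float32. It counts raw vectors only; real indexes add overhead for graph links, IDs, and metadata.

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes (10**9 bytes), ignoring index overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

n_chunks = 1_000_000  # hypothetical corpus size

sizes = {dims: index_size_gb(n_chunks, dims) for dims in (768, 1024, 1536, 3072)}
for dims, size in sizes.items():
    print(f"{dims:>5} dims: {size:.2f} GB")
```

Storage grows linearly with dimension, so the 3072-dim index is four times the size of the 768-dim one.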

The Semantic Search Revolution

Dense embeddings enabled semantic search - retrieval based on meaning rather than keyword overlap. This change is more fundamental than it first appears.

Before dense retrieval

Traditional search systems:

Index documents by keywords (BM25, TF-IDF)
Match query keywords against document keywords
Rank by keyword importance and frequency

Failure modes:

Query "cheap flights" misses documents about "affordable airfare" or "budget airline tickets"
Query "heart attack symptoms" misses "myocardial infarction signs"
Long-tail medical, legal, and technical queries systematically underperform

After dense retrieval

Embedding-based search:

Encode all documents into a dense vector index
Encode the query into the same space
Retrieve documents by approximate nearest neighbor search

Advantages:

Synonym-insensitive: "heart attack" and "myocardial infarction" land in the same region
Paraphrase-robust: different phrasings of the same question retrieve the same documents
Cross-lingual (with multilingual models): queries in French retrieve English documents
Natural language queries: "how do I format a date in Python" can retrieve a relevant document even if that document never uses the words "format" or "date"
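The retrieval step can be sketched as a brute-force nearest neighbor search. The 3-dimensional vectors below are hand-made stand-ins for real embeddings, and `top_k` is a hypothetical helper. Production systems replace the exhaustive loop with an approximate nearest neighbor index such as HNSW or IVF.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 2):
    """Exact nearest neighbors by cosine similarity (dot product of unit vectors)."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), doc_id)
        for doc_id, vec in corpus.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hand-made vectors: the first axis loosely means "cardiac", the last "travel".
corpus = {
    "myocardial infarction signs": [0.9, 0.1, 0.0],
    "heart attack symptoms": [0.8, 0.2, 0.1],
    "budget airline tickets": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # stands in for "chest tightness, left arm pain"

results = top_k(query_vec, corpus, k=2)
for doc_id, score in results:
    print(f"{score:.3f}  {doc_id}")
```

Both cardiac documents are retrieved even though neither shares a word with the query text, because the comparison happens in vector space.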

Dense retrieval in RAG pipelines

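The typical RAG (Retrieval-Augmented Generation) pipeline has two phases. At indexing time, documents are split into chunks, each chunk is embedded, and the vectors are stored in a vector index. At query time, the query is embedded, the nearest chunks are retrieved, and they are passed to the LLM as context for generation.

The embedding model is critical at two points: encoding documents during indexing and encoding queries at retrieval time. Both normally use the same model; asymmetric setups instead pair a query encoder with a document encoder trained to share one vector space.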
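The two points where the embedding model is called can be seen in a toy pipeline. Everything here is an assumption made for illustration: `embed` is a stand-in that counts words on two hand-made concept axes, not a real embedding model, and the prompt template is arbitrary. A real pipeline would call a trained model inside `embed` and send the prompt to an LLM.

```python
import math
import re

# Stand-in for an embedding model: maps known words to one of two concept axes.
CONCEPT_AXES = {
    "heart": 0, "attack": 0, "myocardial": 0, "infarction": 0, "chest": 0, "pain": 0,
    "flight": 1, "flights": 1, "airfare": 1, "airline": 1, "budget": 1, "cheap": 1,
}

def embed(text: str) -> list[float]:
    vec = [0.0, 0.0]
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in CONCEPT_AXES:
            vec[CONCEPT_AXES[word]] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

# 1. Indexing time: embed every chunk once and store (chunk, vector) pairs.
chunks = [
    "Myocardial infarction requires emergency treatment.",
    "Budget airline fares are lowest in winter.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Query time: embed the query with the same function, then rank the chunks.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generation: the retrieved chunks become context in the LLM prompt.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What should I do about chest pain?")
print(prompt)
```

Using one `embed` function for both chunks and queries is what keeps them in the same vector space; mixing two unrelated models would make the similarity scores meaningless.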

Use Cases Beyond RAG

Embeddings are used far beyond RAG:

Semantic deduplication: Find near-duplicate documents in a large corpus. Embed all documents, cluster by cosine similarity, identify clusters where multiple documents cover the same content. Essential for cleaning training datasets.

Classification without fine-tuning: Embed query + candidate labels, classify by finding the most similar label. Zero-shot text classification with no task-specific training.

Clustering: Group documents by topic, customer support tickets by issue type, code by functionality. K-means and hierarchical clustering work directly on embeddings.

Anomaly detection: Fit a distribution over normal embeddings; flag outliers as anomalous. Used for fraud detection, quality filtering, and data validation.

Recommendation systems: Represent users and items in the same embedding space. Recommend items close to a user's embedding.
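Two of these use cases, semantic deduplication and zero-shot classification, reduce to a few lines once embeddings exist. The sketch below uses made-up 3-dimensional vectors in place of real model output; the `near_duplicates` and `zero_shot_label` helpers, the label names, and the 0.98 threshold are all illustrative assumptions.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up document embeddings (a real system would get these from a model).
docs = {
    "doc1": [0.90, 0.10, 0.05],
    "doc2": [0.89, 0.11, 0.06],  # almost the same direction as doc1
    "doc3": [0.10, 0.95, 0.00],
}

def near_duplicates(embeddings: dict[str, list[float]], threshold: float = 0.98):
    """Return all pairs whose cosine similarity meets the threshold."""
    ids = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]

def zero_shot_label(text_vec: list[float], label_vecs: dict[str, list[float]]) -> str:
    """Pick the label whose embedding is most similar to the text embedding."""
    return max(label_vecs, key=lambda label: cosine(text_vec, label_vecs[label]))

# Made-up label embeddings, standing in for embedded label descriptions.
labels = {
    "billing": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.1],
}

dupes = near_duplicates(docs)
predicted = zero_shot_label(docs["doc3"], labels)

print("near-duplicate pairs:", dupes)
print("doc3 label:", predicted)
```

The pairwise loop is quadratic in the number of documents, so large corpora typically use a nearest neighbor index or clustering to find candidate pairs first.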

Common Mistakes

:::danger Using raw BERT embeddings for semantic similarity

Raw BERT (without fine-tuning for similarity) performs surprisingly poorly on semantic similarity tasks. The BERT training objective (MLM, NSP) doesn't optimize for embedding space geometry. Always use a model fine-tuned for similarity (SBERT variants, E5, BGE) rather than raw BERT.

:::

:::danger Using Euclidean distance for embedding comparison

Euclidean distance on unnormalized embeddings is sensitive to vector magnitude, which correlates with document length and other irrelevant factors. Use cosine similarity instead. In practice, normalize embeddings to unit L2 norm at indexing time, then use inner product (which equals cosine similarity for normalized vectors) - it skips the per-comparison norm computation and maps onto well-optimized hardware operations.

:::

:::warning Using the same embedding model for all domains

General embedding models are trained on general text. If your domain is specialized (medical, legal, financial, code), general models will underperform significantly. Evaluate domain-specific performance before deploying, and consider fine-tuning if performance is insufficient. Lesson 04 covers this in depth.

:::

:::warning Treating embedding similarity as exact semantic equivalence

Cosine similarity above 0.9 means the embeddings are very close, not that the documents say the same thing. Near-duplicate embeddings can have meaningful differences in nuance, specificity, or correctness. For high-stakes applications, use a cross-encoder reranker on top of embedding retrieval rather than relying on embedding similarity alone.

:::

:::tip Normalize embeddings before storing in a vector database

Normalizing embeddings to unit L2 norm before indexing converts cosine similarity to inner product, enabling faster search operations. Most vector databases (Qdrant, Weaviate, Pinecone) support inner product similarity natively and can use highly optimized SIMD operations for it. This is a simple change that can noticeably improve throughput.

:::

Interview Q&A

Q1: What is an embedding and what problem does it solve?

An embedding is a dense vector representation of data (text, image, audio) in a high-dimensional space, where vectors close together represent semantically similar content. The fundamental problem it solves is the vocabulary mismatch problem in keyword search: keyword-based retrieval only finds documents that share words with the query. "Heart attack" doesn't retrieve documents about "myocardial infarction" even though they mean the same thing. Embeddings map meaning to geometric space, so semantically equivalent content lands near each other regardless of word choice. This enables semantic search, where you retrieve by meaning rather than by keyword overlap.

Q2: Why is cosine similarity used instead of Euclidean distance?

Cosine similarity measures the angle between two vectors, ignoring their magnitudes. Euclidean distance measures the absolute distance between two points, which is sensitive to magnitude. In embedding space, magnitude often correlates with irrelevant factors - document length (longer documents tend to have higher-magnitude embeddings if you sum word vectors), or specific training details of the embedding model. Two vectors pointing in exactly the same direction are semantically identical regardless of their lengths. Cosine similarity captures this; Euclidean distance doesn't. In practice, normalizing to unit L2 norm before indexing and using inner product (equivalent to cosine similarity for normalized vectors) gives the best of both worlds: semantic correctness and computational efficiency.

Q3: What is the king-queen analogy and why is it significant?

The king-queen analogy: in Word2Vec embedding space, vector(king) - vector(man) + vector(woman) ≈ vector(queen). The same relationship holds for Paris - France + Germany ≈ Berlin, walked - walk + run ≈ ran, and many others. It's significant because it demonstrates that embedding spaces capture semantic relationships as linear geometric relationships. The model was never explicitly taught that king:man :: queen:woman - it learned this from co-occurrence statistics in text. This means you can do arithmetic on meaning: add, subtract, and interpolate concepts. It established the foundational insight that dense vector representations can encode semantic structure in geometrically useful ways.

Q4: What made SBERT different from raw BERT for similarity tasks?

Raw BERT was pre-trained for MLM (predict masked tokens) and NSP (predict whether two sentences are consecutive). These objectives don't optimize the embedding space for semantic similarity - vectors for semantically similar sentences are not necessarily close together in raw BERT space. SBERT fine-tuned BERT using a Siamese network architecture with a similarity objective: pairs of semantically similar sentences are pushed close together, dissimilar sentences are pushed apart. After fine-tuning, cosine similarity in the embedding space directly correlates with semantic similarity. Additionally, SBERT made bi-encoder search practical: each sentence is embedded independently once, and similarity is computed with a single dot product. Raw BERT required running the cross-encoder on every (query, candidate) pair - $O(n)$ BERT forward passes per query, prohibitively slow for large corpora.

Q5: What are the key use cases for embeddings beyond RAG?

Beyond RAG: (1) Semantic deduplication - find near-duplicate documents in training datasets by clustering on embedding similarity; essential for dataset quality control. (2) Zero-shot classification - embed a text and candidate category labels, classify by nearest label embedding; no task-specific training data needed. (3) Clustering - group customer support tickets, documents, or code by semantic similarity; K-means directly on embedding space. (4) Anomaly detection - fit a distribution over the embeddings of normal data; flag outliers as anomalous; used for fraud detection and quality filtering. (5) Recommendation - represent users and items in a shared embedding space, recommend items close to the user's embedding. (6) Cross-lingual retrieval - with multilingual embedding models, match French queries to English documents.

Summary

Embeddings are dense vector representations where semantic similarity corresponds to geometric proximity. They solve the vocabulary mismatch problem that makes keyword search fail on specialized queries.

Key concepts:

Word2Vec (2013): First practical dense word embeddings; demonstrated semantic arithmetic (king - man + woman ≈ queen)
Contextual embeddings (BERT, 2018): Same word gets different embeddings in different contexts
SBERT (2019): Fine-tuned BERT for semantic similarity; made practical semantic search possible
Cosine similarity: Standard metric - measures angle, not magnitude; normalize vectors for efficient computation
Dimensions: 768 (standard), 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large); more dims = more nuance but more storage

Embeddings are the foundation for semantic search, RAG pipelines, recommendation systems, clustering, and classification. The rest of this module covers how to select, evaluate, fine-tune, and deploy them in production.

