Embedding Spaces

Reading time: ~30 min · Interview relevance: Medium-High · Target roles: ML Engineer, AI Engineer, Research Engineer

The Semantic Map

In 2013, Tomas Mikolov's team at Google published Word2Vec and included a demonstration that became famous: "king" - "man" + "woman" ≈ "queen." The model had learned that word vectors encode semantic relationships geometrically. Gender differences, capital-country relationships, verb tenses - all expressed as consistent vector arithmetic.

The NLP world was stunned. A neural network had learned a map of human concepts in a high-dimensional space, where geometric relationships correspond to semantic relationships. "Rome" - "Italy" + "France" ≈ "Paris." "Swimming" - "swim" + "run" ≈ "running."

This was not programmed. It emerged from training a simple neural network to predict words from context. The geometry was a consequence of the statistical structure of language.

Transformer token embeddings work the same way, only deeper. While Word2Vec learned a single static vector per word, transformers learn a starting point - an initial embedding that each attention layer transforms into a context-specific representation. "Bank" starts with the same embedding whether it's in "river bank" or "Bank of America", but after several attention layers its representation has been updated to incorporate context and reflects which meaning is active.

Understanding embedding spaces is understanding what a language model "thinks" a word means - before it has seen any context, and how that representation evolves through the network.

The Token Embedding Table

At the input of every transformer, there is an embedding table: a matrix $E \in \mathbb{R}^{V \times d_{model}}$ where $V$ is the vocabulary size and $d_{model}$ is the embedding dimension.

Each row $E_i$ is the embedding vector for token $i$ - a $d_{model}$ -dimensional vector of learned floating-point values. This is a learnable lookup table: given token ID $i$ , return vector $E_i$ .

import torch
import torch.nn as nn
import numpy as np


# The embedding table is just an nn.Embedding
vocab_size = 32000  # LLaMA-2 vocabulary size
d_model = 4096      # LLaMA-2 7B embedding dimension

embedding = nn.Embedding(vocab_size, d_model)

print(f"Embedding table shape: {embedding.weight.shape}")  # (32000, 4096)
print(f"Total parameters: {embedding.weight.numel():,}")  # 131,072,000 (~131M params)

# Lookup: given token IDs, get their embeddings
token_ids = torch.tensor([[1, 2, 3, 4, 5]])  # batch=1, seq=5
embeddings = embedding(token_ids)
print(f"Embedding output: {embeddings.shape}")  # (1, 5, 4096)

# The scale of initial embeddings matters
# nn.Embedding initializes its weights from N(0, 1) by default
# The original transformer multiplies embeddings by sqrt(d_model) before adding positional encodings
d_model_small = 512
emb_small = nn.Embedding(1000, d_model_small)
x = emb_small(torch.tensor([42]))
x_scaled = x * (d_model_small ** 0.5)  # Scale as in original transformer

print(f"\nRaw embedding L2 norm: {x.norm().item():.3f}")
print(f"Scaled embedding L2 norm: {x_scaled.norm().item():.3f}")

The embedding table for LLaMA-2 7B has about 131 million parameters - comparable in size to an entire small language model such as GPT-2 small. Because most of a 7B model's parameters sit in its 32 transformer layers, the table is only ~2% of the total, but it is used on every forward pass.
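The ~2% figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a LLaMA-2-7B-style configuration (32 layers, FFN intermediate size 11008, no biases, an untied output head, RMSNorm weights only); apart from the vocabulary size, width, and layer count, these values are not stated in this lesson, so treat them as assumptions.

```python
# Rough parameter count for a LLaMA-2-7B-style decoder.
# Assumed config (not from this lesson): FFN intermediate size 11008,
# no bias terms, gated FFN with three projections, untied output head.
vocab_size = 32_000
d_model = 4_096
n_layers = 32
d_ffn = 11_008

embedding_params = vocab_size * d_model      # input embedding table
lm_head_params = vocab_size * d_model        # separate output projection
attn_params = 4 * d_model * d_model          # Q, K, V, O projections
ffn_params = 3 * d_model * d_ffn             # gate, up, down projections
norm_params = 2 * d_model                    # two RMSNorm weight vectors per layer

total_params = (
    embedding_params
    + lm_head_params
    + n_layers * (attn_params + ffn_params + norm_params)
    + d_model                                # final RMSNorm
)
embedding_fraction = embedding_params / total_params

print(f"Total parameters: {total_params:,}")
print(f"Embedding table:  {embedding_params:,}")
print(f"Embedding share:  {embedding_fraction:.2%}")
```

Under these assumptions the input table alone comes out just under 2% of the total; counting the separate output head as well roughly doubles that share.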

Geometric Intuition: What "Embedding" Means

The word "embedding" is borrowed from mathematics: an embedding maps objects from one space into another while preserving structure. A token embedding maps a discrete token ID into a continuous vector space.

The key insight: similar tokens should be close in this space. "Dog" and "puppy" should have nearby vectors. "Run" and "running" should be close. "Paris" and "Rome" should be close, and far from "protein".

How does this structure emerge? Through gradient descent on the next-token prediction objective. If "dog" and "puppy" often appear in the same contexts (they're interchangeable in many sentences), the gradient signal will push their embeddings to become similar - because the model gets similar loss on "My [dog/puppy] is friendly" and learns to produce similar probability distributions.

The remarkable fact: high-dimensional spaces allow many simultaneous relationships. A 512-dimensional space can encode dozens of semantic relationships independently: gender, number, tense, syntax class, semantic category, formality - all simultaneously, as different directions in the space.
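One way to build intuition for why high dimensions leave room for many independent directions: random vectors become nearly orthogonal as the dimension grows. The sketch below uses random Gaussian vectors, not trained embeddings, so it only illustrates the available room, not what a model actually learns. The helper name `mean_abs_cosine` is introduced here for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)


def mean_abs_cosine(dim: int, n_vectors: int = 200) -> float:
    """Average |cosine similarity| between distinct random Gaussian vectors."""
    x = F.normalize(torch.randn(n_vectors, dim), dim=-1)
    sims = x @ x.T
    # Drop the diagonal (each vector compared with itself)
    off_diagonal = sims[~torch.eye(n_vectors, dtype=torch.bool)]
    return off_diagonal.abs().mean().item()


for dim in (2, 32, 512):
    print(f"dim={dim:4d}  mean |cos| = {mean_abs_cosine(dim):.3f}")
```

The average overlap shrinks as the dimension increases, which is why a few hundred dimensions can hold many directions that barely interfere with one another.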

Similarity in Embedding Space

The standard similarity metric for embeddings is cosine similarity:

$\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$

This measures the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction). The norm of the vector is ignored - only direction matters.
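A quick worked example of the formula, using made-up three-dimensional vectors: rescaling a vector leaves the similarity at 1, flipping it gives -1, and a perpendicular vector gives 0.

```python
import torch
import torch.nn.functional as F

u = torch.tensor([1.0, 2.0, 3.0])

# Same direction, different norm: the norm is ignored
same_direction = F.cosine_similarity(u, 10 * u, dim=0)
# Opposite direction
opposite = F.cosine_similarity(u, -u, dim=0)
# Perpendicular vector: u . v = 3 + 0 - 3 = 0
orthogonal = F.cosine_similarity(u, torch.tensor([3.0, 0.0, -1.0]), dim=0)

print(same_direction.item(), opposite.item(), orthogonal.item())
```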

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def get_token_embeddings(model_name: str = 'bert-base-uncased'):
    """Load BERT and extract its token embedding table."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # The embedding table is at model.embeddings.word_embeddings.weight
    embedding_table = model.embeddings.word_embeddings.weight.detach()
    return tokenizer, embedding_table


def cosine_similarity_matrix(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Compute pairwise cosine similarity between rows of A and B."""
    A_norm = F.normalize(A, dim=-1)
    B_norm = F.normalize(B, dim=-1)
    return A_norm @ B_norm.T


def find_nearest_tokens(
    query_word: str,
    tokenizer,
    embedding_table: torch.Tensor,
    top_k: int = 10,
) -> list:
    """Find tokens with highest cosine similarity to query_word's embedding."""
    # Get query token ID and its embedding
    query_ids = tokenizer.encode(query_word, add_special_tokens=False)
    if len(query_ids) != 1:
        print(f"Warning: '{query_word}' tokenizes to {len(query_ids)} tokens")

    query_id = query_ids[0]
    query_embedding = embedding_table[query_id].unsqueeze(0)  # (1, d_model)

    # Compute cosine similarity against all vocabulary entries
    all_normalized = F.normalize(embedding_table, dim=-1)
    query_normalized = F.normalize(query_embedding, dim=-1)
    similarities = (query_normalized @ all_normalized.T).squeeze()  # (vocab_size,)

    # Get top-k (excluding the query itself)
    top_values, top_indices = similarities.topk(top_k + 1)
    results = []
    for val, idx in zip(top_values[1:], top_indices[1:]):  # Skip the query itself
        token = tokenizer.decode([idx.item()])
        results.append((token.strip(), val.item()))

    return results


# Demonstrate (requires BERT to be downloaded)
# tokenizer, embedding_table = get_token_embeddings('bert-base-uncased')
# print("Nearest tokens to 'king':")
# for token, sim in find_nearest_tokens('king', tokenizer, embedding_table):
#     print(f"  {token}: {sim:.4f}")

# Demonstrate manual vector arithmetic
def word_analogy(
    word_a: str,
    word_b: str,
    word_c: str,
    tokenizer,
    embedding_table: torch.Tensor,
    top_k: int = 5,
) -> list:
    """
    Find word D such that A:B :: C:D
    i.e., D = B - A + C (the king - man + woman = queen analogy)
    """
    def get_embedding(word):
        ids = tokenizer.encode(word, add_special_tokens=False)
        return embedding_table[ids[0]]

    # Compute target vector: B - A + C
    vec_a = get_embedding(word_a)
    vec_b = get_embedding(word_b)
    vec_c = get_embedding(word_c)
    target = vec_b - vec_a + vec_c  # king - man + woman

    # Normalize
    target_norm = F.normalize(target.unsqueeze(0), dim=-1)
    all_norm = F.normalize(embedding_table, dim=-1)
    similarities = (target_norm @ all_norm.T).squeeze()

    # Get top-k
    top_vals, top_idxs = similarities.topk(top_k + 3)
    results = []
    exclude = {word_a.lower(), word_b.lower(), word_c.lower()}
    for val, idx in zip(top_vals, top_idxs):
        token = tokenizer.decode([idx.item()]).strip()
        if token.lower() not in exclude:
            results.append((token, val.item()))
            if len(results) >= top_k:
                break

    return results
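To see the arithmetic without downloading a model, here is a toy version with hand-built vectors. The vocabulary, the axis meanings, and the helper `toy_analogy` are all hypothetical; real embeddings are learned, and analogies in them hold only approximately.

```python
import torch
import torch.nn.functional as F

# Hypothetical hand-built vectors; axes are (royalty, male, female, fruit)
toy_vocab = ["king", "queen", "man", "woman", "apple"]
toy_table = torch.tensor([
    [1.0, 1.0, 0.0, 0.0],  # king
    [1.0, 0.0, 1.0, 0.0],  # queen
    [0.0, 1.0, 0.0, 0.0],  # man
    [0.0, 0.0, 1.0, 0.0],  # woman
    [0.0, 0.0, 0.0, 1.0],  # apple
])
index = {word: i for i, word in enumerate(toy_vocab)}


def toy_analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to b - a + c, excluding the three inputs."""
    target = toy_table[index[b]] - toy_table[index[a]] + toy_table[index[c]]
    sims = F.cosine_similarity(target.unsqueeze(0), toy_table, dim=-1)
    for word in (a, b, c):
        sims[index[word]] = float("-inf")  # never return an input word
    return toy_vocab[int(sims.argmax())]


answer = toy_analogy("man", "king", "woman")  # man : king :: woman : ?
print(answer)
```

Here king - man + woman lands exactly on the queen vector because the axes were constructed that way; in a trained model the target only lands near the expected word.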

Weight Tying: Input and Output Embeddings

In many transformer models, the input embedding table and the output projection matrix are shared (the same weights).

The output projection maps from $d_{model}$ to $V$ (the vocabulary size) to produce next-token probabilities. Its transpose has shape $V \times d_{model}$ - exactly the same as the input embedding table.

Press & Wolf (2017) showed that sharing these matrices ("weight tying") gives better performance and reduces parameters. The intuition: the input embedding should encode "what this token means," and the output projection should score "how likely is this token to appear next." These are related tasks - tokens with similar meanings should have similar input embeddings and similar output scores.

class TiedEmbeddingTransformer(nn.Module):
    """
    Transformer with tied input and output embeddings.
    The same matrix E is used for both input embedding lookup
    and output projection (logits computation).
    """

    def __init__(self, vocab_size: int, d_model: int, num_layers: int = 6):
        super().__init__()

        # Shared embedding table
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Transformer layers (simplified)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=num_layers,
        )

        # NO separate output projection - we reuse embedding.weight
        # The output logits are: hidden @ embedding.weight.T

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Scale embeddings
        x = self.embedding(token_ids) * (self.embedding.embedding_dim ** 0.5)

        # Process through transformer
        x = self.transformer(x)  # (batch, seq, d_model)

        # Compute logits using transposed embedding matrix (weight tying)
        # Shape: (batch, seq, d_model) @ (d_model, vocab_size) = (batch, seq, vocab_size)
        logits = x @ self.embedding.weight.T

        return logits


# Parameter savings from weight tying
vocab_size = 32000
d_model = 4096

tied_params = vocab_size * d_model  # One embedding table
untied_params = 2 * vocab_size * d_model  # Separate input and output matrices

print(f"Without weight tying: {untied_params:,} params in embeddings")
print(f"With weight tying:    {tied_params:,} params in embeddings")
print(f"Savings:              {untied_params - tied_params:,} params")
print(f"Savings in memory:    {(untied_params - tied_params) * 2 / 1e9:.2f} GB (float16)")
# With LLaMA-2 7B's dimensions, tying would save ~131M params and ~262 MB

GPT-2 and many BERT-class models use weight tying. LLaMA-2 does not (separate input and output embeddings). The choice affects both parameter count and the training dynamics.
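An alternative way to express tying in PyTorch, sketched here with small illustrative sizes, is to keep an explicit `nn.Linear` output head and point its weight at the embedding parameter. This works because `nn.Linear` stores its weight as `(out_features, in_features)`, which matches the shape of the embedding table.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64  # small illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Share one Parameter between the input lookup and the output projection
lm_head.weight = embedding.weight

shares_storage = lm_head.weight.data_ptr() == embedding.weight.data_ptr()

hidden = torch.randn(2, 5, d_model)
logits = lm_head(hidden)  # (2, 5, vocab_size)
matches_manual = torch.allclose(logits, hidden @ embedding.weight.T, atol=1e-4)

print(shares_storage, tuple(logits.shape), matches_manual)
```

Because both modules hold the same parameter, gradients from the lookup and from the logits accumulate into a single tensor during training.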

The Anisotropy Problem

A surprising and practically important property of transformer embeddings: they are anisotropic. This means the token embeddings don't uniformly fill the embedding space - they cluster in a narrow cone, leaving most of the space unused.

Li et al. (2020) analyzed this effect in BERT, and measurements of this kind have reported average cosine similarities between random pairs of embeddings as high as ~0.99 in extreme cases; the exact figure depends strongly on the model and layer. This is extremely unusual: if embeddings were uniformly distributed on a unit sphere in a high-dimensional space, the average pairwise similarity would be approximately 0.

Why does this happen?

Frequency effects: The most common tokens (function words: "the", "a", "is") are updated most often during training. They converge to dominant directions. Less frequent tokens are "pulled" toward these attractors by the gradient.
Softmax collapse: The output projection (logits) must produce valid probability distributions. Softmax normalization creates implicit pressure toward certain geometric configurations.
Training dynamics: In the early phases of training, a small number of tokens dominate (the most common ones), and their gradients reshape the embedding space. Later tokens are squeezed into the remaining capacity.

Why does it matter?

Cosine similarity, the standard metric for embedding comparison, is nearly meaningless when all embeddings point in the same direction. If you compute the similarity between "cat" and "airplane" in a strongly anisotropic space, you can get a value like ~0.98, barely distinguishable from the score for a genuinely related pair, even though the two words are semantically very different.

Solutions:

BERT-whitening: Transform embeddings to have zero mean and identity covariance (PCA whitening). Dramatically improves cosine similarity meaningfulness.
SimCSE (Gao et al., 2021): Contrastive learning objective that forces embeddings to be more isotropic and semantically meaningful.
Use sentence-level embeddings: Instead of raw token embeddings, pool over all tokens (mean pooling or [CLS] token) after a fine-tuned model. Much better semantic coherence.

def fix_anisotropy_whitening(embeddings: torch.Tensor) -> torch.Tensor:
    """
    Apply whitening transformation to embeddings to reduce anisotropy.
    Transforms embeddings to have zero mean and unit covariance.

    Args:
        embeddings: (N, d) tensor of embedding vectors
    Returns:
        whitened: (N, d) whitened embeddings
    """
    # Center
    mean = embeddings.mean(dim=0, keepdim=True)
    centered = embeddings - mean

    # Compute covariance
    cov = (centered.T @ centered) / (embeddings.shape[0] - 1)

    # Eigendecomposition
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)

    # Whitening transform: W = eigenvectors @ diag(1/sqrt(eigenvalues))
    # Clamp so that tiny or slightly negative eigenvalues (numerical noise) stay positive
    eps = 1e-8
    whitening_matrix = eigenvectors @ torch.diag(eigenvalues.clamp(min=eps).rsqrt())

    # Apply
    whitened = centered @ whitening_matrix
    return whitened


# Demonstrate anisotropy
import torch
import torch.nn.functional as F

# Simulate anisotropic embeddings (as BERT produces)
d = 768
N = 1000

# Real BERT embeddings would cluster around a dominant direction
# Simulate by biasing all embeddings toward a common direction
dominant_dir = torch.randn(d)
dominant_dir = dominant_dir / dominant_dir.norm()

# Add dominant direction with large coefficient + small random component
raw_embeddings = 3 * dominant_dir.unsqueeze(0) + 0.1 * torch.randn(N, d)

# Compute average pairwise cosine similarity (sample 100 pairs)
normalized = F.normalize(raw_embeddings, dim=-1)
sim_matrix = normalized[:100] @ normalized[:100].T
avg_sim = (sim_matrix.sum() - sim_matrix.trace()) / (100 * 99)
print(f"Average pairwise cosine similarity (raw): {avg_sim:.4f}")

# Apply whitening
whitened = fix_anisotropy_whitening(raw_embeddings)
whitened_norm = F.normalize(whitened, dim=-1)
sim_whitened = whitened_norm[:100] @ whitened_norm[:100].T
avg_sim_whitened = (sim_whitened.sum() - sim_whitened.trace()) / (100 * 99)
print(f"Average pairwise cosine similarity (whitened): {avg_sim_whitened:.4f}")
# Should be much lower - closer to 0 for isotropic embeddings

Embedding Dimension and Model Capacity

The embedding dimension $d_{model}$ is one of the primary levers for model capacity.

Small $d_{model}$ (e.g., 256): Each token's representation is a 256-dimensional vector. This limits the number of independent semantic features that can be encoded. Fine for simple classification, insufficient for complex language understanding.

Large $d_{model}$ (e.g., 4096, 12288): Each token lives in a much higher-dimensional space. More independent features can be simultaneously represented. Better for complex reasoning, multilingual understanding, long-range dependencies.

The catch: $d_{model}$ multiplies everything. The attention projections cost $O(d_{model}^2)$ parameters per layer, and so does the FFN. The embedding table is $O(V \times d_{model})$. Doubling $d_{model}$ therefore roughly quadruples the per-layer parameter count and compute, while the embedding table only doubles.
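The scaling can be made concrete with a rough count. The sketch below assumes a standard layer with four attention projections and a two-matrix FFN with a 4× hidden expansion, and ignores biases and normalization weights; the function names are illustrative.

```python
def per_layer_params(d_model: int, ffn_multiplier: int = 4) -> int:
    """Approximate weight count of one transformer layer (no biases or norms)."""
    attention = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_multiplier * d_model)  # up and down projections
    return attention + ffn


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Size of the V x d_model embedding table."""
    return vocab_size * d_model


layer_ratio = per_layer_params(2048) / per_layer_params(1024)
embedding_ratio = embedding_params(32_000, 2048) / embedding_params(32_000, 1024)

print(f"Per-layer parameters grow {layer_ratio:.0f}x")
print(f"Embedding table grows     {embedding_ratio:.0f}x")
```

Under these assumptions each layer holds $12 \, d_{model}^2$ weights, so the quadratic term dominates once the model has more than a few layers.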

| Model | $d_{model}$ | Notes |
| --- | --- | --- |
| BERT-base | 768 | Sized to match the original GPT for comparison |
| BERT-large | 1,024 | Better performance at higher compute |
| GPT-2 small | 768 | Same width as the original GPT and BERT-base |
| GPT-3 | 12,288 | Pushes capacity for in-context learning |
| LLaMA-2 7B | 4,096 | Width used at the 7B parameter scale |
| LLaMA-2 70B | 8,192 | Width used at the 70B parameter scale |

Visualizing Embedding Space

t-SNE and UMAP are the standard tools for visualizing high-dimensional embeddings in 2D.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def visualize_embeddings_tsne(
    words: list[str],
    embeddings: np.ndarray,
    categories: list[str] = None,
    title: str = "Token Embedding Space",
    perplexity: int = 30,
):
    """
    Visualize a set of word embeddings using t-SNE.

    Args:
        words: List of word labels
        embeddings: (N, d) numpy array of embedding vectors
        categories: Optional list of category labels for color coding
    """
    # t-SNE dimensionality reduction
    tsne = TSNE(
        n_components=2,
        perplexity=min(perplexity, len(words) - 1),
        random_state=42,
        max_iter=1000,
    )
    embeddings_2d = tsne.fit_transform(embeddings)

    # Plot
    fig, ax = plt.subplots(figsize=(12, 8))

    if categories:
        unique_cats = list(set(categories))
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_cats)))
        cat_to_color = {cat: color for cat, color in zip(unique_cats, colors)}

        for word, (x, y), cat in zip(words, embeddings_2d, categories):
            ax.scatter(x, y, color=cat_to_color[cat], s=100, alpha=0.7)
            ax.annotate(word, (x, y), fontsize=9, ha='center', va='bottom')
    else:
        ax.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.7)
        for word, (x, y) in zip(words, embeddings_2d):
            ax.annotate(word, (x, y), fontsize=9, ha='center', va='bottom')

    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.set_xlabel("t-SNE dimension 1")
    ax.set_ylabel("t-SNE dimension 2")
    plt.tight_layout()
    plt.show()


# Example groups that should cluster together in a trained model
word_groups = {
    "Capitals": ["London", "Paris", "Berlin", "Tokyo", "Beijing", "Moscow"],
    "Animals": ["cat", "dog", "lion", "tiger", "elephant", "whale"],
    "Tech": ["python", "javascript", "kubernetes", "docker", "tensorflow"],
    "Verbs": ["run", "jump", "swim", "fly", "walk", "sprint"],
}

all_words = [w for group in word_groups.values() for w in group]
categories = [cat for cat, words in word_groups.items() for _ in words]

# In a real scenario, fetch embeddings from BERT or similar
# For demonstration, generate synthetic embeddings with clustering
np.random.seed(42)
n_groups = len(word_groups)
synthetic_embeddings = np.vstack([
    np.random.randn(len(words), 768) * 0.1 + np.random.randn(768) * 2
    for words in word_groups.values()
])

# print("Visualizing embedding space (would display a plot)...")
# visualize_embeddings_tsne(all_words, synthetic_embeddings, categories)

Production Engineering Notes

Embedding Layer Memory

For LLaMA-2 7B: vocabulary 32K × d_model 4096 × 2 bytes (bfloat16) ≈ 262 MB for the embedding table. This is always loaded into GPU memory, regardless of batch size. For small GPU setups, this is non-trivial.

For LLaMA-3 8B: vocabulary 128K × 4096 × 2 bytes ≈ 1 GB just for embeddings. The larger vocabulary significantly increases embedding memory.
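These estimates follow from a single multiplication. In the sketch below, the exact LLaMA-3 vocabulary size of 128,256 is an assumption on my part (the text above only says 128K), and the helper name is illustrative.

```python
def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory for one V x d_model table; 2 bytes per value for float16/bfloat16."""
    return vocab_size * d_model * bytes_per_param


llama2_bytes = embedding_table_bytes(32_000, 4_096)
llama3_bytes = embedding_table_bytes(128_256, 4_096)  # assumed exact vocab size

print(f"LLaMA-2 7B table: {llama2_bytes / 1e6:.0f} MB")
print(f"LLaMA-3 8B table: {llama3_bytes / 1e9:.2f} GB")
```

A model with untied input and output embeddings pays this cost twice, once for each matrix.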

Frozen vs Fine-tuned Embeddings

When fine-tuning a large model on a small dataset:

Freeze embeddings: The embedding table is well-trained on billions of tokens. Updating it on a small dataset risks overwriting general-purpose representations with task-specific ones.
Fine-tune embeddings: Necessary if the fine-tuning data contains domain-specific vocabulary (medical, legal, code) that has different semantics than general internet text.

A common practice: freeze the embedding layers for the first few epochs, then optionally unfreeze them for later epochs.
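A minimal sketch of freezing in PyTorch follows. `TinyClassifier` is a hypothetical stand-in for a pretrained model; the pattern is to disable gradients on the embedding weight and pass only trainable parameters to the optimizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained model with an embedding layer."""

    def __init__(self, vocab_size: int = 100, d_model: int = 16, n_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embedding(token_ids).mean(dim=1))


model = TinyClassifier()
model.embedding.weight.requires_grad_(False)  # freeze the embedding table

# Only hand trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

embedding_before = model.embedding.weight.detach().clone()
head_before = model.head.weight.detach().clone()

# One training step on random data
logits = model(torch.randint(0, 100, (8, 12)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()
optimizer.step()

embedding_unchanged = torch.equal(embedding_before, model.embedding.weight)
head_changed = not torch.equal(head_before, model.head.weight)
print(embedding_unchanged, head_changed)
```

To unfreeze later, call `requires_grad_(True)` on the embedding weight and add it to the optimizer, for example as a new parameter group with a smaller learning rate.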

Embedding as a Feature Extractor

BERT embeddings (contextual representations from the final layer) are widely used as general-purpose features for downstream tasks without full fine-tuning:

# Extract BERT embeddings for a batch of texts
def extract_embeddings_bert(
    texts: list[str],
    model,
    tokenizer,
    pooling: str = 'mean',  # 'mean', 'cls', or 'max'
) -> torch.Tensor:
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors='pt',
    )

    with torch.no_grad():
        outputs = model(**inputs)

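    # outputs.last_hidden_state: (batch, seq, d_model)
    hidden_states = outputs.last_hidden_state
    # (batch, seq, 1); 1.0 for real tokens, 0.0 for padding
    attention_mask = inputs['attention_mask'].unsqueeze(-1).float()

    if pooling == 'cls':
        # Take [CLS] token representation
        return hidden_states[:, 0, :]
    elif pooling == 'mean':
        # Mean of all non-padding tokens
        sum_embeddings = (hidden_states * attention_mask).sum(dim=1)
        count = attention_mask.sum(dim=1).clamp(min=1.0)
        return sum_embeddings / count
    elif pooling == 'max':
        # Exclude padding positions so they can never win the max
        masked = hidden_states.masked_fill(attention_mask == 0, float('-inf'))
        return masked.max(dim=1).values
    raise ValueError(f"Unknown pooling mode: {pooling!r}")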

Mean pooling often outperforms [CLS] pooling on semantic similarity tasks, particularly when the model has not been fine-tuned for similarity. The [CLS] representation was designed for classification (where the model aggregates into [CLS]), not for sentence similarity.
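The reason the attention mask matters for mean pooling is easiest to see on a tiny hand-made tensor. The values below are invented for illustration; the third position plays the role of padding.

```python
import torch

# One sequence, three positions, d_model = 2; the last position is padding
hidden_states = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

mask = attention_mask.unsqueeze(-1).float()  # (batch, seq, 1)
masked_mean = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
naive_mean = hidden_states.mean(dim=1)       # wrongly averages in the padding
cls_vector = hidden_states[:, 0, :]          # first position only

print("masked mean:", masked_mean.tolist())
print("naive mean: ", naive_mean.tolist())
print("cls vector: ", cls_vector.tolist())
```

The masked mean averages only the two real tokens, while the naive mean is dominated by whatever values sit in the padded position.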

Common Mistakes

:::danger Using raw token embeddings for semantic similarity

Raw token embeddings (before any transformer layers) are a reasonable starting point but a poor basis for measuring similarity. A better approach: pass text through the full transformer and use the final layer's representations. Even better: use a model specifically fine-tuned for semantic similarity (sentence-transformers, E5, etc.).

:::

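:::warning Ignoring the scale of embeddings

The original transformer multiplies embeddings by $\sqrt{d_{model}}$ before adding positional encoding. If you skip this scaling in that setup, the positional encoding signal can dominate the token embedding signal (sinusoidal positional encodings have values in $[-1, 1]$, but token embeddings may have values around $[-0.1, 0.1]$ after a small-variance initialization). Apply the scale factor when you reproduce that architecture; models that use learned or rotary position embeddings often omit it, so check the reference implementation you are following.

:::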
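The size mismatch can be measured directly. The sketch below builds the sinusoidal positional encoding from the original transformer paper and compares its root-mean-square value with token embeddings drawn at standard deviation $d_{model}^{-0.5}$. That initialization is an assumption chosen for illustration; PyTorch's default `nn.Embedding` initialization is N(0, 1), which would already have an RMS near 1.

```python
import math
import torch

torch.manual_seed(0)
d_model, max_len = 512, 128

# Sinusoidal positional encoding: sin on even indices, cos on odd indices
position = torch.arange(max_len).unsqueeze(1).float()
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Assumption: embeddings initialized with std = d_model ** -0.5
token_emb = torch.randn(max_len, d_model) * d_model ** -0.5


def rms(x: torch.Tensor) -> float:
    """Root-mean-square magnitude of all entries."""
    return x.pow(2).mean().sqrt().item()


pe_rms = rms(pe)
raw_rms = rms(token_emb)
scaled_rms = rms(token_emb * math.sqrt(d_model))

print(f"positional encoding RMS: {pe_rms:.3f}")
print(f"raw embedding RMS:       {raw_rms:.3f}")
print(f"scaled embedding RMS:    {scaled_rms:.3f}")
```

Under this initialization the unscaled embeddings are more than an order of magnitude smaller than the positional signal, and multiplying by $\sqrt{d_{model}}$ brings the two to a comparable scale.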

:::tip Anisotropy makes cosine similarity unreliable on raw BERT embeddings

If you're building a semantic search system and using raw BERT [CLS] embeddings with cosine similarity, your results may be poor. Use a sentence-transformers model (which is fine-tuned specifically for similarity with contrastive learning) or apply whitening. The anisotropy of raw BERT embeddings makes cosine similarity nearly meaningless, because unrelated pairs can score almost as high as related ones.

:::

Interview Q&A

Q1: What is a token embedding, and what does it mean geometrically?

Answer: A token embedding is a vector $e_i \in \mathbb{R}^{d_{model}}$ that represents the "meaning" of token $i$ before any context is applied. It's the row $i$ of a learned lookup table $E \in \mathbb{R}^{V \times d_{model}}$ .

Geometrically, it's a point in $d_{model}$ -dimensional space. The key property: semantically similar tokens are near each other (high cosine similarity). "Run" and "sprint" should be close. "Cat" and "feline" should be close. "Cat" and "photosynthesis" should be far.

This structure emerges from training. If two tokens appear in similar contexts (they're substitutable in many sentences), gradient descent pushes their embeddings toward each other. Over millions of training steps on billions of tokens, the geometry reflects the statistical structure of language.

The geometric relationships are consistent: analogies like "king - man + woman ≈ queen" hold because gender is consistently encoded as a direction in the space, and royalty is another direction. These are not programmed - they emerge from optimizing next-token prediction.

Q2: What is weight tying in transformer models? Why is it beneficial?

Answer: Weight tying means sharing the input embedding matrix $E$ with the output projection matrix $W_{out}$ , so that $W_{out} = E^T$ .

Why beneficial: the input embedding encodes "what is the meaning of token $i$ ?". The output projection decodes "what is the probability of token $i$ appearing next?". Both tasks benefit from the same underlying representation of tokens.

Mathematically: without tying, you have $V \times d_{model}$ input parameters and $V \times d_{model}$ output parameters - $2V \times d_{model}$ total. With tying, just $V \times d_{model}$ - a 2× reduction in these matrices.

Empirically: Press & Wolf (2017) reported that tying reduces perplexity in language models while also cutting the parameter count. The shared representation is a useful inductive bias - tokens that are similar inputs should also have similar output probabilities.

Trade-off: the tied constraint means the embedding space must simultaneously serve two purposes. In some models, this constraint hurts (when input and output distributions are very different). LLaMA-2, for example, does not use weight tying.

Q3: What is anisotropy in transformer embeddings and why does it matter?

Answer: Anisotropy means the embedding vectors are not uniformly distributed across the embedding space - they all point in similar directions. In strongly anisotropic models, the average cosine similarity between random embedding pairs has been reported as high as ~0.99, nearly 1.0. In an ideal isotropic space, it should be near 0.

Why it happens: High-frequency tokens (function words: "the", "a", "is") have strong gradients and dominate the embedding space geometry. Rare tokens get pulled toward these dominant directions. The output softmax creates additional geometric constraints. These combine to cluster most embeddings in a narrow cone.

Why it matters:

Cosine similarity becomes meaningless - "cat" and "airplane" might have 0.98 similarity, indistinguishable from "cat" and "dog" (0.995)
The effective dimensionality of the embedding space is much lower than $d_{model}$
Models trained for retrieval/similarity that rely on raw embeddings will perform poorly

Practical solutions:

Use a model fine-tuned for similarity (SimCSE, sentence-transformers) - contrastive fine-tuning explicitly forces isotropic representations
Apply whitening post-hoc (subtract mean, apply PCA whitening)
Use mean-pooled embeddings from the full transformer rather than the raw embedding table

Q4: How does the embedding dimension affect model capacity and what are the tradeoffs?

Answer: $d_{model}$ is the width of the model: the residual stream, the concatenated query/key/value vectors across heads, and the FFN input and output are all $d_{model}$-dimensional.

Increasing $d_{model}$ :

More dimensions = more independent semantic features can be encoded simultaneously
Better at representing complex, multi-faceted meanings
Higher capacity for the FFN key-value memory
Cost: computation scales as $O(d_{model}^2)$ for attention and FFN - doubling $d_{model}$ roughly quadruples compute
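Memory: embedding table is $V \times d_{model}$ ; KV cache is $2 \times L \times n \times d_{model}$ values per sequence, where $L$ is the number of layers and $n$ is the sequence length (assuming standard multi-head attention)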
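As a rough illustration of the KV cache term, the sketch below plugs in 32 layers, a 4096-token sequence, and $d_{model} = 4096$ at 2 bytes per value. It assumes standard multi-head attention where every head's keys and values are cached; grouped-query or multi-query attention would shrink the result. The function name is illustrative.

```python
def kv_cache_bytes(
    n_layers: int,
    seq_len: int,
    d_model: int,
    bytes_per_value: int = 2,
    batch_size: int = 1,
) -> int:
    """KV cache size: one key tensor and one value tensor per layer."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value * batch_size


cache = kv_cache_bytes(n_layers=32, seq_len=4096, d_model=4096)
print(f"KV cache for one 4096-token sequence: {cache / 2**30:.2f} GiB")
```

Unlike the embedding table, this cost grows linearly with both sequence length and batch size, so it often dominates memory at inference time.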

Scaling laws perspective (Kaplan et al., 2020): at a fixed non-embedding parameter budget, loss depends only weakly on model shape, that is, on how parameters are split between width ($d_{model}$) and depth (number of layers), as long as the aspect ratio is not extreme. Total parameter count, data, and compute matter far more than the exact choice of $d_{model}$.

In practice: modern LLMs use $d_{model} = 4096$ (7B models) to $d_{model} = 12288$ (175B models). The ratio of $d_{model}$ to number of layers is roughly constant across scales.

Q5: What is the difference between a token embedding and a contextual embedding?

Answer:

Token embedding (static): the raw vector from the embedding lookup table $E_i$ . Same for every occurrence of token $i$ , regardless of context. "Bank" in "river bank" and "bank account" has the exact same token embedding.

Contextual embedding (dynamic): the representation of token $i$ after passing through some or all transformer layers. This vector incorporates information from all surrounding tokens via attention. "Bank" in "river bank" has a very different contextual embedding than "bank" in "bank account" - the attention layers have updated it based on context.

The transformer is fundamentally about converting token embeddings (static, context-free) into contextual embeddings (dynamic, context-aware). Each attention layer refines the representation by mixing in information from relevant context tokens.
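The distinction can be demonstrated with a single untrained layer. In the sketch below, the token ID standing in for "bank" and its neighbors are arbitrary, the layer has random weights, and there are no positional encodings, so the output carries no linguistic meaning; it only shows that the same static vector becomes two different contextual vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 50, 32

embedding = nn.Embedding(vocab_size, d_model)
# dropout=0.0 keeps the forward pass deterministic
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dropout=0.0, batch_first=True
)

BANK = 7                                # hypothetical token ID for "bank"
seq_a = torch.tensor([[3, BANK, 11]])   # "bank" in one context
seq_b = torch.tensor([[20, BANK, 42]])  # same token, different neighbors

with torch.no_grad():
    static_a = embedding(seq_a)[0, 1]
    static_b = embedding(seq_b)[0, 1]
    contextual_a = layer(embedding(seq_a))[0, 1]
    contextual_b = layer(embedding(seq_b))[0, 1]

static_identical = torch.equal(static_a, static_b)
contextual_identical = torch.allclose(contextual_a, contextual_b, atol=1e-6)

print(f"Static embeddings identical:     {static_identical}")
print(f"Contextual embeddings identical: {contextual_identical}")
```

The lookup returns the same row for both sequences, while attention mixes in different neighbors and so produces different outputs at the same position.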

This is the key advantage over Word2Vec, which produced only static embeddings. BERT's and GPT's contextual embeddings can represent polysemy (words with multiple meanings), negation, coreference - all context-dependent phenomena that static embeddings cannot capture.

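For downstream tasks: use contextual embeddings (from the final layer of the transformer, after running the full model), not raw token embeddings, whenever meaning depends on context. For semantic similarity specifically, contextual embeddings from a model fine-tuned for that purpose generally perform much better than a raw embedding-table lookup.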
