Knowledge Graph Embeddings
The Real Interview Moment
"Google's Knowledge Graph has 500 billion facts about 5 billion entities. Billions of potential triples are missing. How do you predict which facts are true without running another web crawl?"
This is exactly the problem Antoine Bordes and colleagues at Facebook AI Research tackled in 2013. Their insight: if entities and relations can be embedded in a vector space such that valid triples satisfy a geometric constraint, then you can score any potential triple by checking whether it satisfies that constraint. Missing facts can be predicted by finding triples whose embeddings satisfy the constraint even though the triple was never observed.
The 2013 paper Translating Embeddings for Modeling Multi-relational Data (TransE) showed this worked. If you represent each entity as a point in and each relation as a translation vector, then being true means - the head entity translated by the relation vector should land on the tail entity. A strikingly simple idea. TransE could predict missing facts with MRR scores that blew away all previous approaches on benchmark datasets.
The broader impact extended far beyond link prediction. Knowledge graph embeddings became the backbone of how modern AI systems reason about structured world knowledge - powering Google's Knowledge Panel, question answering systems, drug-protein interaction prediction, and knowledge-enhanced language models like ERNIE and KEPLER. When GPT-3 "knows" that Paris is the capital of France, it learned this from text. But for structured factual reasoning at scale, explicit knowledge graph embeddings are still more reliable, more interpretable, and more efficient than neural language model inference.
This lesson builds from the geometric intuition of TransE through the complex-valued rotations of RotatE, and into CompGCN - which marries knowledge graph embeddings with graph neural networks for multi-relational graph convolution.
Why This Exists - Knowledge Graphs Are Incomplete
Freebase, Wikidata, DBpedia, Google Knowledge Graph - all are vastly incomplete despite containing hundreds of millions to hundreds of billions of facts. Studies estimate:
- Freebase (2013): 71% of people in the KG had no known place of birth
- Wikidata (2022): roughly 40% of entities lack basic properties
- Drug-protein interaction KGs: fewer than 10% of known drug-protein pairs have been experimentally validated
Manual completion is intractable. Web-based information extraction is noisy. Knowledge graph completion (KGC) - predicting missing triples from the existing graph structure - is a cleaner, more principled approach.
The core insight making this possible: real-world KGs are not random. They have patterns:
- Symmetry: if person A is a sibling of person B, then B is a sibling of A
- Antisymmetry: if A is the parent of B, then B is not the parent of A
- Inversion: "born in" is the inverse of "birthplace of"
- Composition: if A is the father of B and B is the father of C, then A is the grandfather of C
- 1-N relations: one person can have many nationalities
- N-1 relations: many people share one birthplace
A good embedding method must capture all these relational patterns. As we will see, TransE is good at some and terrible at others, which motivated the evolution toward RotatE and ComplEx.
Historical Context
| Year | Model | Key Innovation |
|---|---|---|
| 2011 | RESCAL | Bilinear model, relation as full matrix |
| 2013 | TransE | Relation as translation vector |
| 2014 | TransR | Relation-specific projection: |
| 2014 | TransH | Projection to hyperplane for 1-N/N-1 relations |
| 2015 | DistMult | Relation as diagonal matrix, symmetric only |
| 2016 | ComplEx | Complex-valued embeddings, handles asymmetry |
| 2018 | ConvE | 2D convolution over entity-relation feature maps |
| 2019 | RotatE | Relation as rotation in complex space, |
| 2020 | CompGCN | GNN for multi-relational graphs, combines GNN + KGE |
TransE (Bordes et al., NeurIPS 2013) is the foundational paper that defined the field. RotatE (Sun et al., ICLR 2019) is the current standard for geometric KG embedding. CompGCN (Vashishth et al., ICLR 2020) is the state-of-the-art approach combining GNNs with KG embeddings.
Knowledge Graph Structure
A knowledge graph consists of:
- : set of entities (nodes) - people, places, concepts, molecules, etc.
- : set of relations (edge types) - "born in," "works at," "is a subclass of," etc.
- : set of triples (facts) - means "head entity has relation to tail entity "
Examples:
Link prediction task: given a query or , predict the missing entity from . This is evaluated by ranking all entities by a score function and measuring how high the true answer ranks.
TransE - Relation as Translation
The Core Idea
TransE (Bordes et al., 2013) embeds entities as vectors and relations as translation vectors . The constraint is:
The scoring function is the negative distance:
Higher score (less negative) = more likely to be a true triple.
Geometric Intuition
Think of entities as points in 2D space. The relation "capital of" is a translation arrow. If you start at "Paris" and follow the "capital of" arrow, you should land near "France". If you start at "Berlin" and follow the same "capital of" arrow, you should land near "Germany". Different head entities, same relation arrow, different tail entities - the arrow represents the semantic direction of the relation.
Training with Margin Loss
TransE trains with a margin-based ranking loss:
where , is the margin, and are corrupted (negative) triples generated by replacing either the head or tail entity with a random entity from .
Where TransE Fails
TransE cannot model several important relation patterns:
Symmetric relations: if is true then should also be true. TransE requires and simultaneously, which implies , collapsing the relation to a trivial zero vector.
1-N relations: if and are both true, TransE requires and , forcing in embedding space.
N-1 relations: similarly, many people born in Paris forces all their embeddings to cluster together - conflating different entities.
These failures motivated TransR, TransH, DistMult, ComplEx, and ultimately RotatE.
TransR and TransH - Overcoming 1-N Limitations
TransR
TransR introduces a relation-specific projection matrix that maps entity embeddings from entity space () into relation space ():
Different relations project entities into different subspaces. A 1-N relation projects many head entities () to the same region in relation space, allowing them to have the same tail without forcing them to be identical in entity space.
TransH
TransH projects entities onto a relation-specific hyperplane (unit normal vector) before applying translation:
Both TransR and TransH improve over TransE on 1-N and N-1 relations but introduce significantly more parameters and computational cost.
DistMult and ComplEx - Bilinear Models
DistMult
DistMult (Yang et al., 2015) uses a bilinear scoring function with a diagonal relation matrix:
Each relation is a vector of scaling factors. Efficient and effective, but symmetric by construction: always. Cannot model antisymmetric relations (parent-of, born-before, etc.).
ComplEx - Complex-Valued Embeddings
Trouillon et al. (2016) extended DistMult to complex numbers. Entities and relations are embedded as . The scoring function uses the real part of the Hermitian dot product:
where is the complex conjugate of . The asymmetry arises because in general - using the conjugate breaks the symmetry constraint of DistMult.
Key insight: complex embeddings have twice the expressive power for the same embedding dimension. A -dimensional complex embedding corresponds to real parameters, but the complex structure enforces a specific coupling between the paired real dimensions that encodes symmetry/antisymmetry naturally.
RotatE - Relation as Rotation in Complex Space
Sun et al. (2019) unified the treatment of relational patterns by representing each relation as an element-wise rotation in complex space.
The Core Equation
Entities are embedded as . Relations are embedded as with the constraint that each component has unit modulus: for all . This means for some angle .
The scoring function is:
where denotes element-wise (Hadamard) product in . The constraint ensures represents a pure rotation in the 2D complex plane for each dimension.
Geometric Intuition
In each dimension , the operation rotates the complex number by angle . The relation defines a sequence of rotations - one per dimension. A valid triple means: rotating by the relation's rotation pattern should produce .
Why RotatE Handles All Relational Patterns
Symmetry (): when or for all , the rotation is its own inverse. and - both are true.
Antisymmetry (): when , rotating by gives , but rotating by gives something different. The rotation direction enforces antisymmetry.
Inversion (): if (complex conjugate - reverse rotation), then implies . Inversion is modeled by conjugate relation embeddings.
Composition (): if and , then . Composition is modeled by element-wise product of relation vectors - rotation composition.
This unified treatment of all four relational patterns is what makes RotatE superior to all previous geometric models. TransE can handle antisymmetry and composition but fails on symmetry and inversion. ComplEx handles symmetry, antisymmetry, and inversion but not composition (due to the non-associativity of its scoring function). RotatE handles all four.
Self-Adversarial Negative Sampling
Standard negative sampling (replace head or tail with a random entity) produces mostly easy negatives. RotatE introduced self-adversarial negative sampling: sample negatives with probability proportional to their current model score:
where is a temperature parameter. High-scoring negatives (the model's current mistakes) are sampled with higher probability. This creates harder negatives as training progresses, dramatically improving convergence and final accuracy.
The RotatE loss with self-adversarial negative sampling:
where is the margin.
ConvE - 2D Convolution for KG Completion
Dettmers et al. (2018) took a radically different approach: instead of geometric operations, use 2D convolution over entity-relation feature maps.
The scoring function:
- Reshape and concatenate and into a 2D feature map
- Apply a bank of 2D convolutional filters
- Flatten and apply a linear layer
- Dot product with tail entity
ConvE is highly parameter-efficient - it achieves strong results with far fewer parameters than TransE or RotatE because convolution creates feature interactions across multiple embedding dimensions simultaneously. However, it lacks the clean geometric interpretation of RotatE.
Evaluation Metrics
Standard evaluation: for each test triple , corrupt it by replacing the tail entity with every entity in . Rank all entities by score. Report:
Mean Reciprocal Rank (MRR):
High MRR means true answers consistently rank near the top. MRR is dominated by high-rank answers.
Hits@k: fraction of queries where the true answer ranks in the top-:
Standard values: Hits@1, Hits@3, Hits@10.
Filtered setting: when computing the rank of the true tail, remove all other known true tails from the ranking. This prevents penalizing a model for ranking another valid triple above the test triple.
Benchmark Results - FB15k-237 and WN18RR
FB15k-237 (Freebase subset, 237 relation types, 310K triples) and WN18RR (WordNet, 11 relation types, 93K triples) are the standard benchmarks. WN18RR is dominated by hierarchical relations (hypernym, hyponym, member meronym). FB15k-237 has diverse relation patterns.
| Model | FB15k-237 MRR | FB15k-237 Hits@10 | WN18RR MRR | WN18RR Hits@10 |
|---|---|---|---|---|
| TransE | 0.294 | 0.465 | 0.226 | 0.501 |
| DistMult | 0.241 | 0.419 | 0.430 | 0.490 |
| ComplEx | 0.247 | 0.428 | 0.440 | 0.510 |
| ConvE | 0.325 | 0.501 | 0.430 | 0.520 |
| RotatE | 0.338 | 0.533 | 0.476 | 0.571 |
| CompGCN | 0.355 | 0.535 | 0.479 | 0.546 |
RotatE substantially outperforms earlier geometric models on both benchmarks. CompGCN improves further on FB15k-237 by incorporating multi-hop neighborhood structure through graph convolution.
CompGCN - GNNs for Multi-Relational Graphs
Vashishth et al. (2020) introduced CompGCN, which combines graph neural networks with knowledge graph embeddings for multi-relational graph convolution.
The Challenge
Standard GNNs (GCN, GraphSAGE) treat all edges as the same type. KGs have hundreds of relation types - edges are typed, and the relation type fundamentally changes the meaning of a message. You cannot apply the same aggregation to "born in" and "works at" edges as if they were identical.
CompGCN Approach
CompGCN maintains both entity embeddings and relation embeddings . Messages incorporate the relation type via a composition operation :
The composition operation can be:
- Subtraction: (TransE-inspired)
- Multiplication: (DistMult-inspired)
- Circular correlation: (ConvE-inspired)
Relation embeddings are updated alongside entity embeddings:
CompGCN Architecture
CompGCN's key insight: pure KGE methods (TransE, RotatE) only look at the direct triple . They miss the multi-hop relational context - the fact that entity participates in 50 different relation types with 200 other entities. GNNs capture this context. CompGCN combines both: GNN layers propagate multi-hop context, KGE decoder exploits geometric structure of the final embeddings.
Full PyTorch RotatE Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Optional
class RotatE(nn.Module):
"""
RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.
Sun et al., ICLR 2019.
Entities: h, t ∈ C^d (stored as real vectors of size 2d)
Relations: r ∈ C^d with |r_i| = 1 (phase angles θ_r, size d)
Score: f(h, r, t) = -‖h ∘ r - t‖
Implementation stores:
- entity embeddings: real+imaginary parts, shape [n_entities, 2*embed_dim]
- relation embeddings: phase angles θ, shape [n_relations, embed_dim]
→ r_i = exp(i·θ_i) = (cos θ_i, sin θ_i)
"""
def __init__(
self,
n_entities: int,
n_relations: int,
embed_dim: int = 500,
gamma: float = 12.0, # margin
adversarial_temp: float = 1.0,
reg_lambda: float = 0.0,
):
super().__init__()
self.n_entities = n_entities
self.n_relations = n_relations
self.embed_dim = embed_dim
self.gamma = gamma
self.adversarial_temp = adversarial_temp
self.reg_lambda = reg_lambda
# Entity embeddings: real and imaginary parts
self.entity_embedding = nn.Embedding(n_entities, 2 * embed_dim)
# Relation embeddings: phase angles (no unit norm constraint needed -
# we use sin/cos to enforce it implicitly)
self.relation_embedding = nn.Embedding(n_relations, embed_dim)
self._init_weights()
def _init_weights(self):
"""Initialize uniformly in [-1, 1] for entities, [0, 2π] for relations."""
nn.init.uniform_(self.entity_embedding.weight, -1.0, 1.0)
nn.init.uniform_(self.relation_embedding.weight, 0, 2 * np.pi)
def _get_complex_embeddings(self, entity_ids: torch.Tensor):
"""
Returns (real, imag) parts of entity embeddings.
entity_ids: [B]
Returns: real [B, d], imag [B, d]
"""
emb = self.entity_embedding(entity_ids) # [B, 2d]
real = emb[:, :self.embed_dim] # [B, d]
imag = emb[:, self.embed_dim:] # [B, d]
return real, imag
def _get_relation_rotation(self, relation_ids: torch.Tensor):
"""
Returns (cos θ, sin θ) for each relation dimension.
Enforces |r_i| = 1 by construction.
relation_ids: [B]
Returns: cos_r [B, d], sin_r [B, d]
"""
theta = self.relation_embedding(relation_ids) # [B, d]
cos_r = torch.cos(theta) # [B, d]
sin_r = torch.sin(theta) # [B, d]
return cos_r, sin_r
def score(
self,
head_ids: torch.Tensor,
rel_ids: torch.Tensor,
tail_ids: torch.Tensor,
) -> torch.Tensor:
"""
Compute RotatE score f(h, r, t) = -‖h ∘ r - t‖.
Positive triples get high (less negative) scores.
Returns: [B] scores
"""
h_re, h_im = self._get_complex_embeddings(head_ids)
t_re, t_im = self._get_complex_embeddings(tail_ids)
cos_r, sin_r = self._get_relation_rotation(rel_ids)
# Complex multiplication: (h_re + i·h_im) * (cos_r + i·sin_r)
# = (h_re·cos_r - h_im·sin_r) + i·(h_re·sin_r + h_im·cos_r)
rot_re = h_re * cos_r - h_im * sin_r # [B, d]
rot_im = h_re * sin_r + h_im * cos_r # [B, d]
# Distance: ‖h ∘ r - t‖
diff_re = rot_re - t_re # [B, d]
diff_im = rot_im - t_im # [B, d]
# L2 norm of complex difference
norm = torch.sqrt(diff_re.pow(2) + diff_im.pow(2) + 1e-9) # [B, d]
return -norm.sum(dim=-1) # [B]
def forward(
self,
pos_heads: torch.Tensor,
pos_rels: torch.Tensor,
pos_tails: torch.Tensor,
neg_heads: torch.Tensor,
neg_rels: torch.Tensor,
neg_tails: torch.Tensor,
) -> torch.Tensor:
"""
Compute RotatE loss with self-adversarial negative sampling.
pos_*: [B] - positive triple components
neg_*: [B, K] - K negative samples per positive triple
Returns: scalar loss
"""
batch_size = pos_heads.size(0)
n_neg = neg_heads.size(1)
# Positive scores
pos_scores = self.score(pos_heads, pos_rels, pos_tails) # [B]
# Negative scores: reshape to [B*K] for batch scoring
neg_h = neg_heads.view(-1)
neg_r = neg_rels.view(-1)
neg_t = neg_tails.view(-1)
neg_scores = self.score(neg_h, neg_r, neg_t).view(batch_size, n_neg) # [B, K]
# Self-adversarial weights: p ∝ exp(α·score)
with torch.no_grad():
neg_weights = F.softmax(
self.adversarial_temp * neg_scores, dim=-1
) # [B, K]
# RotatE loss (equation from paper)
pos_loss = -F.logsigmoid(self.gamma + pos_scores).mean()
neg_loss = -(neg_weights * F.logsigmoid(
-(self.gamma + neg_scores)
)).sum(dim=-1).mean()
loss = (pos_loss + neg_loss) / 2
# L2 regularization on entity embeddings
if self.reg_lambda > 0:
reg = (
self.entity_embedding(pos_heads).pow(2).sum() +
self.entity_embedding(pos_tails).pow(2).sum()
) / (2 * batch_size)
loss = loss + self.reg_lambda * reg
return loss
# ──────────────────────────────────────────────
# Dataset and Training
# ──────────────────────────────────────────────
class KGDataset(torch.utils.data.Dataset):
"""Simple KG dataset returning (head, relation, tail) triples."""
def __init__(self, triples: np.ndarray, n_entities: int, n_neg: int = 256):
self.triples = triples # [N, 3] - head, rel, tail
self.n_entities = n_entities
self.n_neg = n_neg
def __len__(self):
return len(self.triples)
def __getitem__(self, idx):
h, r, t = self.triples[idx]
# Generate negative samples by corrupting head or tail
neg_h = np.random.randint(0, self.n_entities, self.n_neg)
neg_t = np.random.randint(0, self.n_entities, self.n_neg)
# Alternate: corrupt head for half, tail for other half
mask = np.random.random(self.n_neg) < 0.5
neg_heads = np.where(mask, neg_h, h)
neg_tails = np.where(mask, t, neg_t)
return {
'pos_head': h, 'pos_rel': r, 'pos_tail': t,
'neg_heads': neg_heads,
'neg_rels': np.full(self.n_neg, r),
'neg_tails': neg_tails,
}
def train_rotate(
train_triples: np.ndarray,
n_entities: int,
n_relations: int,
embed_dim: int = 500,
gamma: float = 12.0,
epochs: int = 500,
batch_size: int = 1024,
lr: float = 1e-3,
n_neg: int = 256,
):
"""Train RotatE on FB15k-237 or WN18RR."""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = KGDataset(train_triples, n_entities, n_neg)
loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, num_workers=4
)
model = RotatE(
n_entities=n_entities,
n_relations=n_relations,
embed_dim=embed_dim,
gamma=gamma,
adversarial_temp=1.0,
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
for epoch in range(epochs):
model.train()
total_loss = 0.0
for batch in loader:
pos_h = batch['pos_head'].long().to(device)
pos_r = batch['pos_rel'].long().to(device)
pos_t = batch['pos_tail'].long().to(device)
neg_h = batch['neg_heads'].long().to(device)
neg_r = batch['neg_rels'].long().to(device)
neg_t = batch['neg_tails'].long().to(device)
loss = model(pos_h, pos_r, pos_t, neg_h, neg_r, neg_t)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
scheduler.step()
if epoch % 50 == 0:
print(f"Epoch {epoch:4d} Loss: {total_loss/len(loader):.4f}")
return model
# ──────────────────────────────────────────────
# Evaluation: MRR and Hits@k
# ──────────────────────────────────────────────
@torch.no_grad()
def evaluate_rotate(
model: RotatE,
test_triples: np.ndarray,
all_true_triples: set, # (h, r, t) tuples - for filtered evaluation
device: str = 'cpu',
batch_size: int = 512,
) -> dict:
"""
Evaluate RotatE with filtered MRR and Hits@k.
For each test triple (h, r, t), rank all entities as tail candidates.
"""
model.eval()
model = model.to(device)
n_entities = model.n_entities
all_ranks = []
for i in range(0, len(test_triples), batch_size):
batch = test_triples[i:i+batch_size]
for h, r, t in batch:
# Score all entities as tail
h_ids = torch.full((n_entities,), h, dtype=torch.long, device=device)
r_ids = torch.full((n_entities,), r, dtype=torch.long, device=device)
t_ids = torch.arange(n_entities, dtype=torch.long, device=device)
scores = model.score(h_ids, r_ids, t_ids).cpu().numpy() # [n_entities]
# Filter: mask out other known true tails
true_tail_score = scores[t]
for e in range(n_entities):
if e != t and (h, r, e) in all_true_triples:
scores[e] = -1e9
# Rank of true tail (1-indexed)
rank = int((scores > true_tail_score).sum()) + 1
all_ranks.append(rank)
ranks = np.array(all_ranks)
return {
'MRR': float(np.mean(1.0 / ranks)),
'Hits@1': float(np.mean(ranks <= 1)),
'Hits@3': float(np.mean(ranks <= 3)),
'Hits@10': float(np.mean(ranks <= 10)),
}
Knowledge-Enhanced Language Models
Pure language models learn world knowledge from text implicitly - expensive and unreliable for structured facts. Knowledge graph embeddings enable explicit knowledge injection into language models.
ERNIE (Tsinghua, 2019): integrates entity embeddings from a KG into BERT's token representations. When BERT processes "Paris," the ERNIE layer fuses the token representation with the entity embedding for the Paris entity in the KG. This gives the model access to structured knowledge about Paris (country, population, coordinates, famous landmarks) without requiring that knowledge to be memorized in the transformer weights.
KEPLER (2021): jointly trains KG embedding and language model objectives on the same model. The encoder produces entity representations, and a RotatE-style decoder is applied on top of these representations to predict KG triples. The model simultaneously learns to understand text and to complete the knowledge graph.
K-BERT (2020): injects structured KG triplets as additional tokens into the input sequence before feeding to BERT. If the input is "Paris is beautiful," K-BERT might expand it to "Paris [capital of] France [located in] Europe is beautiful" using the KG. This enriches the context window with structured knowledge.
Applications in Production
Google Knowledge Panel
When you search for a person, place, or concept, Google's Knowledge Panel shows structured facts: birth date, nationality, occupation, related entities. These are pulled from the Google Knowledge Graph (500B facts). Link prediction powers the "People also searched for" and entity disambiguation features - when the search query matches multiple entities, KG embeddings help identify the most likely intended entity.
Drug-Protein Interaction Prediction
Biomedical KGs (DrugBank, STRING, OMIM) encode known interactions between drugs, proteins, diseases, and genes. Missing interactions (drug-protein pairs that have not been experimentally tested) can be predicted by RotatE or ComplEx on the biomedical KG. This is used to prioritize which drug-protein pairs to test in expensive wet lab assays - reducing the experimental search space from millions of possible interactions to a manageable shortlist.
A concrete example: predict whether a drug molecule binds to a specific protein target by finding entities in embedding space that satisfy the triple pattern, even if that specific drug-protein interaction was never observed in training data.
Knowledge Graph Question Answering (KGQA)
Multi-hop KGQA requires following chains of reasoning: "Who is the spouse of the president who won the 2004 election?" requires knowing (Bush, winner of, 2004 election), then (Bush, spouse, Laura Bush). KG embeddings support this via relation composition: . Systems like NSM (Nested Search Module) and EmbedKGQA use KG embeddings as the core reasoning component.
Wikidata at Scale
Wikidata (100M+ entities, 1B+ statements) is the largest freely available structured knowledge base. Training RotatE on full Wikidata is non-trivial:
- Scale: entity embedding table alone is floats. At : 100B parameters just for entity embeddings. Not feasible for a single machine.
- Solution: distributed training across multiple machines with entity embeddings sharded across workers. Frameworks like PyKEEN support this via embedding sharding.
- Hierarchical relations: Wikidata has deep class hierarchies ("dog" is a "mammal" is an "animal"). TransE performs poorly on hierarchical relations because it cannot model the tree structure; hyperbolic embeddings (Poincaré) are better suited.
- Training subsets: in practice, researchers use Wikidata subsets filtered to the most common entity types and relation types (e.g., WikiKG90Mv2 benchmark: 90M entities, 1335 relations).
YouTube Resources
| Title | Channel | Focus |
|---|---|---|
| Knowledge Graph Embeddings (CS224W) | Stanford Online | TransE · RotatE · link prediction theory |
| RotatE Paper Explained | Yannic Kilcher | RotatE math · complex rotation intuition |
| Knowledge Graphs and Embeddings | UC Berkeley EECS | KG structure · completion tasks · benchmark analysis |
| CompGCN: Composition-based GNNs for KGs | Graph ML Community | CompGCN architecture · multi-relational convolution |
| Knowledge Graph Enhanced NLP | Weights and Biases | ERNIE · K-BERT · knowledge injection into LMs |
Common Pitfalls
:::danger Evaluating on FB15k (not FB15k-237) gives inflated results FB15k (the original Freebase benchmark from the TransE paper) has a critical flaw: many test triples can be trivially answered by inverting a training triple. If you see in training, the test triple is trivially answered by reversing. FB15k-237 removes this by filtering near-inverse triples. Always use FB15k-237 and WN18RR for fair comparisons. A model that beats state-of-the-art on FB15k may not improve at all on FB15k-237. :::
:::danger Not using the filtered evaluation setting In the filtered evaluation setting, when computing the rank of true tail for query , you mask out all other known true tails. Without filtering, your model gets penalized for predicting where is also a true triple - the model is correct but gets a bad rank. Always report filtered MRR and filtered Hits@k. Raw (unfiltered) metrics are meaningless for comparison. :::
:::warning TransE for symmetric relations always gives near-zero relation vectors If your KG has many symmetric relations and you use TransE, expect those relation vectors to collapse toward zero during training. The optimization forces and simultaneously, so . The model "learns" these relations but with useless embeddings. Check relation embedding norms during training - if any are near zero, you have a symmetry problem. Switch to RotatE or ComplEx for KGs with symmetric relations. :::
:::warning Large entity embedding tables require careful memory management At dimensions, storing entity embeddings for 100K entities requires bytes = 400 MB. For 1M entities: 4 GB just for entity embeddings. Standard practice: use 16-bit floats (half precision) to halve memory, use gradient checkpointing during backprop, and shard the embedding table across GPUs if needed. Many researchers underestimate this and run out of VRAM during training. :::
Interview Q&A
Q: Why does TransE fail for symmetric relations like "sibling of" or "married to"? Give the mathematical argument.
A: TransE models the constraint for valid triples. For a symmetric relation , both and must be true. TransE requires (from the first triple) and (from the second). Adding both constraints: , which simplifies to , forcing . A zero relation vector means the model cannot distinguish this relation from any other relation that also has , and the scoring function reduces to - completely independent of the relation type. For a KG with many siblings, everyone ends up with nearly identical embeddings. In practice, TransE achieves near-zero performance on symmetric relations while still performing reasonably on antisymmetric ones - you can diagnose this by checking per-relation type MRR in your evaluation.
Q: Explain the RotatE rotation intuition. Why does representing relations as rotations handle composition naturally?
A: RotatE embeds entities in and represents each relation as an element-wise rotation: (unit complex number). A valid triple satisfies - rotating by angle in each dimension lands on . Composition is natural because successive rotations compose by multiplication: if is true () and is true (), then - the composed relation is simply the element-wise product of the two rotation vectors. Inversion is modeled by the complex conjugate: if , then (rotating back by the same angle). Symmetry is when or (rotation by 0 or 180 degrees is its own inverse). This geometric framework unifies all four relation patterns in a single parameterization.
Q: What is the filtered evaluation setting for KG link prediction, and why does it matter?
A: In KG link prediction, for each test triple , you rank all entities by score and measure how high ranks. The problem: many other entities may also be valid answers - may be a true triple in the train or test set. In the unfiltered (raw) setting, if your model ranks a valid triple above the test triple , you get penalized - but the model is technically correct. The filtered setting removes all known true triples from the ranking before computing the rank of the test triple. This prevents unfair penalization and gives a cleaner measure of the model's ability to predict missing facts specifically. Always use filtered MRR and Hits@k for reporting and comparison. The gap between raw and filtered MRR can be substantial (often 0.05-0.15 MRR points) on dense KGs where many entities share the same relation to the head.
Q: How does CompGCN differ from simply running a GNN on a KG? What is the composition operation and why is it needed?
A: Standard GNNs (GCN, GAT) treat all edges as the same type. In a KG with 237 relation types, message passing that ignores the relation type would conflate "born in," "works at," and "founded by" - destroying the meaning of the multi-relational graph. CompGCN addresses this with a composition operation that combines the neighbor entity embedding with the relation embedding before aggregation: . The composition can be subtraction (, inspired by TransE), multiplication (, inspired by DistMult), or circular correlation. This allows different relation types to contribute different messages to the same target node. CompGCN also updates relation embeddings: , so relations are refined by the graph convolution just like entities. The KGE decoder (RotatE, ConvE, etc.) is then applied to the final entity embeddings - benefiting from both multi-hop relational context (from GNN layers) and geometric structure (from the KGE scoring function).
Q: Distinguish knowledge graph completion from knowledge graph construction. When would you use each?
A: Knowledge graph completion (KGC) assumes you have an existing KG and want to predict missing facts (link prediction, triple classification). You use KGE models (TransE, RotatE, CompGCN) trained on the existing triples. The model generalizes by finding entities whose embeddings satisfy the geometric constraints implied by the relation, even if the specific triple was never observed. Knowledge graph construction starts from scratch - extracting entities and relations from text, web pages, tables, or structured sources. Techniques include named entity recognition (NER), relation extraction (RE), entity linking, and coreference resolution. You use KGC after construction - once you have a KG from extraction, it will be incomplete, and KGC fills the gaps. In practice: build a KG from your domain corpus using NLP extraction pipelines, then train RotatE or CompGCN on the resulting triples to predict missing links. For biomedical KG construction from literature: use BioBERT-based RE to extract drug-protein interactions from abstracts, build the KG, then use RotatE to predict interactions that were never mentioned in any paper but are implied by the embedding geometry.
Q: How would you use knowledge graph embeddings in a production question answering system?
A: A practical KGQA pipeline has three stages. First, entity linking: identify entities in the user question and link them to KG entities. "Who directed the film that won Best Picture in 1994?" links "Best Picture 1994" to the Forrest Gump entity in the KG. Second, relation path finding: use the KG structure and relation embeddings to find relevant reasoning paths. For single-hop questions, retrieve entities connected by the queried relation. For multi-hop, compose relation embeddings: to score entities reachable in two steps. Third, answer ranking: score candidate answers using the KGE model and return the top-ranked entity. For production: precompute entity embeddings, store in FAISS or a vector database. At query time, compose query embedding ( for the inferred relation), retrieve nearest entities from FAISS, return top result. Key engineering challenge: relation inference from natural language is hard - the model must map "directed" to the director_of relation, "won Best Picture in 1994" to a specific entity, and combine them correctly. Entity disambiguation and relation extraction quality are the bottlenecks, not the KGE scoring function.
Beyond RotatE - Recent Advances
Temporal Knowledge Graph Embeddings
Real-world facts have timestamps: . Standard KGE models treat all facts as time-independent. Temporal KGE extends the scoring function to include time:
where is a time-dependent relation embedding. ATiSE uses Gaussian distributions over time to model temporal uncertainty. TComplEx extends ComplEx to 4D tensors with a time dimension. For biomedical KGs where drug approvals, clinical trial results, and disease associations have timestamps, temporal KGE dramatically improves prediction of future interactions.
Hyperbolic Knowledge Graph Embeddings
Many KG relations are hierarchical (is-a, subclass-of, part-of). Euclidean space distorts hierarchical structure - representing a tree in requires exponential space. Hyperbolic space (Poincaré ball model) grows exponentially with radius, matching the exponential branching of trees.
MuRP (2019): extends TransE to hyperbolic space. Entities near the origin represent general concepts; entities near the boundary represent specific instances. The translation operation becomes Möbius addition in the Poincaré ball:
where is Möbius addition and is the Poincaré distance. MuRP achieves state-of-the-art on WN18RR (which has deep "hypernym" hierarchies) with significantly lower embedding dimension than Euclidean models.
Reasoning Chains - Multi-Hop Queries
Single-hop link prediction answers . Real questions require multi-hop reasoning: "What is the nationality of the spouse of the person who founded Apple?" requires following a chain , then .
Query2Box (2020): represents conjunctive logical queries as boxes in embedding space - the answer entities should be inside the box. Set intersection is modeled as box intersection. BetaE (2020): represents queries as Beta distributions, enabling reasoning under uncertainty.
These methods extend KGE from simple triple completion to general multi-hop query answering over incomplete KGs.
KGE Model Selection Guide
Quick selection rules:
- Default choice for most KGs with mixed relation types: RotatE
- Deep taxonomic hierarchies (WordNet-style): MuRP or RotH
- KG with temporal data: TComplEx or ATiSE
- Need multi-hop reasoning, not just link prediction: Query2Box or BetaE
- Need GNN-style context aggregation: CompGCN with RotatE decoder
- Very large KG (100M+ entities) with memory constraints: DistMult (4x cheaper than RotatE) or RotatE with half-precision embeddings and sharded training
Production Engineering Notes
Serving KGE at Scale
Entity embedding tables for production KGs are large. For a KG with 5M entities at (RotatE stores floats per entity): bytes = 20 GB. This does not fit in a single GPU's memory.
Embedding sharding: partition entity embeddings across multiple machines. During training, mini-batches that reference entities on different shards trigger network communication. Libraries like PyKEEN and DGL-KE support multi-GPU training with embedding sharding.
Approximate nearest neighbor retrieval: for queries at inference, you need to find the entity that maximizes . Brute force over all 5M entities is slow. Since RotatE's scoring function is approximately , the query vector is and you want the nearest entity to in embedding space. Use FAISS with the query vector and retrieve top-K candidates, then rerank with the exact RotatE score.
Caching hot entities: in practice, a small fraction of entities (popular countries, common occupations, frequent drug targets) appear in the vast majority of queries. Cache their embeddings in fast memory (Redis, in-process dict) for zero-latency retrieval.
import faiss
import numpy as np
def build_kge_faiss_index(entity_embeddings: np.ndarray, use_gpu: bool = True):
"""
Build FAISS index over entity embeddings for fast KGE retrieval.
entity_embeddings: [n_entities, 2*embed_dim] (RotatE real + imaginary)
"""
d = entity_embeddings.shape[1]
# Normalize for cosine similarity (RotatE score ~ -L2 distance in rotated space)
faiss.normalize_L2(entity_embeddings)
# IVF index for large entity sets
n_clusters = min(int(np.sqrt(len(entity_embeddings))), 4096)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, n_clusters, faiss.METRIC_L2)
if use_gpu:
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, index)
index.train(entity_embeddings)
index.add(entity_embeddings)
index.nprobe = 64 # search 64 clusters per query (tradeoff: speed vs recall)
return index
def kge_link_prediction_faiss(
model: RotatE,
head_id: int,
rel_id: int,
faiss_index,
top_k: int = 10,
device: str = 'cpu',
):
"""
Fast link prediction: (head, rel, ?) using FAISS ANN.
1. Compute query = h ∘ r (RotatE rotation)
2. Search FAISS for nearest entity embeddings
3. Rerank top-K with exact RotatE score
"""
model.eval()
with torch.no_grad():
h_id = torch.tensor([head_id], device=device)
r_id = torch.tensor([rel_id], device=device)
# Get real/imaginary parts of head
h_re, h_im = model._get_complex_embeddings(h_id)
cos_r, sin_r = model._get_relation_rotation(r_id)
# Rotated query vector: h ∘ r
q_re = (h_re * cos_r - h_im * sin_r).cpu().numpy() # [1, d]
q_im = (h_re * sin_r + h_im * cos_r).cpu().numpy() # [1, d]
query = np.concatenate([q_re, q_im], axis=-1) # [1, 2d]
faiss.normalize_L2(query)
# ANN search
distances, candidate_ids = faiss_index.search(query, top_k * 5)
candidates = candidate_ids[0]
# Exact rerank with RotatE score
with torch.no_grad():
h_ids = torch.tensor([head_id] * len(candidates), device=device)
r_ids = torch.tensor([rel_id] * len(candidates), device=device)
t_ids = torch.tensor(candidates.tolist(), device=device)
scores = model.score(h_ids, r_ids, t_ids).cpu().numpy()
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
return [(entity_id, score) for entity_id, score in ranked[:top_k]]
Summary
Knowledge graph embeddings solve a fundamental problem: real-world knowledge graphs contain hundreds of billions of facts but are fundamentally incomplete. By embedding entities and relations in geometric spaces where valid triples satisfy measurable constraints, KGE models predict missing facts with high accuracy.
TransE established the translation-based paradigm but fails on symmetric and 1-N relations. ComplEx extended the approach to complex-valued embeddings, handling asymmetric relations. RotatE unified all four relational patterns - symmetry, antisymmetry, inversion, and composition - by treating relations as rotations in complex space. Self-adversarial negative sampling further improved training by focusing on the model's hardest mistakes.
CompGCN combined the best of both worlds: GNN-based multi-hop context aggregation over the multi-relational graph, feeding into a KGE decoder for triple scoring. This architecture achieves state-of-the-art on FB15k-237 by leveraging both local relational structure and the global embedding geometry.
Beyond the core models, the field continues to evolve: temporal KGE handles time-stamped facts, hyperbolic embeddings handle deep hierarchical relations, and multi-hop query embedding methods (Query2Box, BetaE) extend KGE from simple triple completion to general logical query answering over incomplete knowledge graphs.
In production, KGE models power Google's Knowledge Graph completions, biomedical drug-protein interaction prediction, and knowledge-enhanced language models. The combination of explicit structured knowledge from KGE with implicit parametric knowledge from language models is an active research frontier - one of the most promising paths toward AI systems that reason reliably about structured world knowledge.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Embedding Space Explorer demo on the EngineersOfAI Playground - no code required.
:::
