Skip to main content

Knowledge Graph Embeddings

The Real Interview Moment

"Google's Knowledge Graph has 500 billion facts about 5 billion entities. Billions of potential triples are missing. How do you predict which facts are true without running another web crawl?"

This is exactly the problem Antoine Bordes and colleagues at Facebook AI Research tackled in 2013. Their insight: if entities and relations can be embedded in a vector space such that valid triples (h,r,t)(h, r, t) satisfy a geometric constraint, then you can score any potential triple by checking whether it satisfies that constraint. Missing facts can be predicted by finding triples whose embeddings satisfy the constraint even though the triple was never observed.

The 2013 paper Translating Embeddings for Modeling Multi-relational Data (TransE) showed this worked. If you represent each entity as a point in Rd\mathbb{R}^d and each relation as a translation vector, then (h,r,t)(h, r, t) being true means h+rth + r \approx t - the head entity translated by the relation vector should land on the tail entity. A strikingly simple idea. TransE could predict missing facts with MRR scores that blew away all previous approaches on benchmark datasets.

The broader impact extended far beyond link prediction. Knowledge graph embeddings became the backbone of how modern AI systems reason about structured world knowledge - powering Google's Knowledge Panel, question answering systems, drug-protein interaction prediction, and knowledge-enhanced language models like ERNIE and KEPLER. When GPT-3 "knows" that Paris is the capital of France, it learned this from text. But for structured factual reasoning at scale, explicit knowledge graph embeddings are still more reliable, more interpretable, and more efficient than neural language model inference.

This lesson builds from the geometric intuition of TransE through the complex-valued rotations of RotatE, and into CompGCN - which marries knowledge graph embeddings with graph neural networks for multi-relational graph convolution.


Why This Exists - Knowledge Graphs Are Incomplete

Freebase, Wikidata, DBpedia, Google Knowledge Graph - all are vastly incomplete despite containing hundreds of millions to hundreds of billions of facts. Studies estimate:

  • Freebase (2013): 71% of people in the KG had no known place of birth
  • Wikidata (2022): roughly 40% of entities lack basic properties
  • Drug-protein interaction KGs: fewer than 10% of known drug-protein pairs have been experimentally validated

Manual completion is intractable. Web-based information extraction is noisy. Knowledge graph completion (KGC) - predicting missing triples from the existing graph structure - is a cleaner, more principled approach.

The core insight making this possible: real-world KGs are not random. They have patterns:

  • Symmetry: if person A is a sibling of person B, then B is a sibling of A
  • Antisymmetry: if A is the parent of B, then B is not the parent of A
  • Inversion: "born in" is the inverse of "birthplace of"
  • Composition: if A is the father of B and B is the father of C, then A is the grandfather of C
  • 1-N relations: one person can have many nationalities
  • N-1 relations: many people share one birthplace

A good embedding method must capture all these relational patterns. As we will see, TransE is good at some and terrible at others, which motivated the evolution toward RotatE and ComplEx.


Historical Context

YearModelKey Innovation
2011RESCALBilinear model, relation as full matrix
2013TransERelation as translation vector h+rth + r \approx t
2014TransRRelation-specific projection: hMr+rtMrhM_r + r \approx tM_r
2014TransHProjection to hyperplane for 1-N/N-1 relations
2015DistMultRelation as diagonal matrix, symmetric only
2016ComplExComplex-valued embeddings, handles asymmetry
2018ConvE2D convolution over entity-relation feature maps
2019RotatERelation as rotation in complex space, t=hrt = h \circ r
2020CompGCNGNN for multi-relational graphs, combines GNN + KGE

TransE (Bordes et al., NeurIPS 2013) is the foundational paper that defined the field. RotatE (Sun et al., ICLR 2019) is the current standard for geometric KG embedding. CompGCN (Vashishth et al., ICLR 2020) is the state-of-the-art approach combining GNNs with KG embeddings.


Knowledge Graph Structure

A knowledge graph G=(E,R,T)\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T}) consists of:

  • E\mathcal{E}: set of entities (nodes) - people, places, concepts, molecules, etc.
  • R\mathcal{R}: set of relations (edge types) - "born in," "works at," "is a subclass of," etc.
  • TE×R×E\mathcal{T} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}: set of triples (facts) - (h,r,t)(h, r, t) means "head entity hh has relation rr to tail entity tt"

Examples:

  • (Paris,capitalOf,France)(Paris, \text{capitalOf}, France)
  • (Ibuprofen,treatsCondition,Inflammation)(Ibuprofen, \text{treatsCondition}, Inflammation)
  • (BERT,developedBy,Google)(BERT, \text{developedBy}, Google)
  • (Albert Einstein,bornIn,Ulm)(Albert\ Einstein, \text{bornIn}, Ulm)

Link prediction task: given a query (h,r,?)(h, r, ?) or (?,r,t)(?, r, t), predict the missing entity from E\mathcal{E}. This is evaluated by ranking all entities E\mathcal{E} by a score function f(h,r,e)f(h, r, e) and measuring how high the true answer ranks.


TransE - Relation as Translation

The Core Idea

TransE (Bordes et al., 2013) embeds entities as vectors h,tRdh, t \in \mathbb{R}^d and relations as translation vectors rRdr \in \mathbb{R}^d. The constraint is:

h+rtfor valid triples (h,r,t)h + r \approx t \quad \text{for valid triples } (h, r, t)

The scoring function is the negative distance:

f(h,r,t)=h+rtp(p=1 or 2)f(h, r, t) = -\|h + r - t\|_p \quad (p = 1 \text{ or } 2)

Higher score (less negative) = more likely to be a true triple.

Geometric Intuition

Think of entities as points in 2D space. The relation "capital of" is a translation arrow. If you start at "Paris" and follow the "capital of" arrow, you should land near "France". If you start at "Berlin" and follow the same "capital of" arrow, you should land near "Germany". Different head entities, same relation arrow, different tail entities - the arrow represents the semantic direction of the relation.

Training with Margin Loss

TransE trains with a margin-based ranking loss:

L=(h,r,t)T(h,r,t)T[γ+f(h,r,t)f(h,r,t)]+\mathcal{L} = \sum_{(h,r,t) \in \mathcal{T}} \sum_{(h',r,t') \in \mathcal{T}^-} \left[\gamma + f(h,r,t) - f(h',r,t')\right]_+

where []+=max(0,)[\cdot]_+ = \max(0, \cdot), γ>0\gamma > 0 is the margin, and T\mathcal{T}^- are corrupted (negative) triples generated by replacing either the head or tail entity with a random entity from E\mathcal{E}.

Where TransE Fails

TransE cannot model several important relation patterns:

Symmetric relations: if (A,sibling,B)(A, \text{sibling}, B) is true then (B,sibling,A)(B, \text{sibling}, A) should also be true. TransE requires A+r=BA + r = B and B+r=AB + r = A simultaneously, which implies r=0r = 0, collapsing the relation to a trivial zero vector.

1-N relations: if (Albert Einstein,nationality,Germany)(Albert\ Einstein, \text{nationality}, Germany) and (Albert Einstein,nationality,Switzerland)(Albert\ Einstein, \text{nationality}, Switzerland) are both true, TransE requires h+rt1h + r \approx t_1 and h+rt2h + r \approx t_2, forcing GermanySwitzerlandGermany \approx Switzerland in embedding space.

N-1 relations: similarly, many people born in Paris forces all their embeddings to cluster together - conflating different entities.

These failures motivated TransR, TransH, DistMult, ComplEx, and ultimately RotatE.


TransR and TransH - Overcoming 1-N Limitations

TransR

TransR introduces a relation-specific projection matrix MrRk×dM_r \in \mathbb{R}^{k \times d} that maps entity embeddings from entity space (Rd\mathbb{R}^d) into relation space (Rk\mathbb{R}^k):

hr=hMr,tr=tMrh_r = h M_r, \quad t_r = t M_r f(h,r,t)=hr+rtr2f(h, r, t) = -\|h_r + r - t_r\|^2

Different relations project entities into different subspaces. A 1-N relation projects many head entities (h1,h2,h_1, h_2, \ldots) to the same region in relation space, allowing them to have the same tail without forcing them to be identical in entity space.

TransH

TransH projects entities onto a relation-specific hyperplane wrw_r (unit normal vector) before applying translation:

h=hwrThwr(projection of h onto hyperplane)h_\perp = h - w_r^T h \cdot w_r \quad (\text{projection of } h \text{ onto hyperplane}) f(h,r,t)=h+drt2f(h, r, t) = -\|h_\perp + d_r - t_\perp\|^2

Both TransR and TransH improve over TransE on 1-N and N-1 relations but introduce significantly more parameters and computational cost.


DistMult and ComplEx - Bilinear Models

DistMult

DistMult (Yang et al., 2015) uses a bilinear scoring function with a diagonal relation matrix:

f(h,r,t)=hTdiag(r)t=ihiritif(h, r, t) = h^T \text{diag}(r)\, t = \sum_i h_i \cdot r_i \cdot t_i

Each relation is a vector of scaling factors. Efficient and effective, but symmetric by construction: f(h,r,t)=f(t,r,h)f(h, r, t) = f(t, r, h) always. Cannot model antisymmetric relations (parent-of, born-before, etc.).

ComplEx - Complex-Valued Embeddings

Trouillon et al. (2016) extended DistMult to complex numbers. Entities and relations are embedded as h,r,tCdh, r, t \in \mathbb{C}^d. The scoring function uses the real part of the Hermitian dot product:

f(h,r,t)=Re ⁣(hTdiag(r)tˉ)=Re ⁣(ihiritˉi)f(h, r, t) = \text{Re}\!\left(h^T \text{diag}(r)\, \bar{t}\right) = \text{Re}\!\left(\sum_i h_i \cdot r_i \cdot \bar{t}_i\right)

where tˉ\bar{t} is the complex conjugate of tt. The asymmetry arises because hTdiag(r)tˉtTdiag(r)hˉh^T \text{diag}(r) \bar{t} \neq t^T \text{diag}(r) \bar{h} in general - using the conjugate breaks the symmetry constraint of DistMult.

Key insight: complex embeddings have twice the expressive power for the same embedding dimension. A dd-dimensional complex embedding corresponds to 2d2d real parameters, but the complex structure enforces a specific coupling between the paired real dimensions that encodes symmetry/antisymmetry naturally.


RotatE - Relation as Rotation in Complex Space

Sun et al. (2019) unified the treatment of relational patterns by representing each relation as an element-wise rotation in complex space.

The Core Equation

Entities are embedded as h,tCdh, t \in \mathbb{C}^d. Relations are embedded as rCdr \in \mathbb{C}^d with the constraint that each component has unit modulus: ri=1|r_i| = 1 for all ii. This means ri=eiθr,ir_i = e^{i\theta_{r,i}} for some angle θr,i\theta_{r,i}.

The scoring function is:

f(h,r,t)=hrtf(h, r, t) = -\|h \circ r - t\|

where \circ denotes element-wise (Hadamard) product in Cd\mathbb{C}^d. The constraint ri=1|r_i| = 1 ensures rr represents a pure rotation in the 2D complex plane for each dimension.

Geometric Intuition

In each dimension ii, the operation hirih_i \cdot r_i rotates the complex number hih_i by angle θr,i\theta_{r,i}. The relation rr defines a sequence of dd rotations - one per dimension. A valid triple (h,r,t)(h, r, t) means: rotating hh by the relation's rotation pattern should produce tt.

Why RotatE Handles All Relational Patterns

Symmetry (r(h,t)r(t,h)r(h,t) \Rightarrow r(t,h)): when θr,i=0\theta_{r,i} = 0 or π\pi for all ii, the rotation is its own inverse. hr=th \circ r = t and tr=ht \circ r = h - both are true.

Antisymmetry (r(h,t)¬r(t,h)r(h,t) \Rightarrow \neg r(t,h)): when θr,i0,π\theta_{r,i} \neq 0, \pi, rotating hh by θr,i\theta_{r,i} gives tt, but rotating tt by θr,i\theta_{r,i} gives something different. The rotation direction enforces antisymmetry.

Inversion (r1(h,t)r2(t,h)r_1(h,t) \Rightarrow r_2(t,h)): if r2=r1ˉr_2 = \bar{r_1} (complex conjugate - reverse rotation), then hr1=th \circ r_1 = t implies tr2=ht \circ r_2 = h. Inversion is modeled by conjugate relation embeddings.

Composition (r1(h,m)r2(m,t)r3(h,t)r_1(h,m) \wedge r_2(m,t) \Rightarrow r_3(h,t)): if hr1=mh \circ r_1 = m and mr2=tm \circ r_2 = t, then h(r1r2)=th \circ (r_1 \circ r_2) = t. Composition is modeled by element-wise product of relation vectors - rotation composition.

This unified treatment of all four relational patterns is what makes RotatE superior to all previous geometric models. TransE can handle antisymmetry and composition but fails on symmetry and inversion. ComplEx handles symmetry, antisymmetry, and inversion but not composition (due to the non-associativity of its scoring function). RotatE handles all four.


Self-Adversarial Negative Sampling

Standard negative sampling (replace head or tail with a random entity) produces mostly easy negatives. RotatE introduced self-adversarial negative sampling: sample negatives with probability proportional to their current model score:

p(hj,r,tj)=exp(αf(hj,r,tj))jexp(αf(hj,r,tj))p(h'_j, r, t'_j) = \frac{\exp\left(\alpha\, f(h'_j, r, t'_j)\right)}{\sum_j \exp\left(\alpha\, f(h'_j, r, t'_j)\right)}

where α\alpha is a temperature parameter. High-scoring negatives (the model's current mistakes) are sampled with higher probability. This creates harder negatives as training progresses, dramatically improving convergence and final accuracy.

The RotatE loss with self-adversarial negative sampling:

L=lnσ(γf(h,r,t))jp(hj,r,tj)lnσ ⁣(f(hj,r,tj)γ)\mathcal{L} = -\ln\sigma(\gamma - f(h, r, t)) - \sum_j p(h'_j, r, t'_j)\, \ln\sigma\!\left(f(h'_j, r, t'_j) - \gamma\right)

where γ\gamma is the margin.


ConvE - 2D Convolution for KG Completion

Dettmers et al. (2018) took a radically different approach: instead of geometric operations, use 2D convolution over entity-relation feature maps.

The scoring function:

  1. Reshape and concatenate hh and rr into a 2D feature map
  2. Apply a bank of 2D convolutional filters
  3. Flatten and apply a linear layer
  4. Dot product with tail entity tt

f(h,r,t)=σ ⁣(vec ⁣(f ⁣([hˉ;rˉ]ω))W)tf(h, r, t) = \sigma\!\left(\text{vec}\!\left(f\!\left([\bar{h};\bar{r}] * \omega\right)\right) W\right) t

ConvE is highly parameter-efficient - it achieves strong results with far fewer parameters than TransE or RotatE because convolution creates feature interactions across multiple embedding dimensions simultaneously. However, it lacks the clean geometric interpretation of RotatE.


Evaluation Metrics

Standard evaluation: for each test triple (h,r,t)(h, r, t), corrupt it by replacing the tail entity with every entity in E\mathcal{E}. Rank all entities by score. Report:

Mean Reciprocal Rank (MRR): MRR=1Si=1S1ranki\text{MRR} = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{\text{rank}_i}

High MRR means true answers consistently rank near the top. MRR is dominated by high-rank answers.

Hits@k: fraction of queries where the true answer ranks in the top-kk: Hits@k=1Si=1S1[rankik]\text{Hits@k} = \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbf{1}[\text{rank}_i \leq k]

Standard values: Hits@1, Hits@3, Hits@10.

Filtered setting: when computing the rank of the true tail, remove all other known true tails from the ranking. This prevents penalizing a model for ranking another valid triple above the test triple.


Benchmark Results - FB15k-237 and WN18RR

FB15k-237 (Freebase subset, 237 relation types, 310K triples) and WN18RR (WordNet, 11 relation types, 93K triples) are the standard benchmarks. WN18RR is dominated by hierarchical relations (hypernym, hyponym, member meronym). FB15k-237 has diverse relation patterns.

ModelFB15k-237 MRRFB15k-237 Hits@10WN18RR MRRWN18RR Hits@10
TransE0.2940.4650.2260.501
DistMult0.2410.4190.4300.490
ComplEx0.2470.4280.4400.510
ConvE0.3250.5010.4300.520
RotatE0.3380.5330.4760.571
CompGCN0.3550.5350.4790.546

RotatE substantially outperforms earlier geometric models on both benchmarks. CompGCN improves further on FB15k-237 by incorporating multi-hop neighborhood structure through graph convolution.


CompGCN - GNNs for Multi-Relational Graphs

Vashishth et al. (2020) introduced CompGCN, which combines graph neural networks with knowledge graph embeddings for multi-relational graph convolution.

The Challenge

Standard GNNs (GCN, GraphSAGE) treat all edges as the same type. KGs have hundreds of relation types - edges are typed, and the relation type fundamentally changes the meaning of a message. You cannot apply the same aggregation to "born in" and "works at" edges as if they were identical.

CompGCN Approach

CompGCN maintains both entity embeddings hvRd\mathbf{h}_v \in \mathbb{R}^d and relation embeddings rRd\mathbf{r} \in \mathbb{R}^d. Messages incorporate the relation type via a composition operation ϕ\phi:

hv(k+1)=f ⁣(W0hv(k)+(u,r)N(v)Wλϕ ⁣(hu(k),r(k)))\mathbf{h}_v^{(k+1)} = f\!\left(W_0\, \mathbf{h}_v^{(k)} + \sum_{(u,r) \in \mathcal{N}(v)} W_\lambda\, \phi\!\left(\mathbf{h}_u^{(k)},\, \mathbf{r}^{(k)}\right)\right)

The composition operation ϕ\phi can be:

  • Subtraction: ϕ(e,r)=er\phi(e, r) = e - r (TransE-inspired)
  • Multiplication: ϕ(e,r)=er\phi(e, r) = e * r (DistMult-inspired)
  • Circular correlation: ϕ(e,r)=er\phi(e, r) = e \star r (ConvE-inspired)

Relation embeddings are updated alongside entity embeddings: r(k+1)=Wrelr(k)\mathbf{r}^{(k+1)} = W_{rel}\, \mathbf{r}^{(k)}

CompGCN Architecture

CompGCN's key insight: pure KGE methods (TransE, RotatE) only look at the direct triple (h,r,t)(h, r, t). They miss the multi-hop relational context - the fact that entity AA participates in 50 different relation types with 200 other entities. GNNs capture this context. CompGCN combines both: GNN layers propagate multi-hop context, KGE decoder exploits geometric structure of the final embeddings.


Full PyTorch RotatE Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Optional


class RotatE(nn.Module):
"""
RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.
Sun et al., ICLR 2019.

Entities: h, t ∈ C^d (stored as real vectors of size 2d)
Relations: r ∈ C^d with |r_i| = 1 (phase angles θ_r, size d)
Score: f(h, r, t) = -‖h ∘ r - t‖

Implementation stores:
- entity embeddings: real+imaginary parts, shape [n_entities, 2*embed_dim]
- relation embeddings: phase angles θ, shape [n_relations, embed_dim]
→ r_i = exp(i·θ_i) = (cos θ_i, sin θ_i)
"""

def __init__(
self,
n_entities: int,
n_relations: int,
embed_dim: int = 500,
gamma: float = 12.0, # margin
adversarial_temp: float = 1.0,
reg_lambda: float = 0.0,
):
super().__init__()
self.n_entities = n_entities
self.n_relations = n_relations
self.embed_dim = embed_dim
self.gamma = gamma
self.adversarial_temp = adversarial_temp
self.reg_lambda = reg_lambda

# Entity embeddings: real and imaginary parts
self.entity_embedding = nn.Embedding(n_entities, 2 * embed_dim)

# Relation embeddings: phase angles (no unit norm constraint needed -
# we use sin/cos to enforce it implicitly)
self.relation_embedding = nn.Embedding(n_relations, embed_dim)

self._init_weights()

def _init_weights(self):
"""Initialize uniformly in [-1, 1] for entities, [0, 2π] for relations."""
nn.init.uniform_(self.entity_embedding.weight, -1.0, 1.0)
nn.init.uniform_(self.relation_embedding.weight, 0, 2 * np.pi)

def _get_complex_embeddings(self, entity_ids: torch.Tensor):
"""
Returns (real, imag) parts of entity embeddings.
entity_ids: [B]
Returns: real [B, d], imag [B, d]
"""
emb = self.entity_embedding(entity_ids) # [B, 2d]
real = emb[:, :self.embed_dim] # [B, d]
imag = emb[:, self.embed_dim:] # [B, d]
return real, imag

def _get_relation_rotation(self, relation_ids: torch.Tensor):
"""
Returns (cos θ, sin θ) for each relation dimension.
Enforces |r_i| = 1 by construction.
relation_ids: [B]
Returns: cos_r [B, d], sin_r [B, d]
"""
theta = self.relation_embedding(relation_ids) # [B, d]
cos_r = torch.cos(theta) # [B, d]
sin_r = torch.sin(theta) # [B, d]
return cos_r, sin_r

def score(
self,
head_ids: torch.Tensor,
rel_ids: torch.Tensor,
tail_ids: torch.Tensor,
) -> torch.Tensor:
"""
Compute RotatE score f(h, r, t) = -‖h ∘ r - t‖.
Positive triples get high (less negative) scores.

Returns: [B] scores
"""
h_re, h_im = self._get_complex_embeddings(head_ids)
t_re, t_im = self._get_complex_embeddings(tail_ids)
cos_r, sin_r = self._get_relation_rotation(rel_ids)

# Complex multiplication: (h_re + i·h_im) * (cos_r + i·sin_r)
# = (h_re·cos_r - h_im·sin_r) + i·(h_re·sin_r + h_im·cos_r)
rot_re = h_re * cos_r - h_im * sin_r # [B, d]
rot_im = h_re * sin_r + h_im * cos_r # [B, d]

# Distance: ‖h ∘ r - t‖
diff_re = rot_re - t_re # [B, d]
diff_im = rot_im - t_im # [B, d]

# L2 norm of complex difference
norm = torch.sqrt(diff_re.pow(2) + diff_im.pow(2) + 1e-9) # [B, d]
return -norm.sum(dim=-1) # [B]

def forward(
self,
pos_heads: torch.Tensor,
pos_rels: torch.Tensor,
pos_tails: torch.Tensor,
neg_heads: torch.Tensor,
neg_rels: torch.Tensor,
neg_tails: torch.Tensor,
) -> torch.Tensor:
"""
Compute RotatE loss with self-adversarial negative sampling.

pos_*: [B] - positive triple components
neg_*: [B, K] - K negative samples per positive triple
Returns: scalar loss
"""
batch_size = pos_heads.size(0)
n_neg = neg_heads.size(1)

# Positive scores
pos_scores = self.score(pos_heads, pos_rels, pos_tails) # [B]

# Negative scores: reshape to [B*K] for batch scoring
neg_h = neg_heads.view(-1)
neg_r = neg_rels.view(-1)
neg_t = neg_tails.view(-1)
neg_scores = self.score(neg_h, neg_r, neg_t).view(batch_size, n_neg) # [B, K]

# Self-adversarial weights: p ∝ exp(α·score)
with torch.no_grad():
neg_weights = F.softmax(
self.adversarial_temp * neg_scores, dim=-1
) # [B, K]

# RotatE loss (equation from paper)
pos_loss = -F.logsigmoid(self.gamma + pos_scores).mean()
neg_loss = -(neg_weights * F.logsigmoid(
-(self.gamma + neg_scores)
)).sum(dim=-1).mean()

loss = (pos_loss + neg_loss) / 2

# L2 regularization on entity embeddings
if self.reg_lambda > 0:
reg = (
self.entity_embedding(pos_heads).pow(2).sum() +
self.entity_embedding(pos_tails).pow(2).sum()
) / (2 * batch_size)
loss = loss + self.reg_lambda * reg

return loss


# ──────────────────────────────────────────────
# Dataset and Training
# ──────────────────────────────────────────────

class KGDataset(torch.utils.data.Dataset):
"""Simple KG dataset returning (head, relation, tail) triples."""

def __init__(self, triples: np.ndarray, n_entities: int, n_neg: int = 256):
self.triples = triples # [N, 3] - head, rel, tail
self.n_entities = n_entities
self.n_neg = n_neg

def __len__(self):
return len(self.triples)

def __getitem__(self, idx):
h, r, t = self.triples[idx]
# Generate negative samples by corrupting head or tail
neg_h = np.random.randint(0, self.n_entities, self.n_neg)
neg_t = np.random.randint(0, self.n_entities, self.n_neg)
# Alternate: corrupt head for half, tail for other half
mask = np.random.random(self.n_neg) < 0.5
neg_heads = np.where(mask, neg_h, h)
neg_tails = np.where(mask, t, neg_t)
return {
'pos_head': h, 'pos_rel': r, 'pos_tail': t,
'neg_heads': neg_heads,
'neg_rels': np.full(self.n_neg, r),
'neg_tails': neg_tails,
}


def train_rotate(
train_triples: np.ndarray,
n_entities: int,
n_relations: int,
embed_dim: int = 500,
gamma: float = 12.0,
epochs: int = 500,
batch_size: int = 1024,
lr: float = 1e-3,
n_neg: int = 256,
):
"""Train RotatE on FB15k-237 or WN18RR."""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

dataset = KGDataset(train_triples, n_entities, n_neg)
loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, num_workers=4
)

model = RotatE(
n_entities=n_entities,
n_relations=n_relations,
embed_dim=embed_dim,
gamma=gamma,
adversarial_temp=1.0,
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(epochs):
model.train()
total_loss = 0.0
for batch in loader:
pos_h = batch['pos_head'].long().to(device)
pos_r = batch['pos_rel'].long().to(device)
pos_t = batch['pos_tail'].long().to(device)
neg_h = batch['neg_heads'].long().to(device)
neg_r = batch['neg_rels'].long().to(device)
neg_t = batch['neg_tails'].long().to(device)

loss = model(pos_h, pos_r, pos_t, neg_h, neg_r, neg_t)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()

scheduler.step()
if epoch % 50 == 0:
print(f"Epoch {epoch:4d} Loss: {total_loss/len(loader):.4f}")

return model


# ──────────────────────────────────────────────
# Evaluation: MRR and Hits@k
# ──────────────────────────────────────────────

@torch.no_grad()
def evaluate_rotate(
model: RotatE,
test_triples: np.ndarray,
all_true_triples: set, # (h, r, t) tuples - for filtered evaluation
device: str = 'cpu',
batch_size: int = 512,
) -> dict:
"""
Evaluate RotatE with filtered MRR and Hits@k.
For each test triple (h, r, t), rank all entities as tail candidates.
"""
model.eval()
model = model.to(device)

n_entities = model.n_entities
all_ranks = []

for i in range(0, len(test_triples), batch_size):
batch = test_triples[i:i+batch_size]
for h, r, t in batch:
# Score all entities as tail
h_ids = torch.full((n_entities,), h, dtype=torch.long, device=device)
r_ids = torch.full((n_entities,), r, dtype=torch.long, device=device)
t_ids = torch.arange(n_entities, dtype=torch.long, device=device)

scores = model.score(h_ids, r_ids, t_ids).cpu().numpy() # [n_entities]

# Filter: mask out other known true tails
true_tail_score = scores[t]
for e in range(n_entities):
if e != t and (h, r, e) in all_true_triples:
scores[e] = -1e9

# Rank of true tail (1-indexed)
rank = int((scores > true_tail_score).sum()) + 1
all_ranks.append(rank)

ranks = np.array(all_ranks)
return {
'MRR': float(np.mean(1.0 / ranks)),
'Hits@1': float(np.mean(ranks <= 1)),
'Hits@3': float(np.mean(ranks <= 3)),
'Hits@10': float(np.mean(ranks <= 10)),
}

Knowledge-Enhanced Language Models

Pure language models learn world knowledge from text implicitly - expensive and unreliable for structured facts. Knowledge graph embeddings enable explicit knowledge injection into language models.

ERNIE (Tsinghua, 2019): integrates entity embeddings from a KG into BERT's token representations. When BERT processes "Paris," the ERNIE layer fuses the token representation with the entity embedding for the Paris entity in the KG. This gives the model access to structured knowledge about Paris (country, population, coordinates, famous landmarks) without requiring that knowledge to be memorized in the transformer weights.

KEPLER (2021): jointly trains KG embedding and language model objectives on the same model. The encoder produces entity representations, and a RotatE-style decoder is applied on top of these representations to predict KG triples. The model simultaneously learns to understand text and to complete the knowledge graph.

K-BERT (2020): injects structured KG triplets as additional tokens into the input sequence before feeding to BERT. If the input is "Paris is beautiful," K-BERT might expand it to "Paris [capital of] France [located in] Europe is beautiful" using the KG. This enriches the context window with structured knowledge.


Applications in Production

Google Knowledge Panel

When you search for a person, place, or concept, Google's Knowledge Panel shows structured facts: birth date, nationality, occupation, related entities. These are pulled from the Google Knowledge Graph (500B facts). Link prediction powers the "People also searched for" and entity disambiguation features - when the search query matches multiple entities, KG embeddings help identify the most likely intended entity.

Drug-Protein Interaction Prediction

Biomedical KGs (DrugBank, STRING, OMIM) encode known interactions between drugs, proteins, diseases, and genes. Missing interactions (drug-protein pairs that have not been experimentally tested) can be predicted by RotatE or ComplEx on the biomedical KG. This is used to prioritize which drug-protein pairs to test in expensive wet lab assays - reducing the experimental search space from millions of possible interactions to a manageable shortlist.

A concrete example: predict whether a drug molecule binds to a specific protein target by finding entities in embedding space that satisfy the (drug,binds to,protein)(drug, \text{binds to}, protein) triple pattern, even if that specific drug-protein interaction was never observed in training data.

Knowledge Graph Question Answering (KGQA)

Multi-hop KGQA requires following chains of reasoning: "Who is the spouse of the president who won the 2004 election?" requires knowing (Bush, winner of, 2004 election), then (Bush, spouse, Laura Bush). KG embeddings support this via relation composition: hr1r2?h \circ r_1 \circ r_2 \approx ?. Systems like NSM (Nested Search Module) and EmbedKGQA use KG embeddings as the core reasoning component.


Wikidata at Scale

Wikidata (100M+ entities, 1B+ statements) is the largest freely available structured knowledge base. Training RotatE on full Wikidata is non-trivial:

  • Scale: entity embedding table alone is 100M×2d100M \times 2d floats. At d=500d=500: 100B parameters just for entity embeddings. Not feasible for a single machine.
  • Solution: distributed training across multiple machines with entity embeddings sharded across workers. Frameworks like PyKEEN support this via embedding sharding.
  • Hierarchical relations: Wikidata has deep class hierarchies ("dog" is a "mammal" is an "animal"). TransE performs poorly on hierarchical relations because it cannot model the tree structure; hyperbolic embeddings (Poincaré) are better suited.
  • Training subsets: in practice, researchers use Wikidata subsets filtered to the most common entity types and relation types (e.g., WikiKG90Mv2 benchmark: 90M entities, 1335 relations).

YouTube Resources

TitleChannelFocus
Knowledge Graph Embeddings (CS224W)Stanford OnlineTransE · RotatE · link prediction theory
RotatE Paper ExplainedYannic KilcherRotatE math · complex rotation intuition
Knowledge Graphs and EmbeddingsUC Berkeley EECSKG structure · completion tasks · benchmark analysis
CompGCN: Composition-based GNNs for KGsGraph ML CommunityCompGCN architecture · multi-relational convolution
Knowledge Graph Enhanced NLPWeights and BiasesERNIE · K-BERT · knowledge injection into LMs

Common Pitfalls

:::danger Evaluating on FB15k (not FB15k-237) gives inflated results FB15k (the original Freebase benchmark from the TransE paper) has a critical flaw: many test triples can be trivially answered by inverting a training triple. If you see (h,r,t)(h, r, t) in training, the test triple (t,r1,h)(t, r^{-1}, h) is trivially answered by reversing. FB15k-237 removes this by filtering near-inverse triples. Always use FB15k-237 and WN18RR for fair comparisons. A model that beats state-of-the-art on FB15k may not improve at all on FB15k-237. :::

:::danger Not using the filtered evaluation setting In the filtered evaluation setting, when computing the rank of true tail tt for query (h,r,?)(h, r, ?), you mask out all other known true tails. Without filtering, your model gets penalized for predicting (h,r,t)(h, r, t') where (h,r,t)(h, r, t') is also a true triple - the model is correct but gets a bad rank. Always report filtered MRR and filtered Hits@k. Raw (unfiltered) metrics are meaningless for comparison. :::

:::warning TransE for symmetric relations always gives near-zero relation vectors If your KG has many symmetric relations and you use TransE, expect those relation vectors to collapse toward zero during training. The optimization forces h+r=th + r = t and t+r=ht + r = h simultaneously, so r0r \approx 0. The model "learns" these relations but with useless embeddings. Check relation embedding norms during training - if any are near zero, you have a symmetry problem. Switch to RotatE or ComplEx for KGs with symmetric relations. :::

:::warning Large entity embedding tables require careful memory management At d=500d = 500 dimensions, storing entity embeddings for 100K entities requires 100K×1000×4100K \times 1000 \times 4 bytes = 400 MB. For 1M entities: 4 GB just for entity embeddings. Standard practice: use 16-bit floats (half precision) to halve memory, use gradient checkpointing during backprop, and shard the embedding table across GPUs if needed. Many researchers underestimate this and run out of VRAM during training. :::


Interview Q&A

Q: Why does TransE fail for symmetric relations like "sibling of" or "married to"? Give the mathematical argument.

A: TransE models the constraint h+rth + r \approx t for valid triples. For a symmetric relation rr, both (A,r,B)(A, r, B) and (B,r,A)(B, r, A) must be true. TransE requires A+rBA + r \approx B (from the first triple) and B+rAB + r \approx A (from the second). Adding both constraints: A+r+B+rB+AA + r + B + r \approx B + A, which simplifies to 2r02r \approx 0, forcing r0r \approx 0. A zero relation vector means the model cannot distinguish this relation from any other relation that also has r0r \approx 0, and the scoring function reduces to ht-\|h - t\| - completely independent of the relation type. For a KG with many siblings, everyone ends up with nearly identical embeddings. In practice, TransE achieves near-zero performance on symmetric relations while still performing reasonably on antisymmetric ones - you can diagnose this by checking per-relation type MRR in your evaluation.

Q: Explain the RotatE rotation intuition. Why does representing relations as rotations handle composition naturally?

A: RotatE embeds entities in Cd\mathbb{C}^d and represents each relation as an element-wise rotation: ri=eiθr,ir_i = e^{i\theta_{r,i}} (unit complex number). A valid triple (h,r,t)(h, r, t) satisfies hr=th \circ r = t - rotating hh by angle θr\theta_r in each dimension lands on tt. Composition is natural because successive rotations compose by multiplication: if (h,r1,m)(h, r_1, m) is true (hr1=mh \circ r_1 = m) and (m,r2,t)(m, r_2, t) is true (mr2=tm \circ r_2 = t), then hr1r2=mr2=th \circ r_1 \circ r_2 = m \circ r_2 = t - the composed relation is simply the element-wise product of the two rotation vectors. Inversion is modeled by the complex conjugate: if hr=th \circ r = t, then trˉ=ht \circ \bar{r} = h (rotating back by the same angle). Symmetry is when θr,i=0\theta_{r,i} = 0 or π\pi (rotation by 0 or 180 degrees is its own inverse). This geometric framework unifies all four relation patterns in a single parameterization.

Q: What is the filtered evaluation setting for KG link prediction, and why does it matter?

A: In KG link prediction, for each test triple (h,r,t)(h, r, t), you rank all entities eEe \in \mathcal{E} by score f(h,r,e)f(h, r, e) and measure how high tt ranks. The problem: many other entities tt' may also be valid answers - (h,r,t)(h, r, t') may be a true triple in the train or test set. In the unfiltered (raw) setting, if your model ranks a valid triple (h,r,t)(h, r, t') above the test triple (h,r,t)(h, r, t), you get penalized - but the model is technically correct. The filtered setting removes all known true triples from the ranking before computing the rank of the test triple. This prevents unfair penalization and gives a cleaner measure of the model's ability to predict missing facts specifically. Always use filtered MRR and Hits@k for reporting and comparison. The gap between raw and filtered MRR can be substantial (often 0.05-0.15 MRR points) on dense KGs where many entities share the same relation to the head.

Q: How does CompGCN differ from simply running a GNN on a KG? What is the composition operation and why is it needed?

A: Standard GNNs (GCN, GAT) treat all edges as the same type. In a KG with 237 relation types, message passing that ignores the relation type would conflate "born in," "works at," and "founded by" - destroying the meaning of the multi-relational graph. CompGCN addresses this with a composition operation ϕ(hu,r)\phi(h_u, r) that combines the neighbor entity embedding with the relation embedding before aggregation: hv=f(W0hv+(u,r)Wλϕ(hu,r))h_v' = f(W_0 h_v + \sum_{(u,r)} W_\lambda \phi(h_u, r)). The composition ϕ\phi can be subtraction (hurh_u - r, inspired by TransE), multiplication (hurh_u * r, inspired by DistMult), or circular correlation. This allows different relation types to contribute different messages to the same target node. CompGCN also updates relation embeddings: r=Wrelrr' = W_{rel} r, so relations are refined by the graph convolution just like entities. The KGE decoder (RotatE, ConvE, etc.) is then applied to the final entity embeddings - benefiting from both multi-hop relational context (from GNN layers) and geometric structure (from the KGE scoring function).

Q: Distinguish knowledge graph completion from knowledge graph construction. When would you use each?

A: Knowledge graph completion (KGC) assumes you have an existing KG and want to predict missing facts (link prediction, triple classification). You use KGE models (TransE, RotatE, CompGCN) trained on the existing triples. The model generalizes by finding entities whose embeddings satisfy the geometric constraints implied by the relation, even if the specific triple was never observed. Knowledge graph construction starts from scratch - extracting entities and relations from text, web pages, tables, or structured sources. Techniques include named entity recognition (NER), relation extraction (RE), entity linking, and coreference resolution. You use KGC after construction - once you have a KG from extraction, it will be incomplete, and KGC fills the gaps. In practice: build a KG from your domain corpus using NLP extraction pipelines, then train RotatE or CompGCN on the resulting triples to predict missing links. For biomedical KG construction from literature: use BioBERT-based RE to extract drug-protein interactions from abstracts, build the KG, then use RotatE to predict interactions that were never mentioned in any paper but are implied by the embedding geometry.

Q: How would you use knowledge graph embeddings in a production question answering system?

A: A practical KGQA pipeline has three stages. First, entity linking: identify entities in the user question and link them to KG entities. "Who directed the film that won Best Picture in 1994?" links "Best Picture 1994" to the Forrest Gump entity in the KG. Second, relation path finding: use the KG structure and relation embeddings to find relevant reasoning paths. For single-hop questions, retrieve entities connected by the queried relation. For multi-hop, compose relation embeddings: hr1r2h \circ r_1 \circ r_2 to score entities reachable in two steps. Third, answer ranking: score candidate answers using the KGE model and return the top-ranked entity. For production: precompute entity embeddings, store in FAISS or a vector database. At query time, compose query embedding (hrh \circ r for the inferred relation), retrieve nearest entities from FAISS, return top result. Key engineering challenge: relation inference from natural language is hard - the model must map "directed" to the director_of relation, "won Best Picture in 1994" to a specific entity, and combine them correctly. Entity disambiguation and relation extraction quality are the bottlenecks, not the KGE scoring function.


Beyond RotatE - Recent Advances

Temporal Knowledge Graph Embeddings

Real-world facts have timestamps: (Albert Einstein,workAt,Princeton,[1933,1955])(Albert\ Einstein, \text{workAt}, Princeton, [1933, 1955]). Standard KGE models treat all facts as time-independent. Temporal KGE extends the scoring function to include time:

f(h,r,t,τ)=score(h,r(τ),t)f(h, r, t, \tau) = \text{score}(h, r(\tau), t)

where r(τ)r(\tau) is a time-dependent relation embedding. ATiSE uses Gaussian distributions over time to model temporal uncertainty. TComplEx extends ComplEx to 4D tensors with a time dimension. For biomedical KGs where drug approvals, clinical trial results, and disease associations have timestamps, temporal KGE dramatically improves prediction of future interactions.

Hyperbolic Knowledge Graph Embeddings

Many KG relations are hierarchical (is-a, subclass-of, part-of). Euclidean space distorts hierarchical structure - representing a tree in Rd\mathbb{R}^d requires exponential space. Hyperbolic space (Poincaré ball model) grows exponentially with radius, matching the exponential branching of trees.

MuRP (2019): extends TransE to hyperbolic space. Entities near the origin represent general concepts; entities near the boundary represent specific instances. The translation operation becomes Möbius addition in the Poincaré ball:

f(h,r,t)=dH(hr,t)f(h, r, t) = -d_{\mathbb{H}}(h \oplus r,\, t)

where \oplus is Möbius addition and dHd_{\mathbb{H}} is the Poincaré distance. MuRP achieves state-of-the-art on WN18RR (which has deep "hypernym" hierarchies) with significantly lower embedding dimension than Euclidean models.

Reasoning Chains - Multi-Hop Queries

Single-hop link prediction answers (h,r,?)(h, r, ?). Real questions require multi-hop reasoning: "What is the nationality of the spouse of the person who founded Apple?" requires following a chain (Steve Jobs,spouseOf,?)(Steve\ Jobs, \text{spouseOf}, ?), then (?,nationality,?)(?, \text{nationality}, ?).

Query2Box (2020): represents conjunctive logical queries as boxes in embedding space - the answer entities should be inside the box. Set intersection is modeled as box intersection. BetaE (2020): represents queries as Beta distributions, enabling reasoning under uncertainty.

These methods extend KGE from simple triple completion to general multi-hop query answering over incomplete KGs.


KGE Model Selection Guide

Quick selection rules:

  • Default choice for most KGs with mixed relation types: RotatE
  • Deep taxonomic hierarchies (WordNet-style): MuRP or RotH
  • KG with temporal data: TComplEx or ATiSE
  • Need multi-hop reasoning, not just link prediction: Query2Box or BetaE
  • Need GNN-style context aggregation: CompGCN with RotatE decoder
  • Very large KG (100M+ entities) with memory constraints: DistMult (4x cheaper than RotatE) or RotatE with half-precision embeddings and sharded training

Production Engineering Notes

Serving KGE at Scale

Entity embedding tables for production KGs are large. For a KG with 5M entities at d=500d=500 (RotatE stores 2d2d floats per entity): 5M×1000×45M \times 1000 \times 4 bytes = 20 GB. This does not fit in a single GPU's memory.

Embedding sharding: partition entity embeddings across multiple machines. During training, mini-batches that reference entities on different shards trigger network communication. Libraries like PyKEEN and DGL-KE support multi-GPU training with embedding sharding.

Approximate nearest neighbor retrieval: for (h,r,?)(h, r, ?) queries at inference, you need to find the entity tt^* that maximizes f(h,r,t)f(h, r, t). Brute force over all 5M entities is slow. Since RotatE's scoring function is approximately f(h,r,t)=hrtf(h, r, t) = -\|h \circ r - t\|, the query vector is q=hrq = h \circ r and you want the nearest entity to qq in embedding space. Use FAISS with the query vector and retrieve top-K candidates, then rerank with the exact RotatE score.

Caching hot entities: in practice, a small fraction of entities (popular countries, common occupations, frequent drug targets) appear in the vast majority of queries. Cache their embeddings in fast memory (Redis, in-process dict) for zero-latency retrieval.

import faiss
import numpy as np

def build_kge_faiss_index(entity_embeddings: np.ndarray, use_gpu: bool = True):
"""
Build FAISS index over entity embeddings for fast KGE retrieval.
entity_embeddings: [n_entities, 2*embed_dim] (RotatE real + imaginary)
"""
d = entity_embeddings.shape[1]

# Normalize for cosine similarity (RotatE score ~ -L2 distance in rotated space)
faiss.normalize_L2(entity_embeddings)

# IVF index for large entity sets
n_clusters = min(int(np.sqrt(len(entity_embeddings))), 4096)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, n_clusters, faiss.METRIC_L2)

if use_gpu:
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, index)

index.train(entity_embeddings)
index.add(entity_embeddings)
index.nprobe = 64 # search 64 clusters per query (tradeoff: speed vs recall)
return index


def kge_link_prediction_faiss(
model: RotatE,
head_id: int,
rel_id: int,
faiss_index,
top_k: int = 10,
device: str = 'cpu',
):
"""
Fast link prediction: (head, rel, ?) using FAISS ANN.
1. Compute query = h ∘ r (RotatE rotation)
2. Search FAISS for nearest entity embeddings
3. Rerank top-K with exact RotatE score
"""
model.eval()
with torch.no_grad():
h_id = torch.tensor([head_id], device=device)
r_id = torch.tensor([rel_id], device=device)

# Get real/imaginary parts of head
h_re, h_im = model._get_complex_embeddings(h_id)
cos_r, sin_r = model._get_relation_rotation(r_id)

# Rotated query vector: h ∘ r
q_re = (h_re * cos_r - h_im * sin_r).cpu().numpy() # [1, d]
q_im = (h_re * sin_r + h_im * cos_r).cpu().numpy() # [1, d]
query = np.concatenate([q_re, q_im], axis=-1) # [1, 2d]

faiss.normalize_L2(query)

# ANN search
distances, candidate_ids = faiss_index.search(query, top_k * 5)
candidates = candidate_ids[0]

# Exact rerank with RotatE score
with torch.no_grad():
h_ids = torch.tensor([head_id] * len(candidates), device=device)
r_ids = torch.tensor([rel_id] * len(candidates), device=device)
t_ids = torch.tensor(candidates.tolist(), device=device)
scores = model.score(h_ids, r_ids, t_ids).cpu().numpy()

ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
return [(entity_id, score) for entity_id, score in ranked[:top_k]]

Summary

Knowledge graph embeddings solve a fundamental problem: real-world knowledge graphs contain hundreds of billions of facts but are fundamentally incomplete. By embedding entities and relations in geometric spaces where valid triples satisfy measurable constraints, KGE models predict missing facts with high accuracy.

TransE established the translation-based paradigm but fails on symmetric and 1-N relations. ComplEx extended the approach to complex-valued embeddings, handling asymmetric relations. RotatE unified all four relational patterns - symmetry, antisymmetry, inversion, and composition - by treating relations as rotations in complex space. Self-adversarial negative sampling further improved training by focusing on the model's hardest mistakes.

CompGCN combined the best of both worlds: GNN-based multi-hop context aggregation over the multi-relational graph, feeding into a KGE decoder for triple scoring. This architecture achieves state-of-the-art on FB15k-237 by leveraging both local relational structure and the global embedding geometry.

Beyond the core models, the field continues to evolve: temporal KGE handles time-stamped facts, hyperbolic embeddings handle deep hierarchical relations, and multi-hop query embedding methods (Query2Box, BetaE) extend KGE from simple triple completion to general logical query answering over incomplete knowledge graphs.

In production, KGE models power Google's Knowledge Graph completions, biomedical drug-protein interaction prediction, and knowledge-enhanced language models. The combination of explicit structured knowledge from KGE with implicit parametric knowledge from language models is an active research frontier - one of the most promising paths toward AI systems that reason reliably about structured world knowledge.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Embedding Space Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.