Skip to main content

GNNs for Recommender Systems

The Real Interview Moment

"Pinterest has 300 million users, 3 billion pins, and hundreds of billions of engagement signals. Their recommendation system runs in real time. How would you architect a GNN-based recommender for this scale?"

This is not hypothetical. In 2018, the Pinterest engineering team published PinSage - a GNN trained on a graph with 3 billion nodes and 18 billion edges, deployed in production to serve 300 million users. It was the first industrial-scale application of graph neural networks, and it forced the field to think seriously about what it actually takes to deploy GNNs beyond benchmark datasets with a few thousand nodes.

The fundamental insight behind GNN-based recommendation is simple but powerful. Traditional matrix factorization (MF) treats each user and item as an independent embedding to be learned. It captures first-order signals - user uu likes item ii because they interacted with it. But it misses multi-hop signals: user uu likes item ii because user uu liked item jj, and users who liked jj also liked ii, which is similar to kk, which shares properties with ii. This is the collaborative filtering signal that made Netflix Prize contestants' hair turn grey - it's real, it matters, and matrix factorization only captures it indirectly through the shared latent space.

A graph neural network on the user-item interaction graph propagates information directly along these multi-hop paths. After kk layers, a user embedding aggregates information from kk-hop neighborhood - including items-of-items-of-users and users-of-items-of-items. This is structured, principled multi-hop collaborative filtering, not a side effect of matrix factorization's optimization.

The engineering challenge is that user-item graphs are enormous, sparse, and dynamic. New users arrive with no interaction history. New items are added continuously. The full graph never fits in GPU memory. Building a GNN-based recommender that works in this environment requires solving problems that benchmark papers routinely ignore.


Why This Exists - The Limits of Matrix Factorization

Matrix factorization (MF) decomposes the user-item interaction matrix RRM×NR \in \mathbb{R}^{M \times N} (M users, N items) into user embeddings PRM×dP \in \mathbb{R}^{M \times d} and item embeddings QRN×dQ \in \mathbb{R}^{N \times d}, predicting interaction r^ui=puTqi\hat{r}_{ui} = p_u^T q_i.

What MF gets right:

  • Learns compressed representations of users and items
  • Captures user taste clusters through the shared latent space
  • Scalable: O(M+N)O(M + N) parameters

What MF misses:

  • Multi-hop signals: MF does not explicitly propagate the "user A liked item B which was also liked by users similar to A" signal
  • Graph structure: the user-item graph has rich topology - hub items (popular), niche items (few interactions but dedicated fans), user clusters - MF sees only the interaction matrix
  • Higher-order connectivity: items are similar not just through shared users but through chains of shared users through other items

The classical symptom: MF performs well on popular items but poorly on long-tail items, because long-tail items have few direct interactions. A GNN can propagate information from structurally similar items even without direct co-interaction.


Historical Context

The sequence of GNN-based recommenders follows a clear progression:

YearPaperKey Idea
2018PinSage (Ying et al., Pinterest)First billion-scale GNN in production; random walk neighborhood sampling
2019NGCF (Wang et al., SIGIR)Explicit embedding propagation on user-item graph; interaction-based messages
2020LightGCN (He et al., SIGIR)Remove feature transform + nonlinearity - just weighted sum; surprisingly beats NGCF
2020DGCF (Wang et al., SIGIR)Disentangled intent-based propagation
2021SimGCL (Yu et al.)Contrastive learning augmentation for GNN-based recsys
2022UltraGCN (Mao et al.)Approximate LightGCN without explicit message passing

PinSage solved the engineering problem. NGCF showed that multi-hop propagation beats pure MF. LightGCN showed that simpler is better - a result that resonated through the entire field.


Graph Construction from Implicit Feedback

In real systems, you rarely have explicit ratings (1-5 stars). You have implicit feedback: clicks, purchases, watch time, saves, repins.

Bipartite Graph Construction

Build a bipartite graph G=(UI,E)G = (U \cup I, E) where:

  • UU = user nodes
  • II = item nodes
  • EE = interaction edges ((u,i)E(u, i) \in E if user uu interacted with item ii)

No edges exist between user-user or item-item pairs in the raw bipartite graph. Message passing on this graph naturally enables multi-hop collaborative filtering:

  • 1-hop from user uu: items uu interacted with
  • 2-hop from user uu: users who interacted with the same items as uu (similar users)
  • 3-hop from user uu: items that similar users interacted with (collaborative filtering signal)

Multi-hop intuition: User A and User B both interacted with Item 2. That makes User A and User B structurally similar. User B also interacted with Item 3. Therefore Item 3 is likely relevant to User A - even though User A has never seen Item 3. This is precisely the 3-hop collaborative filtering signal a GNN captures by propagating over the bipartite graph.

Negative Sampling

You need negative examples (user-item pairs the user did not interact with) to train a ranking model. Several strategies:

Uniform random sampling: sample negative item uniformly at random from all non-interacted items. Simple but low signal - most negative samples are obviously irrelevant.

Popularity-biased sampling: sample negative items proportional to their interaction count p(i)niαp(i^-) \propto n_i^\alpha where α[0,1]\alpha \in [0,1] and nin_i is item interaction count. Harder negatives: popular items a user didn't interact with despite having opportunity.

Hard negative sampling: mine items with high predicted scores but no interaction - the model's current mistakes. Most informative, but requires periodic recomputation and risks training on false negatives (items the user would like but simply hasn't seen).

PinSage curriculum training: start with easy negatives, progressively introduce harder negatives as training proceeds. Curriculum learning stabilizes training and improves final performance.


NGCF - Neural Graph Collaborative Filtering

Wang et al. (2019) introduced NGCF as the first GNN specifically designed for collaborative filtering with explicit interaction-based messages.

NGCF Message Function

For a user-item edge (u,i)(u, i), NGCF computes messages using an element-wise product:

mui=1NuNi(W1hi+W2(hihu))m_{u \leftarrow i} = \frac{1}{\sqrt{|\mathcal{N}_u| \cdot |\mathcal{N}_i|}} \left(W_1 h_i + W_2 (h_i \odot h_u)\right)

The W2(hihu)W_2 (h_i \odot h_u) term is the key innovation: the message depends on the interaction between the user and item embeddings, not just the item embedding. This allows the model to learn which dimensions of the item embedding are important given the user's current preference state.

NGCF Update

After aggregating messages from all neighbors:

hu(k+1)=LeakyReLU ⁣(W1hu(k)+iNumui(k))h_u^{(k+1)} = \text{LeakyReLU}\!\left(W_1 h_u^{(k)} + \sum_{i \in \mathcal{N}_u} m_{u \leftarrow i}^{(k)}\right)

NGCF Final Prediction

Concatenate embeddings from all KK layers:

eu=hu(0)hu(1)hu(K)e_u^* = h_u^{(0)} \| h_u^{(1)} \| \cdots \| h_u^{(K)}

Predict interaction score: y^ui=(eu)Tei\hat{y}_{ui} = (e_u^*)^T e_i^*

Three propagation layers were found optimal. NGCF beat pure MF and many neural MF variants on standard benchmarks.


LightGCN - Simpler and Better

He et al. (2020) asked a sharp question: NGCF uses feature transformation WW and nonlinear activation at every layer. What happens if we remove them?

The answer was surprising: removing the transformation and activation consistently improved performance.

LightGCN Propagation Rule

LightGCN reduces the MPNN to its bare minimum - just weighted sum aggregation:

eu(k+1)=iNu1NuNiei(k)e_u^{(k+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}}\, e_i^{(k)}

ei(k+1)=uNi1NiNueu(k)e_i^{(k+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i||\mathcal{N}_u|}}\, e_u^{(k)}

No weight matrix, no activation function. The normalization 1NuNi\frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}} is the symmetric GCN normalization - fixed, not learned.

LightGCN Final Embedding

Combine representations from all KK layers:

eu=k=0Kαkeu(k),αk=1K+1e_u^* = \sum_{k=0}^{K} \alpha_k\, e_u^{(k)}, \quad \alpha_k = \frac{1}{K+1}

Uniform layer combination (can also be learned). Final prediction:

y^ui=(eu)Tei\hat{y}_{ui} = (e_u^*)^T e_i^*

Why Does Removing Complexity Help?

The ablation study in the LightGCN paper is illuminating:

ModelRecall@20NDCG@10
MF-BPR (no GNN)0.04540.0351
NGCF0.05790.0477
NGCF without W2 (no interaction term)0.05670.0458
NGCF without nonlinearity0.05910.0488
NGCF without both (= LightGCN)0.06390.0525

The feature transformation in NGCF introduces redundant parameters that interact poorly with the symmetric normalization. The nonlinearity does not help because the input embeddings (lookup table) are the only parameters being optimized - nonlinearity at every layer overcomplicates the gradient flow. LightGCN strips the architecture to its essential function: propagate embeddings along the graph to inject multi-hop collaborative signals.

Matrix Form of LightGCN

In matrix form, LightGCN is a simple spectral graph convolution:

E(k+1)=A~E(k)E^{(k+1)} = \tilde{A}\, E^{(k)}

where A~\tilde{A} is the symmetrically normalized adjacency matrix of the bipartite graph, and E(k)E^{(k)} contains all user and item embeddings at layer kk. The final embedding is the mean of K+1K+1 layer embeddings:

E=1K+1k=0KA~kE(0)E^* = \frac{1}{K+1} \sum_{k=0}^{K} \tilde{A}^k E^{(0)}

This is a polynomial filter on the graph spectrum - LightGCN is a KK-th degree Chebyshev polynomial filter with uniform coefficients, applied to learned embeddings.


PinSage - Billion-Scale GNN at Pinterest

PinSage (Ying et al., 2018) is the industrial GNN that proved graph-based recommendation works at production scale. It operates on a graph with 3 billion pins (content items), 100+ billion edges, and must produce embeddings for new pins in milliseconds.

Importance-Based Neighborhood Sampling

Standard GNN neighborhood sampling (GraphSAGE-style) samples a fixed number of neighbors uniformly. For billion-scale graphs, this still touches enormous subgraphs. PinSage uses random walk importance sampling:

  1. From node vv, run rr random walks of length \ell
  2. Count how many times each node is visited across all walks: visits(u)\text{visits}(u)
  3. The top-TT most visited nodes define the neighborhood N(v)\mathcal{N}(v)

This is an approximation to Personalized PageRank (PPR) - the stationary distribution of a random walk starting at vv with restart probability α\alpha. PPR naturally assigns higher importance to nodes that are structurally close (nearby in the graph) and popular (many paths lead to them). Using PPR-based neighborhoods instead of random sampling gives better-quality embeddings because high-PPR neighbors are genuinely similar, not just random close-by nodes.

PinSage Architecture

Input: pin content features (image embedding from ResNet, text from TF-IDF)

Layer 1: Importance-weighted aggregation from top-50 neighbors
→ concatenate with self
→ linear + batch norm + ReLU

Layer 2: Same with 2-hop importance-sampled neighborhoods

Output: 256-dimensional L2-normalized pin embedding

PinSage uses content features (image embeddings, text features) as initial node features - not learned lookup embeddings. This enables cold-start: a new pin with no engagement history can still get an embedding from its content features. Contrast with LightGCN, which uses only learned embeddings and cannot handle new items.

MapReduce Pipeline

Training the full model on 3B nodes requires distributed computation:

  1. Offline preprocessing: compute Personalized PageRank neighborhoods for all pins using Apache Spark. Cached to disk.
  2. Mini-batch construction: each training batch samples a set of (pin, positive context, negative) triples. Subgraph is extracted from cached neighbors.
  3. GPU training: mini-batches of subgraphs run on GPU. Model parameters synchronized across workers.
  4. Inference: compute embeddings for all 3B pins in MapReduce - each worker handles a shard, uses the trained GNN to compute embeddings from neighbors' cached embeddings.

Curriculum Training

PinSage trains with progressively harder negatives:

  • Epoch 1-5: random negatives (any non-interacted pin)
  • Epoch 6-10: within-batch hard negatives (highest-scoring negatives from current batch)
  • Epoch 11+: fully mined hard negatives (run inference, find pins closest to query but not interacted with)

This curriculum helps because the model needs to learn basic similarity before it can meaningfully mine hard negatives. Starting with hard negatives from a random model produces noisy gradient signal.


BPR Loss for GNN-Based Recommendation

All three models (NGCF, LightGCN, PinSage) use Bayesian Personalized Ranking (BPR) loss, which directly optimizes the pairwise ranking objective:

For training triple (u,i,j)(u, i, j) where user uu interacted with item ii (positive) but not item jj (negative):

LBPR=(u,i,j)Olnσ ⁣(y^uiy^uj)+λE(0)2\mathcal{L}_{BPR} = \sum_{(u,i,j) \in \mathcal{O}} -\ln\sigma\!\left(\hat{y}_{ui} - \hat{y}_{uj}\right) + \lambda\|E^{(0)}\|^2

σ\sigma is the sigmoid function. The loss pushes y^ui>y^uj\hat{y}_{ui} > \hat{y}_{uj} - the positive item should score higher than the negative. The L2 regularization λE(0)2\lambda\|E^{(0)}\|^2 is applied only to the initial embeddings (the only parameters in LightGCN) to prevent overfitting.

Why BPR over binary cross-entropy? Because recommendation is fundamentally a ranking problem. You do not care about the absolute scores, only the ordering. BPR directly optimizes pairwise ranking and converges faster than BCE for sparse implicit feedback.


Full LightGCN Implementation in PyTorch Geometric

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import LGConv
from torch_geometric.utils import degree
import numpy as np
from typing import Optional


class LightGCN(nn.Module):
"""
LightGCN: Light Graph Convolutional Network for Recommendation.
He et al., SIGIR 2020.

Architecture:
- Learnable user and item embedding table (initial embeddings)
- K rounds of simple weighted sum propagation on bipartite graph
- Final embedding = uniform average of all layer embeddings
- Prediction = dot product of final user and item embeddings
"""

def __init__(
self,
n_users: int,
n_items: int,
embed_dim: int = 64,
n_layers: int = 3,
reg_lambda: float = 1e-4,
):
super().__init__()
self.n_users = n_users
self.n_items = n_items
self.embed_dim = embed_dim
self.n_layers = n_layers
self.reg_lambda = reg_lambda

# Initial embeddings - the ONLY learnable parameters in LightGCN
self.user_embedding = nn.Embedding(n_users, embed_dim)
self.item_embedding = nn.Embedding(n_items, embed_dim)

# LGConv: simple normalized aggregation (no learned weights)
self.convs = nn.ModuleList([LGConv() for _ in range(n_layers)])

self._init_weights()

def _init_weights(self):
nn.init.normal_(self.user_embedding.weight, std=0.1)
nn.init.normal_(self.item_embedding.weight, std=0.1)

def forward(self, edge_index: torch.Tensor):
"""
edge_index: [2, E] - bipartite graph edges.
Convention: user nodes indexed 0..n_users-1,
item nodes indexed n_users..n_users+n_items-1.

Returns:
user_emb: [n_users, embed_dim] - final user embeddings
item_emb: [n_items, embed_dim] - final item embeddings
"""
# Stack initial embeddings: [n_users + n_items, embed_dim]
x = torch.cat([
self.user_embedding.weight,
self.item_embedding.weight,
], dim=0)

# Collect embeddings at each layer
all_embs = [x]

for conv in self.convs:
x = conv(x, edge_index) # normalized sum aggregation
all_embs.append(x)

# Final embedding: uniform mean across all layers (eq. 11 in paper)
stacked = torch.stack(all_embs, dim=1) # [N, K+1, d]
final_emb = stacked.mean(dim=1) # [N, d]

# Split back into user and item embeddings
user_emb = final_emb[:self.n_users] # [n_users, d]
item_emb = final_emb[self.n_users:] # [n_items, d]

return user_emb, item_emb

def predict(self, user_ids, item_ids, edge_index):
"""Predict scores for (user, item) pairs."""
user_emb, item_emb = self.forward(edge_index)
u = user_emb[user_ids] # [B, d]
i = item_emb[item_ids] # [B, d]
return (u * i).sum(dim=-1) # [B]

def bpr_loss(self, users, pos_items, neg_items, user_emb, item_emb):
"""
BPR loss with L2 regularization on initial embeddings.

users: [B] - user indices
pos_items: [B] - positive item indices
neg_items: [B] - negative item indices
"""
u_emb = user_emb[users]
pos_emb = item_emb[pos_items]
neg_emb = item_emb[neg_items]

# Pairwise scores
pos_scores = (u_emb * pos_emb).sum(dim=-1) # [B]
neg_scores = (u_emb * neg_emb).sum(dim=-1) # [B]

# BPR loss
bpr = -F.logsigmoid(pos_scores - neg_scores).mean()

# L2 regularization on INITIAL embeddings (not propagated)
u_init = self.user_embedding(users)
pi_init = self.item_embedding(pos_items)
ni_init = self.item_embedding(neg_items)
reg = self.reg_lambda * (
u_init.norm(2).pow(2) +
pi_init.norm(2).pow(2) +
ni_init.norm(2).pow(2)
) / len(users)

return bpr + reg


# ──────────────────────────────────────────────
# Training loop
# ──────────────────────────────────────────────

def build_bipartite_edge_index(
user_item_pairs: list, # [(user_id, item_id), ...]
n_users: int,
) -> torch.Tensor:
"""
Build bidirectional edge index for bipartite graph.
Item nodes are offset by n_users.
"""
users = [u for u, i in user_item_pairs]
items = [i + n_users for u, i in user_item_pairs]

# Bidirectional: user→item and item→user
src = torch.tensor(users + items, dtype=torch.long)
dst = torch.tensor(items + users, dtype=torch.long)
return torch.stack([src, dst], dim=0) # [2, 2*len(pairs)]


def train_lightgcn(
train_pairs: list,
n_users: int,
n_items: int,
embed_dim: int = 64,
n_layers: int = 3,
epochs: int = 100,
batch_size: int = 2048,
lr: float = 1e-3,
):
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LightGCN(n_users, n_items, embed_dim, n_layers).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

edge_index = build_bipartite_edge_index(train_pairs, n_users).to(device)

# Build positive item set per user for negative sampling
user_pos_items = {}
for u, i in train_pairs:
user_pos_items.setdefault(u, set()).add(i)

train_users = torch.tensor([u for u, _ in train_pairs], dtype=torch.long)
train_items = torch.tensor([i for _, i in train_pairs], dtype=torch.long)

best_loss = float('inf')
for epoch in range(epochs):
model.train()
# Shuffle training pairs
perm = torch.randperm(len(train_pairs))
train_users = train_users[perm]
train_items = train_items[perm]

total_loss = 0.0
n_batches = 0

for start in range(0, len(train_pairs), batch_size):
end = min(start + batch_size, len(train_pairs))
batch_users = train_users[start:end].to(device)
batch_pos = train_items[start:end].to(device)

# Popularity-biased negative sampling (simple version)
batch_neg = torch.randint(0, n_items, (end - start,), device=device)

# Forward pass
user_emb, item_emb = model(edge_index)

loss = model.bpr_loss(batch_users, batch_pos, batch_neg,
user_emb, item_emb)
optimizer.zero_grad()
loss.backward()
optimizer.step()

total_loss += loss.item()
n_batches += 1

avg_loss = total_loss / n_batches
if avg_loss < best_loss:
best_loss = avg_loss
torch.save(model.state_dict(), 'best_lightgcn.pt')

if epoch % 10 == 0:
print(f"Epoch {epoch:3d} BPR Loss: {avg_loss:.4f} Best: {best_loss:.4f}")

return model


# ──────────────────────────────────────────────
# Evaluation: Recall@K and NDCG@K
# ──────────────────────────────────────────────

def evaluate_lightgcn(
model: LightGCN,
edge_index: torch.Tensor,
test_pairs: dict, # {user_id: [positive_item_ids]}
train_pairs: dict, # {user_id: [interacted_item_ids]} - to exclude
K: int = 20,
device: str = 'cpu',
):
model.eval()
model = model.to(device)
edge_index = edge_index.to(device)

with torch.no_grad():
user_emb, item_emb = model(edge_index)

recalls, ndcgs = [], []

for user_id, pos_items in test_pairs.items():
if not pos_items:
continue

u_vec = user_emb[user_id] # [d]

# Score all items
scores = torch.mv(item_emb, u_vec) # [n_items]

# Mask out training items
for train_item in train_pairs.get(user_id, []):
scores[train_item] = -1e9

# Top-K items
topk = torch.topk(scores, K).indices.cpu().tolist()

# Recall@K
hits = len(set(topk) & set(pos_items))
recall = hits / min(len(pos_items), K)
recalls.append(recall)

# NDCG@K
dcg = 0.0
for rank, item in enumerate(topk):
if item in pos_items:
dcg += 1.0 / np.log2(rank + 2)
ideal_dcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(pos_items), K)))
ndcgs.append(dcg / ideal_dcg if ideal_dcg > 0 else 0.0)

return {
f'Recall@{K}': float(np.mean(recalls)),
f'NDCG@{K}': float(np.mean(ndcgs)),
}

Cold-Start with GNNs

Cold-start is the recommendation problem for new users or items with no interaction history. Standard LightGCN (lookup embeddings only) cannot handle cold-start - there is no embedding to retrieve for a new node.

Item cold-start solutions:

  1. Content features as initial embeddings (PinSage approach): use image embeddings, text embeddings, or item metadata as the initial node feature xvx_v. New items immediately get embeddings from content alone, before any interactions. The GNN then refines these with collaborative signals as interactions accumulate.

  2. Dropout-based augmentation (DropoutNet, Volkovs et al. 2017): randomly drop collaborative signals during training, forcing the model to use content features alone. At inference time for cold items, only content features are used.

  3. Meta-learning (MAMO, Pan et al. 2019): treat each new user/item as a few-shot learning problem. Train a meta-learner that initializes embeddings from a few interactions.

User cold-start solutions:

  1. Session-based recommendation: use the first few clicks in a new session to compute an ad-hoc embedding via GNN message passing
  2. Demographic features: encode user attributes (country, device, signup date) as initial node features
  3. Popularity fallback: return globally popular items until enough interactions accumulate (typically 5-10 interactions)

Comparison: MF vs NGCF vs LightGCN vs PinSage

DimensionMF-BPRNGCFLightGCNPinSage
Multi-hop signalImplicitExplicit (3 hops)Explicit (K hops)Explicit (2 hops)
Message functionNoneW1hi+W2(hihu)W_1 h_i + W_2(h_i \odot h_u)None (topology only)Weighted agg + dense
Edge featuresNoNoNoYes (content)
Cold-start itemsNoNoNoYes (content features)
Scale (max tested)BillionsMillionsMillions3 Billion nodes
Neighborhood samplingN/AFull graphFull graphPPR random walk
Recall@20 (Gowalla)0.12910.15470.1830N/A (different data)
Training complexity$O(E\cdot d)$$O(K \cdot

Production Deployment - Two-Stage Retrieval

In production, GNN-based recommendation follows a two-stage architecture:

Stage 1 - Retrieval: GNN-computed embeddings are indexed in FAISS (Facebook AI Similarity Search) with an IVF or HNSW index. At request time, the user's embedding is used to query the ANN index, retrieving the top-500 most similar items in under 10 milliseconds. GNN embeddings are recomputed offline periodically (daily or weekly) and pushed to the serving index.

Stage 2 - Reranking: The top-500 candidates from retrieval pass through a more expensive ranker with richer features: real-time user context (current session, recent clicks, time of day), item metadata, diversity constraints, business rules (no out-of-stock items, boost new items). This reranker typically runs in 50-100 milliseconds.

The GNN runs entirely offline. The online serving cost is just an ANN lookup plus the reranker - both are highly optimized.

Embedding refresh cadence:

  • Pinterest: nightly full re-embedding of all 3B pins
  • For dynamic systems: incremental updates using streaming GNNs (GraphSAINT, ROLAND)

Knowledge Graph-Enhanced Recommendation

KGCN (Wang et al., 2019) extends GNN-based recommendation by adding a knowledge graph of item attributes. Instead of only user-item interactions, the graph includes:

  • User-item interactions (click, purchase)
  • Item-attribute relationships from a KG (movie → genre → action, movie → director → Nolan)

Message passing on this enriched graph enables the model to learn that users who liked a Nolan film tend to like other Nolan films, not just films in the same genre - a more precise semantic similarity than collaborative filtering alone provides.

KG integration is especially valuable for:

  • Long-tail items: few interactions but rich attribute graph
  • Cross-domain recommendation: transfer across categories via shared KG entities
  • Explainability: recommendation paths through the KG provide human-readable explanations

YouTube Resources

TitleChannelFocus
Graph Neural Networks for Recommendations (CS224W)Stanford OnlineGNN recsys theory · LightGCN · PinSage architecture
LightGCN Paper ExplainedAleksa GordicLightGCN derivation · ablation results · why simplicity wins
PinSage: Building Graph Convolutional Networks for PinterestPinterest EngineeringPinSage engineering · curriculum training · production scale
Recommender Systems with GNNsWeights and BiasesPyG implementation walkthrough · BPR loss
Graph-Based Recommender Systems at ScaleRecSys ConferenceIndustrial experience reports · engineering challenges

Common Pitfalls

:::danger Using validation Recall@K on the training graph without masking When evaluating, you must mask out all training interactions before computing top-K recommendations. If you score items a user already interacted with, you inflate Recall@K - the model trivially retrieves items it was trained on. Always filter training positives from the ranked list before computing any retrieval metric. :::

:::danger Training LightGCN with too many layers on sparse graphs LightGCN with K=3 or K=4 layers works well on dense graphs. On sparse user-item graphs (fewer than 10 interactions per user), deeper propagation causes over-smoothing: user embeddings become indistinguishable because most users share similar sparse neighborhoods. For sparse graphs, use K=1 or K=2 layers. Run ablation on layer count - don't assume more layers means better. :::

:::warning Uniform negative sampling underestimates hard cases Uniform random negatives are too easy - the model quickly learns to score positive items above random items. This leads to fast convergence on easy cases but poor performance on hard cases (similar but non-interacted items). Use popularity-biased negatives (sample proportional to ni0.75n_i^{0.75}) or in-batch negatives (treat other users' positive items in the same batch as negatives for the current user). :::

:::warning Offline metrics do not predict online A/B performance reliably Recall@20 and NDCG@10 are computed from historical data. They measure the model's ability to predict held-out historical interactions - not necessarily the interactions users would make if shown the model's recommendations. Online A/B tests regularly show cases where a model with lower offline Recall@20 has higher click-through rate because it surfaces novel, serendipitous items. Always run A/B tests before declaring a GNN improvement "real." :::


Interview Q&A

Q: Why do GNNs outperform matrix factorization for collaborative filtering? What structural information do they capture that MF misses?

A: GNNs explicitly propagate embedding information along the user-item graph, capturing multi-hop collaborative filtering signals. After KK layers, a user's embedding aggregates information from all users within KK hops - including users who interacted with the same items, and items those users also liked. MF captures this only implicitly through the shared latent space: two users end up with similar embeddings if they interact with similar items, but there is no direct propagation of structural information. The practical difference shows up on long-tail items: MF fails because long-tail items have few direct interactions, but a GNN can propagate similarity information from structurally adjacent items even when direct interaction count is low. On standard benchmarks like Gowalla and Yelp2018, LightGCN improves Recall@20 by roughly 18% over MF-BPR.

Q: LightGCN removes the weight matrix and nonlinear activation from NGCF and still outperforms it. Why?

A: The ablation study in He et al. (2020) shows that each component removed from NGCF improves performance: removing nonlinearity helps, removing the feature transformation WW helps, and removing the interaction term W2(hihu)W_2(h_i \odot h_u) also helps. The reasons: (1) In LightGCN, the only learnable parameters are the initial embeddings - not weights in the message function. Adding weight matrices introduces redundant parameters that interact poorly with the symmetric normalization and overfit on smaller training sets. (2) Nonlinearity is unnecessary because the embeddings being propagated are already in a high-dimensional space; stacking linear propagation with linear combination of layer outputs is expressive enough. (3) The task is ranking, not regression - the model does not need to learn a complex functional form, just a good embedding space where users and items that interact are close. Simpler training objectives converge to better ranking solutions.

Q: How does PinSage scale to 3 billion nodes? What is importance-based neighborhood sampling?

A: PinSage uses three key engineering decisions. First, random walk importance sampling: instead of sampling a fixed random neighborhood, PinSage runs r=50r=50 random walks of length =100\ell=100 from each target node and uses the top-TT most visited nodes as the neighborhood. This approximates Personalized PageRank, giving neighborhoods that are structurally more relevant than random samples. Second, MapReduce offline pipeline: Personalized PageRank neighborhoods are precomputed offline using Apache Spark for all 3B pins and cached to disk. Training reads from cached neighborhoods, never touching the full graph online. Third, content features for initial embeddings: using pretrained image and text embeddings as node features enables cold-start and means the GNN only needs to learn 2 layers of aggregation, not a full node embedding table for 3B items. Inference runs in MapReduce shards, each shard computing embeddings for a subset of pins using neighbors' cached embeddings.

Q: How do you handle the cold-start problem in GNN-based recommendation?

A: Standard LightGCN cannot handle cold-start - new users and items have no position in the interaction graph, so there is no embedding to retrieve. There are three main approaches. Content-based initialization (PinSage): use content features (image embeddings, text features, item metadata) as initial node features. New items immediately get embeddings from content before any interactions. The GNN refines these embeddings as interactions accumulate. Dropout-based regularization (DropoutNet): during training, randomly drop the collaborative signal for some users/items, forcing the model to rely on content features alone. At inference time, cold items use only content features. Meta-learning: treat new users as few-shot learning problems - train a meta-learner that initializes embeddings from 1-5 interactions. In practice, most production systems use a fallback strategy: return globally popular items for users with fewer than 5 interactions, then switch to the full GNN once enough data accumulates.

Q: Design a GNN-based recommendation system for a music streaming service with 50M users and 80M tracks. What are the key architectural decisions?

A: The key decisions and reasoning: (1) Graph construction: build a bipartite user-track interaction graph from listening history. Also add track-track edges from a knowledge graph (same artist, same album, similar audio features from a pretrained audio model). (2) Model choice: LightGCN with K=3 layers for the bipartite graph plus content features for cold-start tracks (new releases). Use contrastive learning augmentation (SimGCL) to handle the noise in listening data - a user can have a track play without actually liking it (passive listening). (3) Negative sampling: popularity-biased negatives (proportional to track play count) plus in-batch hard negatives from the current mini-batch. (4) Scalability: mini-batch training with GraphSAINT or ClusterGCN - the full user-track graph (50M × 80M) does not fit in memory. Run neighborhood sampling per mini-batch. (5) Retrieval: index track embeddings in FAISS with HNSW graph. At inference, retrieve top-500 candidates in under 5ms, rerank with a point-wise ranker that incorporates session context and time-of-day. (6) Evaluation: Recall@10, NDCG@10 on held-out listening events, plus online A/B metrics (stream completion rate, session length).

Q: What is the BPR loss and why is it preferred over binary cross-entropy for recommendation?

A: BPR (Bayesian Personalized Ranking) loss is L=(u,i,j)lnσ(y^uiy^uj)\mathcal{L} = \sum_{(u,i,j)} -\ln\sigma(\hat{y}_{ui} - \hat{y}_{uj}) where (u,i)(u, i) is a positive pair and (u,j)(u, j) is a negative pair. It directly optimizes pairwise ranking: positive items should score higher than negative items. Binary cross-entropy (BCE) treats each interaction as an independent binary classification: is this user-item pair positive (1) or negative (0)? For recommendation, BCE has two problems. First, implicit feedback is noisy: a non-interaction does not mean the user dislikes the item - they may simply not have been exposed to it. BCE treats all non-interactions as negatives, introducing massive label noise. BPR only requires relative ordering, which is more robust to this noise. Second, BCE optimizes absolute scores, but recommendation only cares about relative ranking - it does not matter whether the top item scores 0.9 or 0.7, only that it scores higher than the 501st item. BPR directly optimizes this objective and converges faster for sparse implicit feedback data.


Temporal Dynamics - Handling a Changing Graph

Real recommendation graphs are dynamic. New users arrive, new items are added, interaction patterns shift with seasons and trends. A model trained on a static snapshot of the graph degrades as the graph evolves.

The Staleness Problem

LightGCN embeddings are precomputed offline and cached. If embeddings are refreshed weekly, any user who signs up on day 1 of the week has stale embeddings by day 7. For high-velocity graphs (e-commerce during sales, news recommendation, Twitter trends), weekly refresh is far too slow.

Solutions:

  1. Daily full re-embedding: recompute all embeddings nightly. Acceptable for most use cases. Pinterest refreshes daily.

  2. Streaming GNNs (ROLAND, 2022): maintain a persistent model that can be updated incrementally as new edges arrive. Each new interaction triggers a local embedding update for the affected user and item, without recomputing the entire graph.

  3. Two-level embedding: keep a "base" embedding from full weekly re-training and an "incremental" embedding from recent interactions that is updated more frequently. Combine both at inference time.

  4. Temporal graph attention (TGAT): augment graph attention with timestamps - the weight αvu\alpha_{vu} depends not just on features but on the time elapsed since the interaction. Recent interactions receive higher weight.

Graph Structure Evolution

The topology of the user-item graph also changes. Popular items accumulate edges rapidly, increasing their degree and potentially causing their embeddings to over-dominate aggregation. Monitor:

  • Degree distribution shift: if some items go from 100 to 100K interactions, the symmetric normalization 1dudi\frac{1}{\sqrt{d_u d_i}} provides some protection
  • New user cold-start rate: if your user base grows 5% per week, 5% of recommendations will hit the cold-start path
  • Item lifecycle: fashion items have short lifecycles, so temporal decay of interaction weights matters more in fashion than in music

Contrastive Learning for GNN Recommendation

A growing research direction augments BPR training with contrastive objectives to learn more robust embeddings. SimGCL (2022) adds uniform noise to graph embeddings to create augmented views and trains with InfoNCE contrastive loss alongside BPR:

LCL=ulogexp(sim(zu,zu)/τ)vexp(sim(zu,zv)/τ)\mathcal{L}_{CL} = \sum_{u} -\log \frac{\exp(\text{sim}(z_u, z_u') / \tau)}{\sum_{v} \exp(\text{sim}(z_u, z_v') / \tau)}

where zuz_u and zuz_u' are two augmented views of the same user embedding (created by adding different random noise), τ\tau is temperature, and the sum is over all users in the batch.

The contrastive objective encourages each user's embedding to be consistent across augmentations (similar to itself) while being different from other users' embeddings. This reduces popularity bias - popular users dominate BPR loss, but contrastive loss treats all users equally.

SimGCL achieves state-of-the-art results on Gowalla, Yelp2018, and Amazon-Book benchmarks, improving LightGCN's NDCG@20 by 8-12% without any architectural changes - just better training.


Practical Evaluation Setup

Proper evaluation of a GNN-based recommender requires careful attention to data splitting:

def temporal_split(interactions, test_ratio=0.1, val_ratio=0.1):
"""
Temporal split: use the most recent interactions for test/val.
This simulates real deployment - the model never sees future interactions.
Avoid random splits, which leak future data into training.
"""
interactions = sorted(interactions, key=lambda x: x['timestamp'])
n = len(interactions)
n_test = int(n * test_ratio)
n_val = int(n * val_ratio)
train = interactions[:n - n_test - n_val]
val = interactions[n - n_test - n_val : n - n_test]
test = interactions[n - n_test:]
return train, val, test


def leave_one_out_split(user_interactions):
"""
Leave-one-out: for each user, hold out the last interaction for test.
Standard for papers but not recommended for production evaluation
(masks temporal leakage).
"""
train, test = {}, {}
for user_id, items in user_interactions.items():
if len(items) > 1:
train[user_id] = items[:-1]
test[user_id] = [items[-1]]
return train, test

:::tip Use temporal splits, not random splits Random splits are the most common evaluation mistake in recommender systems research. A random split of interactions will assign some future interactions to the training set and some past interactions to the test set - the model sees "future" data during training. This inflates all metrics and does not reflect real deployment. Always use temporal splits: train on all interactions before time TT, evaluate on interactions at time TT or later. :::


Summary

GNN-based recommendation is built on the insight that user-item interaction graphs encode multi-hop collaborative filtering signals that matrix factorization cannot explicitly leverage. The evolution from NGCF to LightGCN demonstrated a counterintuitive result: removing learned transformations and nonlinearities from the message passing layers consistently improves recommendation quality, because the task is about learning a good embedding space through graph propagation, not a complex functional mapping.

PinSage solved the engineering problem that academia ignored - how to train and serve a GNN on a graph with 3 billion nodes and 18 billion edges. The solutions (importance-based neighborhood sampling, MapReduce inference, curriculum training, content features for cold-start) became the blueprint for industrial GNN deployment.

In production, GNN embeddings serve as the backbone of two-stage retrieval systems: offline-computed embeddings are indexed in FAISS for fast ANN retrieval, followed by a lightweight online reranker that incorporates real-time context. The GNN never runs at serving time - only at embedding time, making latency requirements achievable at scale.

The next frontier in GNN recommendation is handling temporal dynamics - streaming GNN updates, temporal attention, and contrastive learning augmentation - to keep recommendation quality high as graphs evolve continuously at production scale.


Session-Based Recommendation with GNNs

Traditional collaborative filtering (LightGCN, PinSage) requires historical interactions for each user. Session-based recommendation handles a different scenario: a new or anonymous user produces a sequence of clicks in a single session, and you must recommend the next item based only on that session.

SR-GNN - Session Recommendation with GNNs

Shu Wu et al. (2019) modeled each session as a directed graph. In a session [v1,v2,v3,v2,v4][v_1, v_2, v_3, v_2, v_4], items are nodes and transitions between consecutive items are directed edges. The edge from v2v_2 to v3v_3 gets weight 2 if the transition v2v3v_2 \to v_3 occurs twice in the session.

Message passing on this session graph propagates context:

hv(k+1)=GRU ⁣(hv(k),uN(v)αuv(k)hu(k))h_v^{(k+1)} = \text{GRU}\!\left(h_v^{(k)},\, \sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} h_u^{(k)}\right)

After KK rounds, a soft-attention readout distinguishes the "current intent" (most recently clicked) from the "general preference" (average of all clicks). The recommendation score for each candidate item is the dot product with this session representation.

Why GNNs over RNNs for sessions: an RNN processes the sequence left-to-right and forgets distant items. A GNN on the session graph can directly propagate between any two items in the session, regardless of when they occurred. If a user clicks item AA at step 1, then B,C,DB, C, D, then AA again at step 5, the GNN naturally recognizes AA as high-importance through its high in-degree, whereas an RNN might underweight AA because of its early position.


Heterogeneous Graph Recommendation

Real recommendation systems have multiple node types and edge types, not just users and items. A movie platform has:

  • Node types: Users, Movies, Actors, Directors, Genres
  • Edge types: (User, watched, Movie), (Movie, starring, Actor), (Movie, directedBy, Director), (Movie, belongsTo, Genre)

Standard LightGCN cannot handle this - it assumes a homogeneous bipartite graph.

HAN - Heterogeneous Attention Network

Wang et al. (2019) introduced HAN, which performs message passing separately for each edge type (meta-path), then combines the results:

For a user, HAN computes:

  1. Aggregation along "User-watches-Movie" edges → gets similar users via shared movies
  2. Aggregation along "User-watches-Movie-Actor" meta-path → gets items with same actors as watched movies
  3. Aggregation along "User-watches-Movie-Genre" meta-path → genre preferences

Each meta-path aggregation produces a separate embedding. A meta-path attention layer then weights which meta-paths are most relevant for the current user:

hufinal=mβmhu(metapath m)h_u^{final} = \sum_{m} \beta_m \cdot h_u^{(meta-path\ m)}

where βm\beta_m are learned attention weights over meta-paths.

RGCN - Relational GCN

Schlichtkrüll et al. (2018) proposed RGCN, which uses separate weight matrices per relation type:

hv(k+1)=σ ⁣(W0(k)hv(k)+rRuNr(v)1Nr(v)Wr(k)hu(k))h_v^{(k+1)} = \sigma\!\left(W_0^{(k)} h_v^{(k)} + \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{|\mathcal{N}_r(v)|} W_r^{(k)} h_u^{(k)}\right)

Each relation type rr has its own weight matrix WrW_r. Basis decomposition reduces parameters: Wr=barbVbW_r = \sum_b a_{rb} V_b (each WrW_r is a linear combination of BB shared basis matrices VbV_b). RGCN is the multi-relational MPNN equivalent.


Diversity and Serendipity in GNN Recommendation

A pure relevance-maximizing recommender will converge to showing users only items very similar to what they already liked - the "filter bubble" problem. GNN-based systems exacerbate this because graph propagation inherently clusters similar items together in embedding space.

Maximal Marginal Relevance (MMR) re-ranks the top-K retrieved items for diversity:

MMR(i)=λrelevance(u,i)(1λ)maxjSsim(i,j)\text{MMR}(i) = \lambda \cdot \text{relevance}(u, i) - (1-\lambda) \cdot \max_{j \in S} \text{sim}(i, j)

where SS is the set of already-selected items and λ\lambda balances relevance vs diversity.

Determinantal Point Processes (DPP) select a set of items that are simultaneously high-relevance and mutually diverse, modeled via the determinant of the kernel matrix:

P(S)det(LS)P(S) \propto \det(L_S)

where Lij=relevance(u,i)k(i,j)relevance(u,j)L_{ij} = \text{relevance}(u, i) \cdot k(i, j) \cdot \text{relevance}(u, j) and k(i,j)k(i, j) is a similarity kernel. Maximizing the determinant selects diverse, high-relevance subsets. DPPs are used in production at Netflix and Google to avoid filter bubbles.

GNN embeddings naturally serve as the kernel for DPP: k(i,j)=exp(eiej2/σ2)k(i, j) = \exp(-\|e_i - e_j\|^2 / \sigma^2) where eie_i are GNN-computed item embeddings. Items that are close in GNN embedding space (high structural similarity from the interaction graph) are penalized by the DPP kernel, promoting diversity.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Embedding Space Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.