GNNs for Recommender Systems
The Real Interview Moment
"Pinterest has 300 million users, 3 billion pins, and hundreds of billions of engagement signals. Their recommendation system runs in real time. How would you architect a GNN-based recommender for this scale?"
This is not hypothetical. In 2018, the Pinterest engineering team published PinSage - a GNN trained on a graph with 3 billion nodes and 18 billion edges, deployed in production to serve 300 million users. It was the first industrial-scale application of graph neural networks, and it forced the field to think seriously about what it actually takes to deploy GNNs beyond benchmark datasets with a few thousand nodes.
The fundamental insight behind GNN-based recommendation is simple but powerful. Traditional matrix factorization (MF) treats each user and item as an independent embedding to be learned. It captures first-order signals - user likes item because they interacted with it. But it misses multi-hop signals: user likes item because user liked item , and users who liked also liked , which is similar to , which shares properties with . This is the collaborative filtering signal that made Netflix Prize contestants' hair turn grey - it's real, it matters, and matrix factorization only captures it indirectly through the shared latent space.
A graph neural network on the user-item interaction graph propagates information directly along these multi-hop paths. After layers, a user embedding aggregates information from -hop neighborhood - including items-of-items-of-users and users-of-items-of-items. This is structured, principled multi-hop collaborative filtering, not a side effect of matrix factorization's optimization.
The engineering challenge is that user-item graphs are enormous, sparse, and dynamic. New users arrive with no interaction history. New items are added continuously. The full graph never fits in GPU memory. Building a GNN-based recommender that works in this environment requires solving problems that benchmark papers routinely ignore.
Why This Exists - The Limits of Matrix Factorization
Matrix factorization (MF) decomposes the user-item interaction matrix (M users, N items) into user embeddings and item embeddings , predicting interaction .
What MF gets right:
- Learns compressed representations of users and items
- Captures user taste clusters through the shared latent space
- Scalable: parameters
What MF misses:
- Multi-hop signals: MF does not explicitly propagate the "user A liked item B which was also liked by users similar to A" signal
- Graph structure: the user-item graph has rich topology - hub items (popular), niche items (few interactions but dedicated fans), user clusters - MF sees only the interaction matrix
- Higher-order connectivity: items are similar not just through shared users but through chains of shared users through other items
The classical symptom: MF performs well on popular items but poorly on long-tail items, because long-tail items have few direct interactions. A GNN can propagate information from structurally similar items even without direct co-interaction.
Historical Context
The sequence of GNN-based recommenders follows a clear progression:
| Year | Paper | Key Idea |
|---|---|---|
| 2018 | PinSage (Ying et al., Pinterest) | First billion-scale GNN in production; random walk neighborhood sampling |
| 2019 | NGCF (Wang et al., SIGIR) | Explicit embedding propagation on user-item graph; interaction-based messages |
| 2020 | LightGCN (He et al., SIGIR) | Remove feature transform + nonlinearity - just weighted sum; surprisingly beats NGCF |
| 2020 | DGCF (Wang et al., SIGIR) | Disentangled intent-based propagation |
| 2021 | SimGCL (Yu et al.) | Contrastive learning augmentation for GNN-based recsys |
| 2022 | UltraGCN (Mao et al.) | Approximate LightGCN without explicit message passing |
PinSage solved the engineering problem. NGCF showed that multi-hop propagation beats pure MF. LightGCN showed that simpler is better - a result that resonated through the entire field.
Graph Construction from Implicit Feedback
In real systems, you rarely have explicit ratings (1-5 stars). You have implicit feedback: clicks, purchases, watch time, saves, repins.
Bipartite Graph Construction
Build a bipartite graph where:
- = user nodes
- = item nodes
- = interaction edges ( if user interacted with item )
No edges exist between user-user or item-item pairs in the raw bipartite graph. Message passing on this graph naturally enables multi-hop collaborative filtering:
- 1-hop from user : items interacted with
- 2-hop from user : users who interacted with the same items as (similar users)
- 3-hop from user : items that similar users interacted with (collaborative filtering signal)
Multi-hop intuition: User A and User B both interacted with Item 2. That makes User A and User B structurally similar. User B also interacted with Item 3. Therefore Item 3 is likely relevant to User A - even though User A has never seen Item 3. This is precisely the 3-hop collaborative filtering signal a GNN captures by propagating over the bipartite graph.
Negative Sampling
You need negative examples (user-item pairs the user did not interact with) to train a ranking model. Several strategies:
Uniform random sampling: sample negative item uniformly at random from all non-interacted items. Simple but low signal - most negative samples are obviously irrelevant.
Popularity-biased sampling: sample negative items proportional to their interaction count where and is item interaction count. Harder negatives: popular items a user didn't interact with despite having opportunity.
Hard negative sampling: mine items with high predicted scores but no interaction - the model's current mistakes. Most informative, but requires periodic recomputation and risks training on false negatives (items the user would like but simply hasn't seen).
PinSage curriculum training: start with easy negatives, progressively introduce harder negatives as training proceeds. Curriculum learning stabilizes training and improves final performance.
NGCF - Neural Graph Collaborative Filtering
Wang et al. (2019) introduced NGCF as the first GNN specifically designed for collaborative filtering with explicit interaction-based messages.
NGCF Message Function
For a user-item edge , NGCF computes messages using an element-wise product:
The term is the key innovation: the message depends on the interaction between the user and item embeddings, not just the item embedding. This allows the model to learn which dimensions of the item embedding are important given the user's current preference state.
NGCF Update
After aggregating messages from all neighbors:
NGCF Final Prediction
Concatenate embeddings from all layers:
Predict interaction score:
Three propagation layers were found optimal. NGCF beat pure MF and many neural MF variants on standard benchmarks.
LightGCN - Simpler and Better
He et al. (2020) asked a sharp question: NGCF uses feature transformation and nonlinear activation at every layer. What happens if we remove them?
The answer was surprising: removing the transformation and activation consistently improved performance.
LightGCN Propagation Rule
LightGCN reduces the MPNN to its bare minimum - just weighted sum aggregation:
No weight matrix, no activation function. The normalization is the symmetric GCN normalization - fixed, not learned.
LightGCN Final Embedding
Combine representations from all layers:
Uniform layer combination (can also be learned). Final prediction:
Why Does Removing Complexity Help?
The ablation study in the LightGCN paper is illuminating:
| Model | Recall@20 | NDCG@10 |
|---|---|---|
| MF-BPR (no GNN) | 0.0454 | 0.0351 |
| NGCF | 0.0579 | 0.0477 |
| NGCF without W2 (no interaction term) | 0.0567 | 0.0458 |
| NGCF without nonlinearity | 0.0591 | 0.0488 |
| NGCF without both (= LightGCN) | 0.0639 | 0.0525 |
The feature transformation in NGCF introduces redundant parameters that interact poorly with the symmetric normalization. The nonlinearity does not help because the input embeddings (lookup table) are the only parameters being optimized - nonlinearity at every layer overcomplicates the gradient flow. LightGCN strips the architecture to its essential function: propagate embeddings along the graph to inject multi-hop collaborative signals.
Matrix Form of LightGCN
In matrix form, LightGCN is a simple spectral graph convolution:
where is the symmetrically normalized adjacency matrix of the bipartite graph, and contains all user and item embeddings at layer . The final embedding is the mean of layer embeddings:
This is a polynomial filter on the graph spectrum - LightGCN is a -th degree Chebyshev polynomial filter with uniform coefficients, applied to learned embeddings.
PinSage - Billion-Scale GNN at Pinterest
PinSage (Ying et al., 2018) is the industrial GNN that proved graph-based recommendation works at production scale. It operates on a graph with 3 billion pins (content items), 100+ billion edges, and must produce embeddings for new pins in milliseconds.
Importance-Based Neighborhood Sampling
Standard GNN neighborhood sampling (GraphSAGE-style) samples a fixed number of neighbors uniformly. For billion-scale graphs, this still touches enormous subgraphs. PinSage uses random walk importance sampling:
- From node , run random walks of length
- Count how many times each node is visited across all walks:
- The top- most visited nodes define the neighborhood
This is an approximation to Personalized PageRank (PPR) - the stationary distribution of a random walk starting at with restart probability . PPR naturally assigns higher importance to nodes that are structurally close (nearby in the graph) and popular (many paths lead to them). Using PPR-based neighborhoods instead of random sampling gives better-quality embeddings because high-PPR neighbors are genuinely similar, not just random close-by nodes.
PinSage Architecture
Input: pin content features (image embedding from ResNet, text from TF-IDF)
↓
Layer 1: Importance-weighted aggregation from top-50 neighbors
→ concatenate with self
→ linear + batch norm + ReLU
↓
Layer 2: Same with 2-hop importance-sampled neighborhoods
↓
Output: 256-dimensional L2-normalized pin embedding
PinSage uses content features (image embeddings, text features) as initial node features - not learned lookup embeddings. This enables cold-start: a new pin with no engagement history can still get an embedding from its content features. Contrast with LightGCN, which uses only learned embeddings and cannot handle new items.
MapReduce Pipeline
Training the full model on 3B nodes requires distributed computation:
- Offline preprocessing: compute Personalized PageRank neighborhoods for all pins using Apache Spark. Cached to disk.
- Mini-batch construction: each training batch samples a set of (pin, positive context, negative) triples. Subgraph is extracted from cached neighbors.
- GPU training: mini-batches of subgraphs run on GPU. Model parameters synchronized across workers.
- Inference: compute embeddings for all 3B pins in MapReduce - each worker handles a shard, uses the trained GNN to compute embeddings from neighbors' cached embeddings.
Curriculum Training
PinSage trains with progressively harder negatives:
- Epoch 1-5: random negatives (any non-interacted pin)
- Epoch 6-10: within-batch hard negatives (highest-scoring negatives from current batch)
- Epoch 11+: fully mined hard negatives (run inference, find pins closest to query but not interacted with)
This curriculum helps because the model needs to learn basic similarity before it can meaningfully mine hard negatives. Starting with hard negatives from a random model produces noisy gradient signal.
BPR Loss for GNN-Based Recommendation
All three models (NGCF, LightGCN, PinSage) use Bayesian Personalized Ranking (BPR) loss, which directly optimizes the pairwise ranking objective:
For training triple where user interacted with item (positive) but not item (negative):
is the sigmoid function. The loss pushes - the positive item should score higher than the negative. The L2 regularization is applied only to the initial embeddings (the only parameters in LightGCN) to prevent overfitting.
Why BPR over binary cross-entropy? Because recommendation is fundamentally a ranking problem. You do not care about the absolute scores, only the ordering. BPR directly optimizes pairwise ranking and converges faster than BCE for sparse implicit feedback.
Full LightGCN Implementation in PyTorch Geometric
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import LGConv
from torch_geometric.utils import degree
import numpy as np
from typing import Optional
class LightGCN(nn.Module):
"""
LightGCN: Light Graph Convolutional Network for Recommendation.
He et al., SIGIR 2020.
Architecture:
- Learnable user and item embedding table (initial embeddings)
- K rounds of simple weighted sum propagation on bipartite graph
- Final embedding = uniform average of all layer embeddings
- Prediction = dot product of final user and item embeddings
"""
def __init__(
self,
n_users: int,
n_items: int,
embed_dim: int = 64,
n_layers: int = 3,
reg_lambda: float = 1e-4,
):
super().__init__()
self.n_users = n_users
self.n_items = n_items
self.embed_dim = embed_dim
self.n_layers = n_layers
self.reg_lambda = reg_lambda
# Initial embeddings - the ONLY learnable parameters in LightGCN
self.user_embedding = nn.Embedding(n_users, embed_dim)
self.item_embedding = nn.Embedding(n_items, embed_dim)
# LGConv: simple normalized aggregation (no learned weights)
self.convs = nn.ModuleList([LGConv() for _ in range(n_layers)])
self._init_weights()
def _init_weights(self):
nn.init.normal_(self.user_embedding.weight, std=0.1)
nn.init.normal_(self.item_embedding.weight, std=0.1)
def forward(self, edge_index: torch.Tensor):
"""
edge_index: [2, E] - bipartite graph edges.
Convention: user nodes indexed 0..n_users-1,
item nodes indexed n_users..n_users+n_items-1.
Returns:
user_emb: [n_users, embed_dim] - final user embeddings
item_emb: [n_items, embed_dim] - final item embeddings
"""
# Stack initial embeddings: [n_users + n_items, embed_dim]
x = torch.cat([
self.user_embedding.weight,
self.item_embedding.weight,
], dim=0)
# Collect embeddings at each layer
all_embs = [x]
for conv in self.convs:
x = conv(x, edge_index) # normalized sum aggregation
all_embs.append(x)
# Final embedding: uniform mean across all layers (eq. 11 in paper)
stacked = torch.stack(all_embs, dim=1) # [N, K+1, d]
final_emb = stacked.mean(dim=1) # [N, d]
# Split back into user and item embeddings
user_emb = final_emb[:self.n_users] # [n_users, d]
item_emb = final_emb[self.n_users:] # [n_items, d]
return user_emb, item_emb
def predict(self, user_ids, item_ids, edge_index):
"""Predict scores for (user, item) pairs."""
user_emb, item_emb = self.forward(edge_index)
u = user_emb[user_ids] # [B, d]
i = item_emb[item_ids] # [B, d]
return (u * i).sum(dim=-1) # [B]
def bpr_loss(self, users, pos_items, neg_items, user_emb, item_emb):
"""
BPR loss with L2 regularization on initial embeddings.
users: [B] - user indices
pos_items: [B] - positive item indices
neg_items: [B] - negative item indices
"""
u_emb = user_emb[users]
pos_emb = item_emb[pos_items]
neg_emb = item_emb[neg_items]
# Pairwise scores
pos_scores = (u_emb * pos_emb).sum(dim=-1) # [B]
neg_scores = (u_emb * neg_emb).sum(dim=-1) # [B]
# BPR loss
bpr = -F.logsigmoid(pos_scores - neg_scores).mean()
# L2 regularization on INITIAL embeddings (not propagated)
u_init = self.user_embedding(users)
pi_init = self.item_embedding(pos_items)
ni_init = self.item_embedding(neg_items)
reg = self.reg_lambda * (
u_init.norm(2).pow(2) +
pi_init.norm(2).pow(2) +
ni_init.norm(2).pow(2)
) / len(users)
return bpr + reg
# ──────────────────────────────────────────────
# Training loop
# ──────────────────────────────────────────────
def build_bipartite_edge_index(
user_item_pairs: list, # [(user_id, item_id), ...]
n_users: int,
) -> torch.Tensor:
"""
Build bidirectional edge index for bipartite graph.
Item nodes are offset by n_users.
"""
users = [u for u, i in user_item_pairs]
items = [i + n_users for u, i in user_item_pairs]
# Bidirectional: user→item and item→user
src = torch.tensor(users + items, dtype=torch.long)
dst = torch.tensor(items + users, dtype=torch.long)
return torch.stack([src, dst], dim=0) # [2, 2*len(pairs)]
def train_lightgcn(
train_pairs: list,
n_users: int,
n_items: int,
embed_dim: int = 64,
n_layers: int = 3,
epochs: int = 100,
batch_size: int = 2048,
lr: float = 1e-3,
):
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LightGCN(n_users, n_items, embed_dim, n_layers).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
edge_index = build_bipartite_edge_index(train_pairs, n_users).to(device)
# Build positive item set per user for negative sampling
user_pos_items = {}
for u, i in train_pairs:
user_pos_items.setdefault(u, set()).add(i)
train_users = torch.tensor([u for u, _ in train_pairs], dtype=torch.long)
train_items = torch.tensor([i for _, i in train_pairs], dtype=torch.long)
best_loss = float('inf')
for epoch in range(epochs):
model.train()
# Shuffle training pairs
perm = torch.randperm(len(train_pairs))
train_users = train_users[perm]
train_items = train_items[perm]
total_loss = 0.0
n_batches = 0
for start in range(0, len(train_pairs), batch_size):
end = min(start + batch_size, len(train_pairs))
batch_users = train_users[start:end].to(device)
batch_pos = train_items[start:end].to(device)
# Popularity-biased negative sampling (simple version)
batch_neg = torch.randint(0, n_items, (end - start,), device=device)
# Forward pass
user_emb, item_emb = model(edge_index)
loss = model.bpr_loss(batch_users, batch_pos, batch_neg,
user_emb, item_emb)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
n_batches += 1
avg_loss = total_loss / n_batches
if avg_loss < best_loss:
best_loss = avg_loss
torch.save(model.state_dict(), 'best_lightgcn.pt')
if epoch % 10 == 0:
print(f"Epoch {epoch:3d} BPR Loss: {avg_loss:.4f} Best: {best_loss:.4f}")
return model
# ──────────────────────────────────────────────
# Evaluation: Recall@K and NDCG@K
# ──────────────────────────────────────────────
def evaluate_lightgcn(
model: LightGCN,
edge_index: torch.Tensor,
test_pairs: dict, # {user_id: [positive_item_ids]}
train_pairs: dict, # {user_id: [interacted_item_ids]} - to exclude
K: int = 20,
device: str = 'cpu',
):
model.eval()
model = model.to(device)
edge_index = edge_index.to(device)
with torch.no_grad():
user_emb, item_emb = model(edge_index)
recalls, ndcgs = [], []
for user_id, pos_items in test_pairs.items():
if not pos_items:
continue
u_vec = user_emb[user_id] # [d]
# Score all items
scores = torch.mv(item_emb, u_vec) # [n_items]
# Mask out training items
for train_item in train_pairs.get(user_id, []):
scores[train_item] = -1e9
# Top-K items
topk = torch.topk(scores, K).indices.cpu().tolist()
# Recall@K
hits = len(set(topk) & set(pos_items))
recall = hits / min(len(pos_items), K)
recalls.append(recall)
# NDCG@K
dcg = 0.0
for rank, item in enumerate(topk):
if item in pos_items:
dcg += 1.0 / np.log2(rank + 2)
ideal_dcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(pos_items), K)))
ndcgs.append(dcg / ideal_dcg if ideal_dcg > 0 else 0.0)
return {
f'Recall@{K}': float(np.mean(recalls)),
f'NDCG@{K}': float(np.mean(ndcgs)),
}
Cold-Start with GNNs
Cold-start is the recommendation problem for new users or items with no interaction history. Standard LightGCN (lookup embeddings only) cannot handle cold-start - there is no embedding to retrieve for a new node.
Item cold-start solutions:
-
Content features as initial embeddings (PinSage approach): use image embeddings, text embeddings, or item metadata as the initial node feature . New items immediately get embeddings from content alone, before any interactions. The GNN then refines these with collaborative signals as interactions accumulate.
-
Dropout-based augmentation (DropoutNet, Volkovs et al. 2017): randomly drop collaborative signals during training, forcing the model to use content features alone. At inference time for cold items, only content features are used.
-
Meta-learning (MAMO, Pan et al. 2019): treat each new user/item as a few-shot learning problem. Train a meta-learner that initializes embeddings from a few interactions.
User cold-start solutions:
- Session-based recommendation: use the first few clicks in a new session to compute an ad-hoc embedding via GNN message passing
- Demographic features: encode user attributes (country, device, signup date) as initial node features
- Popularity fallback: return globally popular items until enough interactions accumulate (typically 5-10 interactions)
Comparison: MF vs NGCF vs LightGCN vs PinSage
| Dimension | MF-BPR | NGCF | LightGCN | PinSage |
|---|---|---|---|---|
| Multi-hop signal | Implicit | Explicit (3 hops) | Explicit (K hops) | Explicit (2 hops) |
| Message function | None | None (topology only) | Weighted agg + dense | |
| Edge features | No | No | No | Yes (content) |
| Cold-start items | No | No | No | Yes (content features) |
| Scale (max tested) | Billions | Millions | Millions | 3 Billion nodes |
| Neighborhood sampling | N/A | Full graph | Full graph | PPR random walk |
| Recall@20 (Gowalla) | 0.1291 | 0.1547 | 0.1830 | N/A (different data) |
| Training complexity | $O( | E | \cdot d)$ | $O(K \cdot |
Production Deployment - Two-Stage Retrieval
In production, GNN-based recommendation follows a two-stage architecture:
Stage 1 - Retrieval: GNN-computed embeddings are indexed in FAISS (Facebook AI Similarity Search) with an IVF or HNSW index. At request time, the user's embedding is used to query the ANN index, retrieving the top-500 most similar items in under 10 milliseconds. GNN embeddings are recomputed offline periodically (daily or weekly) and pushed to the serving index.
Stage 2 - Reranking: The top-500 candidates from retrieval pass through a more expensive ranker with richer features: real-time user context (current session, recent clicks, time of day), item metadata, diversity constraints, business rules (no out-of-stock items, boost new items). This reranker typically runs in 50-100 milliseconds.
The GNN runs entirely offline. The online serving cost is just an ANN lookup plus the reranker - both are highly optimized.
Embedding refresh cadence:
- Pinterest: nightly full re-embedding of all 3B pins
- For dynamic systems: incremental updates using streaming GNNs (GraphSAINT, ROLAND)
Knowledge Graph-Enhanced Recommendation
KGCN (Wang et al., 2019) extends GNN-based recommendation by adding a knowledge graph of item attributes. Instead of only user-item interactions, the graph includes:
- User-item interactions (click, purchase)
- Item-attribute relationships from a KG (movie → genre → action, movie → director → Nolan)
Message passing on this enriched graph enables the model to learn that users who liked a Nolan film tend to like other Nolan films, not just films in the same genre - a more precise semantic similarity than collaborative filtering alone provides.
KG integration is especially valuable for:
- Long-tail items: few interactions but rich attribute graph
- Cross-domain recommendation: transfer across categories via shared KG entities
- Explainability: recommendation paths through the KG provide human-readable explanations
YouTube Resources
| Title | Channel | Focus |
|---|---|---|
| Graph Neural Networks for Recommendations (CS224W) | Stanford Online | GNN recsys theory · LightGCN · PinSage architecture |
| LightGCN Paper Explained | Aleksa Gordic | LightGCN derivation · ablation results · why simplicity wins |
| PinSage: Building Graph Convolutional Networks for Pinterest | Pinterest Engineering | PinSage engineering · curriculum training · production scale |
| Recommender Systems with GNNs | Weights and Biases | PyG implementation walkthrough · BPR loss |
| Graph-Based Recommender Systems at Scale | RecSys Conference | Industrial experience reports · engineering challenges |
Common Pitfalls
:::danger Using validation Recall@K on the training graph without masking When evaluating, you must mask out all training interactions before computing top-K recommendations. If you score items a user already interacted with, you inflate Recall@K - the model trivially retrieves items it was trained on. Always filter training positives from the ranked list before computing any retrieval metric. :::
:::danger Training LightGCN with too many layers on sparse graphs LightGCN with K=3 or K=4 layers works well on dense graphs. On sparse user-item graphs (fewer than 10 interactions per user), deeper propagation causes over-smoothing: user embeddings become indistinguishable because most users share similar sparse neighborhoods. For sparse graphs, use K=1 or K=2 layers. Run ablation on layer count - don't assume more layers means better. :::
:::warning Uniform negative sampling underestimates hard cases Uniform random negatives are too easy - the model quickly learns to score positive items above random items. This leads to fast convergence on easy cases but poor performance on hard cases (similar but non-interacted items). Use popularity-biased negatives (sample proportional to ) or in-batch negatives (treat other users' positive items in the same batch as negatives for the current user). :::
:::warning Offline metrics do not predict online A/B performance reliably Recall@20 and NDCG@10 are computed from historical data. They measure the model's ability to predict held-out historical interactions - not necessarily the interactions users would make if shown the model's recommendations. Online A/B tests regularly show cases where a model with lower offline Recall@20 has higher click-through rate because it surfaces novel, serendipitous items. Always run A/B tests before declaring a GNN improvement "real." :::
Interview Q&A
Q: Why do GNNs outperform matrix factorization for collaborative filtering? What structural information do they capture that MF misses?
A: GNNs explicitly propagate embedding information along the user-item graph, capturing multi-hop collaborative filtering signals. After layers, a user's embedding aggregates information from all users within hops - including users who interacted with the same items, and items those users also liked. MF captures this only implicitly through the shared latent space: two users end up with similar embeddings if they interact with similar items, but there is no direct propagation of structural information. The practical difference shows up on long-tail items: MF fails because long-tail items have few direct interactions, but a GNN can propagate similarity information from structurally adjacent items even when direct interaction count is low. On standard benchmarks like Gowalla and Yelp2018, LightGCN improves Recall@20 by roughly 18% over MF-BPR.
Q: LightGCN removes the weight matrix and nonlinear activation from NGCF and still outperforms it. Why?
A: The ablation study in He et al. (2020) shows that each component removed from NGCF improves performance: removing nonlinearity helps, removing the feature transformation helps, and removing the interaction term also helps. The reasons: (1) In LightGCN, the only learnable parameters are the initial embeddings - not weights in the message function. Adding weight matrices introduces redundant parameters that interact poorly with the symmetric normalization and overfit on smaller training sets. (2) Nonlinearity is unnecessary because the embeddings being propagated are already in a high-dimensional space; stacking linear propagation with linear combination of layer outputs is expressive enough. (3) The task is ranking, not regression - the model does not need to learn a complex functional form, just a good embedding space where users and items that interact are close. Simpler training objectives converge to better ranking solutions.
Q: How does PinSage scale to 3 billion nodes? What is importance-based neighborhood sampling?
A: PinSage uses three key engineering decisions. First, random walk importance sampling: instead of sampling a fixed random neighborhood, PinSage runs random walks of length from each target node and uses the top- most visited nodes as the neighborhood. This approximates Personalized PageRank, giving neighborhoods that are structurally more relevant than random samples. Second, MapReduce offline pipeline: Personalized PageRank neighborhoods are precomputed offline using Apache Spark for all 3B pins and cached to disk. Training reads from cached neighborhoods, never touching the full graph online. Third, content features for initial embeddings: using pretrained image and text embeddings as node features enables cold-start and means the GNN only needs to learn 2 layers of aggregation, not a full node embedding table for 3B items. Inference runs in MapReduce shards, each shard computing embeddings for a subset of pins using neighbors' cached embeddings.
Q: How do you handle the cold-start problem in GNN-based recommendation?
A: Standard LightGCN cannot handle cold-start - new users and items have no position in the interaction graph, so there is no embedding to retrieve. There are three main approaches. Content-based initialization (PinSage): use content features (image embeddings, text features, item metadata) as initial node features. New items immediately get embeddings from content before any interactions. The GNN refines these embeddings as interactions accumulate. Dropout-based regularization (DropoutNet): during training, randomly drop the collaborative signal for some users/items, forcing the model to rely on content features alone. At inference time, cold items use only content features. Meta-learning: treat new users as few-shot learning problems - train a meta-learner that initializes embeddings from 1-5 interactions. In practice, most production systems use a fallback strategy: return globally popular items for users with fewer than 5 interactions, then switch to the full GNN once enough data accumulates.
Q: Design a GNN-based recommendation system for a music streaming service with 50M users and 80M tracks. What are the key architectural decisions?
A: The key decisions and reasoning: (1) Graph construction: build a bipartite user-track interaction graph from listening history. Also add track-track edges from a knowledge graph (same artist, same album, similar audio features from a pretrained audio model). (2) Model choice: LightGCN with K=3 layers for the bipartite graph plus content features for cold-start tracks (new releases). Use contrastive learning augmentation (SimGCL) to handle the noise in listening data - a user can have a track play without actually liking it (passive listening). (3) Negative sampling: popularity-biased negatives (proportional to track play count) plus in-batch hard negatives from the current mini-batch. (4) Scalability: mini-batch training with GraphSAINT or ClusterGCN - the full user-track graph (50M × 80M) does not fit in memory. Run neighborhood sampling per mini-batch. (5) Retrieval: index track embeddings in FAISS with HNSW graph. At inference, retrieve top-500 candidates in under 5ms, rerank with a point-wise ranker that incorporates session context and time-of-day. (6) Evaluation: Recall@10, NDCG@10 on held-out listening events, plus online A/B metrics (stream completion rate, session length).
Q: What is the BPR loss and why is it preferred over binary cross-entropy for recommendation?
A: BPR (Bayesian Personalized Ranking) loss is where is a positive pair and is a negative pair. It directly optimizes pairwise ranking: positive items should score higher than negative items. Binary cross-entropy (BCE) treats each interaction as an independent binary classification: is this user-item pair positive (1) or negative (0)? For recommendation, BCE has two problems. First, implicit feedback is noisy: a non-interaction does not mean the user dislikes the item - they may simply not have been exposed to it. BCE treats all non-interactions as negatives, introducing massive label noise. BPR only requires relative ordering, which is more robust to this noise. Second, BCE optimizes absolute scores, but recommendation only cares about relative ranking - it does not matter whether the top item scores 0.9 or 0.7, only that it scores higher than the 501st item. BPR directly optimizes this objective and converges faster for sparse implicit feedback data.
Temporal Dynamics - Handling a Changing Graph
Real recommendation graphs are dynamic. New users arrive, new items are added, interaction patterns shift with seasons and trends. A model trained on a static snapshot of the graph degrades as the graph evolves.
The Staleness Problem
LightGCN embeddings are precomputed offline and cached. If embeddings are refreshed weekly, any user who signs up on day 1 of the week has stale embeddings by day 7. For high-velocity graphs (e-commerce during sales, news recommendation, Twitter trends), weekly refresh is far too slow.
Solutions:
-
Daily full re-embedding: recompute all embeddings nightly. Acceptable for most use cases. Pinterest refreshes daily.
-
Streaming GNNs (ROLAND, 2022): maintain a persistent model that can be updated incrementally as new edges arrive. Each new interaction triggers a local embedding update for the affected user and item, without recomputing the entire graph.
-
Two-level embedding: keep a "base" embedding from full weekly re-training and an "incremental" embedding from recent interactions that is updated more frequently. Combine both at inference time.
-
Temporal graph attention (TGAT): augment graph attention with timestamps - the weight depends not just on features but on the time elapsed since the interaction. Recent interactions receive higher weight.
Graph Structure Evolution
The topology of the user-item graph also changes. Popular items accumulate edges rapidly, increasing their degree and potentially causing their embeddings to over-dominate aggregation. Monitor:
- Degree distribution shift: if some items go from 100 to 100K interactions, the symmetric normalization provides some protection
- New user cold-start rate: if your user base grows 5% per week, 5% of recommendations will hit the cold-start path
- Item lifecycle: fashion items have short lifecycles, so temporal decay of interaction weights matters more in fashion than in music
Contrastive Learning for GNN Recommendation
A growing research direction augments BPR training with contrastive objectives to learn more robust embeddings. SimGCL (2022) adds uniform noise to graph embeddings to create augmented views and trains with InfoNCE contrastive loss alongside BPR:
where and are two augmented views of the same user embedding (created by adding different random noise), is temperature, and the sum is over all users in the batch.
The contrastive objective encourages each user's embedding to be consistent across augmentations (similar to itself) while being different from other users' embeddings. This reduces popularity bias - popular users dominate BPR loss, but contrastive loss treats all users equally.
SimGCL achieves state-of-the-art results on Gowalla, Yelp2018, and Amazon-Book benchmarks, improving LightGCN's NDCG@20 by 8-12% without any architectural changes - just better training.
Practical Evaluation Setup
Proper evaluation of a GNN-based recommender requires careful attention to data splitting:
def temporal_split(interactions, test_ratio=0.1, val_ratio=0.1):
"""
Temporal split: use the most recent interactions for test/val.
This simulates real deployment - the model never sees future interactions.
Avoid random splits, which leak future data into training.
"""
interactions = sorted(interactions, key=lambda x: x['timestamp'])
n = len(interactions)
n_test = int(n * test_ratio)
n_val = int(n * val_ratio)
train = interactions[:n - n_test - n_val]
val = interactions[n - n_test - n_val : n - n_test]
test = interactions[n - n_test:]
return train, val, test
def leave_one_out_split(user_interactions):
"""
Leave-one-out: for each user, hold out the last interaction for test.
Standard for papers but not recommended for production evaluation
(masks temporal leakage).
"""
train, test = {}, {}
for user_id, items in user_interactions.items():
if len(items) > 1:
train[user_id] = items[:-1]
test[user_id] = [items[-1]]
return train, test
:::tip Use temporal splits, not random splits Random splits are the most common evaluation mistake in recommender systems research. A random split of interactions will assign some future interactions to the training set and some past interactions to the test set - the model sees "future" data during training. This inflates all metrics and does not reflect real deployment. Always use temporal splits: train on all interactions before time , evaluate on interactions at time or later. :::
Summary
GNN-based recommendation is built on the insight that user-item interaction graphs encode multi-hop collaborative filtering signals that matrix factorization cannot explicitly leverage. The evolution from NGCF to LightGCN demonstrated a counterintuitive result: removing learned transformations and nonlinearities from the message passing layers consistently improves recommendation quality, because the task is about learning a good embedding space through graph propagation, not a complex functional mapping.
PinSage solved the engineering problem that academia ignored - how to train and serve a GNN on a graph with 3 billion nodes and 18 billion edges. The solutions (importance-based neighborhood sampling, MapReduce inference, curriculum training, content features for cold-start) became the blueprint for industrial GNN deployment.
In production, GNN embeddings serve as the backbone of two-stage retrieval systems: offline-computed embeddings are indexed in FAISS for fast ANN retrieval, followed by a lightweight online reranker that incorporates real-time context. The GNN never runs at serving time - only at embedding time, making latency requirements achievable at scale.
The next frontier in GNN recommendation is handling temporal dynamics - streaming GNN updates, temporal attention, and contrastive learning augmentation - to keep recommendation quality high as graphs evolve continuously at production scale.
Session-Based Recommendation with GNNs
Traditional collaborative filtering (LightGCN, PinSage) requires historical interactions for each user. Session-based recommendation handles a different scenario: a new or anonymous user produces a sequence of clicks in a single session, and you must recommend the next item based only on that session.
SR-GNN - Session Recommendation with GNNs
Shu Wu et al. (2019) modeled each session as a directed graph. In a session , items are nodes and transitions between consecutive items are directed edges. The edge from to gets weight 2 if the transition occurs twice in the session.
Message passing on this session graph propagates context:
After rounds, a soft-attention readout distinguishes the "current intent" (most recently clicked) from the "general preference" (average of all clicks). The recommendation score for each candidate item is the dot product with this session representation.
Why GNNs over RNNs for sessions: an RNN processes the sequence left-to-right and forgets distant items. A GNN on the session graph can directly propagate between any two items in the session, regardless of when they occurred. If a user clicks item at step 1, then , then again at step 5, the GNN naturally recognizes as high-importance through its high in-degree, whereas an RNN might underweight because of its early position.
Heterogeneous Graph Recommendation
Real recommendation systems have multiple node types and edge types, not just users and items. A movie platform has:
- Node types: Users, Movies, Actors, Directors, Genres
- Edge types: (User, watched, Movie), (Movie, starring, Actor), (Movie, directedBy, Director), (Movie, belongsTo, Genre)
Standard LightGCN cannot handle this - it assumes a homogeneous bipartite graph.
HAN - Heterogeneous Attention Network
Wang et al. (2019) introduced HAN, which performs message passing separately for each edge type (meta-path), then combines the results:
For a user, HAN computes:
- Aggregation along "User-watches-Movie" edges → gets similar users via shared movies
- Aggregation along "User-watches-Movie-Actor" meta-path → gets items with same actors as watched movies
- Aggregation along "User-watches-Movie-Genre" meta-path → genre preferences
Each meta-path aggregation produces a separate embedding. A meta-path attention layer then weights which meta-paths are most relevant for the current user:
where are learned attention weights over meta-paths.
RGCN - Relational GCN
Schlichtkrüll et al. (2018) proposed RGCN, which uses separate weight matrices per relation type:
Each relation type has its own weight matrix . Basis decomposition reduces parameters: (each is a linear combination of shared basis matrices ). RGCN is the multi-relational MPNN equivalent.
Diversity and Serendipity in GNN Recommendation
A pure relevance-maximizing recommender will converge to showing users only items very similar to what they already liked - the "filter bubble" problem. GNN-based systems exacerbate this because graph propagation inherently clusters similar items together in embedding space.
Maximal Marginal Relevance (MMR) re-ranks the top-K retrieved items for diversity:
where is the set of already-selected items and balances relevance vs diversity.
Determinantal Point Processes (DPP) select a set of items that are simultaneously high-relevance and mutually diverse, modeled via the determinant of the kernel matrix:
where and is a similarity kernel. Maximizing the determinant selects diverse, high-relevance subsets. DPPs are used in production at Netflix and Google to avoid filter bubbles.
GNN embeddings naturally serve as the kernel for DPP: where are GNN-computed item embeddings. Items that are close in GNN embedding space (high structural similarity from the interaction graph) are penalized by the DPP kernel, promoting diversity.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Embedding Space Explorer demo on the EngineersOfAI Playground - no code required.
:::
