What is GNN recommendation?

How LightGCN, PinSage, and NGCF use graph neural networks on user-item interaction graphs to capture multi-hop collaborative filtering signals at billion-scale.

How does LightGCN work in practice?

GNNs for Recommender Systems covers GNN recommendation, LightGCN, PinSage from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/graph-neural-networks/gnns-for-recommendation

What is the difference between GNN recommendation and PinSage?

See the full breakdown at https://engineersofai.com/docs/ml/graph-neural-networks/gnns-for-recommendation

GNNs for Recommender Systems

The Real Interview Moment

"Pinterest has 300 million users, 3 billion pins, and hundreds of billions of engagement signals. Their recommendation system runs in real time. How would you architect a GNN-based recommender for this scale?"

This is not hypothetical. In 2018, the Pinterest engineering team published PinSage - a GNN trained on a graph with 3 billion nodes and 18 billion edges, deployed in production to serve 300 million users. It was the first industrial-scale application of graph neural networks, and it forced the field to think seriously about what it actually takes to deploy GNNs beyond benchmark datasets with a few thousand nodes.

The fundamental insight behind GNN-based recommendation is simple but powerful. Traditional matrix factorization (MF) treats each user and item as an independent embedding to be learned. It captures first-order signals - user $u$ likes item $i$ because they interacted with it. But it misses multi-hop signals: user $u$ likes item $i$ because user $u$ liked item $j$ , and users who liked $j$ also liked $i$ , which is similar to $k$ , which shares properties with $i$ . This is the collaborative filtering signal that made Netflix Prize contestants' hair turn grey - it's real, it matters, and matrix factorization only captures it indirectly through the shared latent space.

A graph neural network on the user-item interaction graph propagates information directly along these multi-hop paths. After $k$ layers, a user embedding aggregates information from $k$ -hop neighborhood - including items-of-items-of-users and users-of-items-of-items. This is structured, principled multi-hop collaborative filtering, not a side effect of matrix factorization's optimization.

The engineering challenge is that user-item graphs are enormous, sparse, and dynamic. New users arrive with no interaction history. New items are added continuously. The full graph never fits in GPU memory. Building a GNN-based recommender that works in this environment requires solving problems that benchmark papers routinely ignore.

Why This Exists - The Limits of Matrix Factorization

Matrix factorization (MF) decomposes the user-item interaction matrix $R \in \mathbb{R}^{M \times N}$ (M users, N items) into user embeddings $P \in \mathbb{R}^{M \times d}$ and item embeddings $Q \in \mathbb{R}^{N \times d}$ , predicting interaction $\hat{r}_{ui} = p_u^T q_i$ .

What MF gets right:

Learns compressed representations of users and items
Captures user taste clusters through the shared latent space
Scalable: $O(M + N)$ parameters

What MF misses:

Multi-hop signals: MF does not explicitly propagate the "user A liked item B which was also liked by users similar to A" signal
Graph structure: the user-item graph has rich topology - hub items (popular), niche items (few interactions but dedicated fans), user clusters - MF sees only the interaction matrix
Higher-order connectivity: items are similar not just through shared users but through chains of shared users through other items

The classical symptom: MF performs well on popular items but poorly on long-tail items, because long-tail items have few direct interactions. A GNN can propagate information from structurally similar items even without direct co-interaction.

Historical Context

The sequence of GNN-based recommenders follows a clear progression:

Year	Paper	Key Idea
2018	PinSage (Ying et al., Pinterest)	First billion-scale GNN in production; random walk neighborhood sampling
2019	NGCF (Wang et al., SIGIR)	Explicit embedding propagation on user-item graph; interaction-based messages
2020	LightGCN (He et al., SIGIR)	Remove feature transform + nonlinearity - just weighted sum; surprisingly beats NGCF
2020	DGCF (Wang et al., SIGIR)	Disentangled intent-based propagation
2021	SimGCL (Yu et al.)	Contrastive learning augmentation for GNN-based recsys
2022	UltraGCN (Mao et al.)	Approximate LightGCN without explicit message passing

PinSage solved the engineering problem. NGCF showed that multi-hop propagation beats pure MF. LightGCN showed that simpler is better - a result that resonated through the entire field.

Graph Construction from Implicit Feedback

In real systems, you rarely have explicit ratings (1-5 stars). You have implicit feedback: clicks, purchases, watch time, saves, repins.

Bipartite Graph Construction

Build a bipartite graph $G = (U \cup I, E)$ where:

$U$ = user nodes
$I$ = item nodes
$E$ = interaction edges ( $(u, i) \in E$ if user $u$ interacted with item $i$ )

No edges exist between user-user or item-item pairs in the raw bipartite graph. Message passing on this graph naturally enables multi-hop collaborative filtering:

1-hop from user $u$ : items $u$ interacted with
2-hop from user $u$ : users who interacted with the same items as $u$ (similar users)
3-hop from user $u$ : items that similar users interacted with (collaborative filtering signal)

Multi-hop intuition: User A and User B both interacted with Item 2. That makes User A and User B structurally similar. User B also interacted with Item 3. Therefore Item 3 is likely relevant to User A - even though User A has never seen Item 3. This is precisely the 3-hop collaborative filtering signal a GNN captures by propagating over the bipartite graph.

Negative Sampling

You need negative examples (user-item pairs the user did not interact with) to train a ranking model. Several strategies:

Uniform random sampling: sample negative item uniformly at random from all non-interacted items. Simple but low signal - most negative samples are obviously irrelevant.

Popularity-biased sampling: sample negative items proportional to their interaction count $p(i^-) \propto n_i^\alpha$ where $\alpha \in [0,1]$ and $n_i$ is item interaction count. Harder negatives: popular items a user didn't interact with despite having opportunity.

Hard negative sampling: mine items with high predicted scores but no interaction - the model's current mistakes. Most informative, but requires periodic recomputation and risks training on false negatives (items the user would like but simply hasn't seen).

PinSage curriculum training: start with easy negatives, progressively introduce harder negatives as training proceeds. Curriculum learning stabilizes training and improves final performance.

NGCF - Neural Graph Collaborative Filtering

Wang et al. (2019) introduced NGCF as the first GNN specifically designed for collaborative filtering with explicit interaction-based messages.

NGCF Message Function

For a user-item edge $(u, i)$ , NGCF computes messages using an element-wise product:

$m_{u \leftarrow i} = \frac{1}{\sqrt{|\mathcal{N}_u| \cdot |\mathcal{N}_i|}} \left(W_1 h_i + W_2 (h_i \odot h_u)\right)$

The $W_2 (h_i \odot h_u)$ term is the key innovation: the message depends on the interaction between the user and item embeddings, not just the item embedding. This allows the model to learn which dimensions of the item embedding are important given the user's current preference state.

NGCF Update

After aggregating messages from all neighbors:

$h_u^{(k+1)} = \text{LeakyReLU}\!\left(W_1 h_u^{(k)} + \sum_{i \in \mathcal{N}_u} m_{u \leftarrow i}^{(k)}\right)$

NGCF Final Prediction

Concatenate embeddings from all $K$ layers:

$e_u^* = h_u^{(0)} \| h_u^{(1)} \| \cdots \| h_u^{(K)}$

Predict interaction score: $\hat{y}_{ui} = (e_u^*)^T e_i^*$

Three propagation layers were found optimal. NGCF beat pure MF and many neural MF variants on standard benchmarks.

LightGCN - Simpler and Better

He et al. (2020) asked a sharp question: NGCF uses feature transformation $W$ and nonlinear activation at every layer. What happens if we remove them?

The answer was surprising: removing the transformation and activation consistently improved performance.

LightGCN Propagation Rule

LightGCN reduces the MPNN to its bare minimum - just weighted sum aggregation:

$e_u^{(k+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}}\, e_i^{(k)}$

$e_i^{(k+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i||\mathcal{N}_u|}}\, e_u^{(k)}$

No weight matrix, no activation function. The normalization $\frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}}$ is the symmetric GCN normalization - fixed, not learned.

LightGCN Final Embedding

Combine representations from all $K$ layers:

$e_u^* = \sum_{k=0}^{K} \alpha_k\, e_u^{(k)}, \quad \alpha_k = \frac{1}{K+1}$

Uniform layer combination (can also be learned). Final prediction:

$\hat{y}_{ui} = (e_u^*)^T e_i^*$

Why Does Removing Complexity Help?

The ablation study in the LightGCN paper is illuminating:

Model	Recall@20	NDCG@10
MF-BPR (no GNN)	0.0454	0.0351
NGCF	0.0579	0.0477
NGCF without W2 (no interaction term)	0.0567	0.0458
NGCF without nonlinearity	0.0591	0.0488
NGCF without both (= LightGCN)	0.0639	0.0525

The feature transformation in NGCF introduces redundant parameters that interact poorly with the symmetric normalization. The nonlinearity does not help because the input embeddings (lookup table) are the only parameters being optimized - nonlinearity at every layer overcomplicates the gradient flow. LightGCN strips the architecture to its essential function: propagate embeddings along the graph to inject multi-hop collaborative signals.

Matrix Form of LightGCN

In matrix form, LightGCN is a simple spectral graph convolution:

$E^{(k+1)} = \tilde{A}\, E^{(k)}$

where $\tilde{A}$ is the symmetrically normalized adjacency matrix of the bipartite graph, and $E^{(k)}$ contains all user and item embeddings at layer $k$ . The final embedding is the mean of $K+1$ layer embeddings:

$E^* = \frac{1}{K+1} \sum_{k=0}^{K} \tilde{A}^k E^{(0)}$

This is a polynomial filter on the graph spectrum - LightGCN is a $K$ -th degree Chebyshev polynomial filter with uniform coefficients, applied to learned embeddings.

PinSage - Billion-Scale GNN at Pinterest

PinSage (Ying et al., 2018) is the industrial GNN that proved graph-based recommendation works at production scale. It operates on a graph with 3 billion pins (content items), 100+ billion edges, and must produce embeddings for new pins in milliseconds.

Importance-Based Neighborhood Sampling

Standard GNN neighborhood sampling (GraphSAGE-style) samples a fixed number of neighbors uniformly. For billion-scale graphs, this still touches enormous subgraphs. PinSage uses random walk importance sampling:

From node $v$ , run $r$ random walks of length $\ell$
Count how many times each node is visited across all walks: $\text{visits}(u)$
The top- $T$ most visited nodes define the neighborhood $\mathcal{N}(v)$

This is an approximation to Personalized PageRank (PPR) - the stationary distribution of a random walk starting at $v$ with restart probability $\alpha$ . PPR naturally assigns higher importance to nodes that are structurally close (nearby in the graph) and popular (many paths lead to them). Using PPR-based neighborhoods instead of random sampling gives better-quality embeddings because high-PPR neighbors are genuinely similar, not just random close-by nodes.

PinSage Architecture

Input: pin content features (image embedding from ResNet, text from TF-IDF)
  ↓
Layer 1: Importance-weighted aggregation from top-50 neighbors
         → concatenate with self
         → linear + batch norm + ReLU
  ↓
Layer 2: Same with 2-hop importance-sampled neighborhoods
  ↓
Output: 256-dimensional L2-normalized pin embedding

PinSage uses content features (image embeddings, text features) as initial node features - not learned lookup embeddings. This enables cold-start: a new pin with no engagement history can still get an embedding from its content features. Contrast with LightGCN, which uses only learned embeddings and cannot handle new items.

MapReduce Pipeline

Training the full model on 3B nodes requires distributed computation:

Offline preprocessing: compute Personalized PageRank neighborhoods for all pins using Apache Spark. Cached to disk.
Mini-batch construction: each training batch samples a set of (pin, positive context, negative) triples. Subgraph is extracted from cached neighbors.
GPU training: mini-batches of subgraphs run on GPU. Model parameters synchronized across workers.
Inference: compute embeddings for all 3B pins in MapReduce - each worker handles a shard, uses the trained GNN to compute embeddings from neighbors' cached embeddings.

Curriculum Training

PinSage trains with progressively harder negatives:

Epoch 1-5: random negatives (any non-interacted pin)
Epoch 6-10: within-batch hard negatives (highest-scoring negatives from current batch)
Epoch 11+: fully mined hard negatives (run inference, find pins closest to query but not interacted with)

This curriculum helps because the model needs to learn basic similarity before it can meaningfully mine hard negatives. Starting with hard negatives from a random model produces noisy gradient signal.

BPR Loss for GNN-Based Recommendation

All three models (NGCF, LightGCN, PinSage) use Bayesian Personalized Ranking (BPR) loss, which directly optimizes the pairwise ranking objective:

For training triple $(u, i, j)$ where user $u$ interacted with item $i$ (positive) but not item $j$ (negative):

$\mathcal{L}_{BPR} = \sum_{(u,i,j) \in \mathcal{O}} -\ln\sigma\!\left(\hat{y}_{ui} - \hat{y}_{uj}\right) + \lambda\|E^{(0)}\|^2$

$\sigma$ is the sigmoid function. The loss pushes $\hat{y}_{ui} > \hat{y}_{uj}$ - the positive item should score higher than the negative. The L2 regularization $\lambda\|E^{(0)}\|^2$ is applied only to the initial embeddings (the only parameters in LightGCN) to prevent overfitting.

Why BPR over binary cross-entropy? Because recommendation is fundamentally a ranking problem. You do not care about the absolute scores, only the ordering. BPR directly optimizes pairwise ranking and converges faster than BCE for sparse implicit feedback.

Full LightGCN Implementation in PyTorch Geometric

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import LGConv
from torch_geometric.utils import degree
import numpy as np
from typing import Optional


class LightGCN(nn.Module):
    """
    LightGCN: Light Graph Convolutional Network for Recommendation.
    He et al., SIGIR 2020.

    Architecture:
    - Learnable user and item embedding table (initial embeddings)
    - K rounds of simple weighted sum propagation on bipartite graph
    - Final embedding = uniform average of all layer embeddings
    - Prediction = dot product of final user and item embeddings
    """

    def __init__(
        self,
        n_users: int,
        n_items: int,
        embed_dim: int = 64,
        n_layers: int = 3,
        reg_lambda: float = 1e-4,
    ):
        super().__init__()
        self.n_users = n_users
        self.n_items = n_items
        self.embed_dim = embed_dim
        self.n_layers = n_layers
        self.reg_lambda = reg_lambda

        # Initial embeddings - the ONLY learnable parameters in LightGCN
        self.user_embedding = nn.Embedding(n_users, embed_dim)
        self.item_embedding = nn.Embedding(n_items, embed_dim)

        # LGConv: simple normalized aggregation (no learned weights)
        self.convs = nn.ModuleList([LGConv() for _ in range(n_layers)])

        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.user_embedding.weight, std=0.1)
        nn.init.normal_(self.item_embedding.weight, std=0.1)

    def forward(self, edge_index: torch.Tensor):
        """
        edge_index: [2, E] - bipartite graph edges.
                    Convention: user nodes indexed 0..n_users-1,
                    item nodes indexed n_users..n_users+n_items-1.

        Returns:
            user_emb: [n_users, embed_dim] - final user embeddings
            item_emb: [n_items, embed_dim] - final item embeddings
        """
        # Stack initial embeddings: [n_users + n_items, embed_dim]
        x = torch.cat([
            self.user_embedding.weight,
            self.item_embedding.weight,
        ], dim=0)

        # Collect embeddings at each layer
        all_embs = [x]

        for conv in self.convs:
            x = conv(x, edge_index)   # normalized sum aggregation
            all_embs.append(x)

        # Final embedding: uniform mean across all layers (eq. 11 in paper)
        stacked = torch.stack(all_embs, dim=1)          # [N, K+1, d]
        final_emb = stacked.mean(dim=1)                  # [N, d]

        # Split back into user and item embeddings
        user_emb = final_emb[:self.n_users]              # [n_users, d]
        item_emb = final_emb[self.n_users:]              # [n_items, d]

        return user_emb, item_emb

    def predict(self, user_ids, item_ids, edge_index):
        """Predict scores for (user, item) pairs."""
        user_emb, item_emb = self.forward(edge_index)
        u = user_emb[user_ids]   # [B, d]
        i = item_emb[item_ids]   # [B, d]
        return (u * i).sum(dim=-1)  # [B]

    def bpr_loss(self, users, pos_items, neg_items, user_emb, item_emb):
        """
        BPR loss with L2 regularization on initial embeddings.

        users:     [B] - user indices
        pos_items: [B] - positive item indices
        neg_items: [B] - negative item indices
        """
        u_emb = user_emb[users]
        pos_emb = item_emb[pos_items]
        neg_emb = item_emb[neg_items]

        # Pairwise scores
        pos_scores = (u_emb * pos_emb).sum(dim=-1)  # [B]
        neg_scores = (u_emb * neg_emb).sum(dim=-1)  # [B]

        # BPR loss
        bpr = -F.logsigmoid(pos_scores - neg_scores).mean()

        # L2 regularization on INITIAL embeddings (not propagated)
        u_init = self.user_embedding(users)
        pi_init = self.item_embedding(pos_items)
        ni_init = self.item_embedding(neg_items)
        reg = self.reg_lambda * (
            u_init.norm(2).pow(2) +
            pi_init.norm(2).pow(2) +
            ni_init.norm(2).pow(2)
        ) / len(users)

        return bpr + reg


# ──────────────────────────────────────────────
# Training loop
# ──────────────────────────────────────────────

def build_bipartite_edge_index(
    user_item_pairs: list,  # [(user_id, item_id), ...]
    n_users: int,
) -> torch.Tensor:
    """
    Build bidirectional edge index for bipartite graph.
    Item nodes are offset by n_users.
    """
    users = [u for u, i in user_item_pairs]
    items = [i + n_users for u, i in user_item_pairs]

    # Bidirectional: user→item and item→user
    src = torch.tensor(users + items, dtype=torch.long)
    dst = torch.tensor(items + users, dtype=torch.long)
    return torch.stack([src, dst], dim=0)  # [2, 2*len(pairs)]


def train_lightgcn(
    train_pairs: list,
    n_users: int,
    n_items: int,
    embed_dim: int = 64,
    n_layers: int = 3,
    epochs: int = 100,
    batch_size: int = 2048,
    lr: float = 1e-3,
):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    model = LightGCN(n_users, n_items, embed_dim, n_layers).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    edge_index = build_bipartite_edge_index(train_pairs, n_users).to(device)

    # Build positive item set per user for negative sampling
    user_pos_items = {}
    for u, i in train_pairs:
        user_pos_items.setdefault(u, set()).add(i)

    train_users = torch.tensor([u for u, _ in train_pairs], dtype=torch.long)
    train_items = torch.tensor([i for _, i in train_pairs], dtype=torch.long)

    best_loss = float('inf')
    for epoch in range(epochs):
        model.train()
        # Shuffle training pairs
        perm = torch.randperm(len(train_pairs))
        train_users = train_users[perm]
        train_items = train_items[perm]

        total_loss = 0.0
        n_batches = 0

        for start in range(0, len(train_pairs), batch_size):
            end = min(start + batch_size, len(train_pairs))
            batch_users = train_users[start:end].to(device)
            batch_pos   = train_items[start:end].to(device)

            # Popularity-biased negative sampling (simple version)
            batch_neg = torch.randint(0, n_items, (end - start,), device=device)

            # Forward pass
            user_emb, item_emb = model(edge_index)

            loss = model.bpr_loss(batch_users, batch_pos, batch_neg,
                                  user_emb, item_emb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            n_batches += 1

        avg_loss = total_loss / n_batches
        if avg_loss < best_loss:
            best_loss = avg_loss
            torch.save(model.state_dict(), 'best_lightgcn.pt')

        if epoch % 10 == 0:
            print(f"Epoch {epoch:3d}  BPR Loss: {avg_loss:.4f}  Best: {best_loss:.4f}")

    return model


# ──────────────────────────────────────────────
# Evaluation: Recall@K and NDCG@K
# ──────────────────────────────────────────────

def evaluate_lightgcn(
    model: LightGCN,
    edge_index: torch.Tensor,
    test_pairs: dict,          # {user_id: [positive_item_ids]}
    train_pairs: dict,         # {user_id: [interacted_item_ids]} - to exclude
    K: int = 20,
    device: str = 'cpu',
):
    model.eval()
    model = model.to(device)
    edge_index = edge_index.to(device)

    with torch.no_grad():
        user_emb, item_emb = model(edge_index)

    recalls, ndcgs = [], []

    for user_id, pos_items in test_pairs.items():
        if not pos_items:
            continue

        u_vec = user_emb[user_id]  # [d]

        # Score all items
        scores = torch.mv(item_emb, u_vec)  # [n_items]

        # Mask out training items
        for train_item in train_pairs.get(user_id, []):
            scores[train_item] = -1e9

        # Top-K items
        topk = torch.topk(scores, K).indices.cpu().tolist()

        # Recall@K
        hits = len(set(topk) & set(pos_items))
        recall = hits / min(len(pos_items), K)
        recalls.append(recall)

        # NDCG@K
        dcg = 0.0
        for rank, item in enumerate(topk):
            if item in pos_items:
                dcg += 1.0 / np.log2(rank + 2)
        ideal_dcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(pos_items), K)))
        ndcgs.append(dcg / ideal_dcg if ideal_dcg > 0 else 0.0)

    return {
        f'Recall@{K}': float(np.mean(recalls)),
        f'NDCG@{K}':   float(np.mean(ndcgs)),
    }

Cold-Start with GNNs

Cold-start is the recommendation problem for new users or items with no interaction history. Standard LightGCN (lookup embeddings only) cannot handle cold-start - there is no embedding to retrieve for a new node.

Item cold-start solutions:

Content features as initial embeddings (PinSage approach): use image embeddings, text embeddings, or item metadata as the initial node feature $x_v$ . New items immediately get embeddings from content alone, before any interactions. The GNN then refines these with collaborative signals as interactions accumulate.
Dropout-based augmentation (DropoutNet, Volkovs et al. 2017): randomly drop collaborative signals during training, forcing the model to use content features alone. At inference time for cold items, only content features are used.
Meta-learning (MAMO, Pan et al. 2019): treat each new user/item as a few-shot learning problem. Train a meta-learner that initializes embeddings from a few interactions.

User cold-start solutions:

Session-based recommendation: use the first few clicks in a new session to compute an ad-hoc embedding via GNN message passing
Demographic features: encode user attributes (country, device, signup date) as initial node features
Popularity fallback: return globally popular items until enough interactions accumulate (typically 5-10 interactions)

Comparison: MF vs NGCF vs LightGCN vs PinSage

Dimension	MF-BPR	NGCF	LightGCN	PinSage
Multi-hop signal	Implicit	Explicit (3 hops)	Explicit (K hops)	Explicit (2 hops)
Message function	None	$W_1 h_i + W_2(h_i \odot h_u)$	None (topology only)	Weighted agg + dense
Edge features	No	No	No	Yes (content)
Cold-start items	No	No	No	Yes (content features)
Scale (max tested)	Billions	Millions	Millions	3 Billion nodes
Neighborhood sampling	N/A	Full graph	Full graph	PPR random walk
Recall@20 (Gowalla)	0.1291	0.1547	0.1830	N/A (different data)
Training complexity	$O(	E	\cdot d)$	$O(K \cdot

Production Deployment - Two-Stage Retrieval

In production, GNN-based recommendation follows a two-stage architecture:

Stage 1 - Retrieval: GNN-computed embeddings are indexed in FAISS (Facebook AI Similarity Search) with an IVF or HNSW index. At request time, the user's embedding is used to query the ANN index, retrieving the top-500 most similar items in under 10 milliseconds. GNN embeddings are recomputed offline periodically (daily or weekly) and pushed to the serving index.

Stage 2 - Reranking: The top-500 candidates from retrieval pass through a more expensive ranker with richer features: real-time user context (current session, recent clicks, time of day), item metadata, diversity constraints, business rules (no out-of-stock items, boost new items). This reranker typically runs in 50-100 milliseconds.

The GNN runs entirely offline. The online serving cost is just an ANN lookup plus the reranker - both are highly optimized.

Embedding refresh cadence:

Pinterest: nightly full re-embedding of all 3B pins
For dynamic systems: incremental updates using streaming GNNs (GraphSAINT, ROLAND)

Knowledge Graph-Enhanced Recommendation

KGCN (Wang et al., 2019) extends GNN-based recommendation by adding a knowledge graph of item attributes. Instead of only user-item interactions, the graph includes:

User-item interactions (click, purchase)
Item-attribute relationships from a KG (movie → genre → action, movie → director → Nolan)

Message passing on this enriched graph enables the model to learn that users who liked a Nolan film tend to like other Nolan films, not just films in the same genre - a more precise semantic similarity than collaborative filtering alone provides.

KG integration is especially valuable for:

Long-tail items: few interactions but rich attribute graph
Cross-domain recommendation: transfer across categories via shared KG entities
Explainability: recommendation paths through the KG provide human-readable explanations

YouTube Resources

Title	Channel	Focus
Graph Neural Networks for Recommendations (CS224W)	Stanford Online	GNN recsys theory · LightGCN · PinSage architecture
LightGCN Paper Explained	Aleksa Gordic	LightGCN derivation · ablation results · why simplicity wins
PinSage: Building Graph Convolutional Networks for Pinterest	Pinterest Engineering	PinSage engineering · curriculum training · production scale
Recommender Systems with GNNs	Weights and Biases	PyG implementation walkthrough · BPR loss
Graph-Based Recommender Systems at Scale	RecSys Conference	Industrial experience reports · engineering challenges

Common Pitfalls

:::danger Using validation Recall@K on the training graph without masking When evaluating, you must mask out all training interactions before computing top-K recommendations. If you score items a user already interacted with, you inflate Recall@K - the model trivially retrieves items it was trained on. Always filter training positives from the ranked list before computing any retrieval metric. :::

:::danger Training LightGCN with too many layers on sparse graphs LightGCN with K=3 or K=4 layers works well on dense graphs. On sparse user-item graphs (fewer than 10 interactions per user), deeper propagation causes over-smoothing: user embeddings become indistinguishable because most users share similar sparse neighborhoods. For sparse graphs, use K=1 or K=2 layers. Run ablation on layer count - don't assume more layers means better. :::

:::warning Uniform negative sampling underestimates hard cases Uniform random negatives are too easy - the model quickly learns to score positive items above random items. This leads to fast convergence on easy cases but poor performance on hard cases (similar but non-interacted items). Use popularity-biased negatives (sample proportional to $n_i^{0.75}$ ) or in-batch negatives (treat other users' positive items in the same batch as negatives for the current user). :::

:::warning Offline metrics do not predict online A/B performance reliably Recall@20 and NDCG@10 are computed from historical data. They measure the model's ability to predict held-out historical interactions - not necessarily the interactions users would make if shown the model's recommendations. Online A/B tests regularly show cases where a model with lower offline Recall@20 has higher click-through rate because it surfaces novel, serendipitous items. Always run A/B tests before declaring a GNN improvement "real." :::

Interview Q&A

Q: Why do GNNs outperform matrix factorization for collaborative filtering? What structural information do they capture that MF misses?

A: GNNs explicitly propagate embedding information along the user-item graph, capturing multi-hop collaborative filtering signals. After $K$ layers, a user's embedding aggregates information from all users within $K$ hops - including users who interacted with the same items, and items those users also liked. MF captures this only implicitly through the shared latent space: two users end up with similar embeddings if they interact with similar items, but there is no direct propagation of structural information. The practical difference shows up on long-tail items: MF fails because long-tail items have few direct interactions, but a GNN can propagate similarity information from structurally adjacent items even when direct interaction count is low. On standard benchmarks like Gowalla and Yelp2018, LightGCN improves Recall@20 by roughly 18% over MF-BPR.

Q: LightGCN removes the weight matrix and nonlinear activation from NGCF and still outperforms it. Why?

A: The ablation study in He et al. (2020) shows that each component removed from NGCF improves performance: removing nonlinearity helps, removing the feature transformation $W$ helps, and removing the interaction term $W_2(h_i \odot h_u)$ also helps. The reasons: (1) In LightGCN, the only learnable parameters are the initial embeddings - not weights in the message function. Adding weight matrices introduces redundant parameters that interact poorly with the symmetric normalization and overfit on smaller training sets. (2) Nonlinearity is unnecessary because the embeddings being propagated are already in a high-dimensional space; stacking linear propagation with linear combination of layer outputs is expressive enough. (3) The task is ranking, not regression - the model does not need to learn a complex functional form, just a good embedding space where users and items that interact are close. Simpler training objectives converge to better ranking solutions.

Q: How does PinSage scale to 3 billion nodes? What is importance-based neighborhood sampling?

A: PinSage uses three key engineering decisions. First, random walk importance sampling: instead of sampling a fixed random neighborhood, PinSage runs $r=50$ random walks of length $\ell=100$ from each target node and uses the top- $T$ most visited nodes as the neighborhood. This approximates Personalized PageRank, giving neighborhoods that are structurally more relevant than random samples. Second, MapReduce offline pipeline: Personalized PageRank neighborhoods are precomputed offline using Apache Spark for all 3B pins and cached to disk. Training reads from cached neighborhoods, never touching the full graph online. Third, content features for initial embeddings: using pretrained image and text embeddings as node features enables cold-start and means the GNN only needs to learn 2 layers of aggregation, not a full node embedding table for 3B items. Inference runs in MapReduce shards, each shard computing embeddings for a subset of pins using neighbors' cached embeddings.

Q: How do you handle the cold-start problem in GNN-based recommendation?

A: Standard LightGCN cannot handle cold-start - new users and items have no position in the interaction graph, so there is no embedding to retrieve. There are three main approaches. Content-based initialization (PinSage): use content features (image embeddings, text features, item metadata) as initial node features. New items immediately get embeddings from content before any interactions. The GNN refines these embeddings as interactions accumulate. Dropout-based regularization (DropoutNet): during training, randomly drop the collaborative signal for some users/items, forcing the model to rely on content features alone. At inference time, cold items use only content features. Meta-learning: treat new users as few-shot learning problems - train a meta-learner that initializes embeddings from 1-5 interactions. In practice, most production systems use a fallback strategy: return globally popular items for users with fewer than 5 interactions, then switch to the full GNN once enough data accumulates.

Q: Design a GNN-based recommendation system for a music streaming service with 50M users and 80M tracks. What are the key architectural decisions?

A: The key decisions and reasoning: (1) Graph construction: build a bipartite user-track interaction graph from listening history. Also add track-track edges from a knowledge graph (same artist, same album, similar audio features from a pretrained audio model). (2) Model choice: LightGCN with K=3 layers for the bipartite graph plus content features for cold-start tracks (new releases). Use contrastive learning augmentation (SimGCL) to handle the noise in listening data - a user can have a track play without actually liking it (passive listening). (3) Negative sampling: popularity-biased negatives (proportional to track play count) plus in-batch hard negatives from the current mini-batch. (4) Scalability: mini-batch training with GraphSAINT or ClusterGCN - the full user-track graph (50M × 80M) does not fit in memory. Run neighborhood sampling per mini-batch. (5) Retrieval: index track embeddings in FAISS with HNSW graph. At inference, retrieve top-500 candidates in under 5ms, rerank with a point-wise ranker that incorporates session context and time-of-day. (6) Evaluation: Recall@10, NDCG@10 on held-out listening events, plus online A/B metrics (stream completion rate, session length).

Q: What is the BPR loss and why is it preferred over binary cross-entropy for recommendation?

A: BPR (Bayesian Personalized Ranking) loss is $\mathcal{L} = \sum_{(u,i,j)} -\ln\sigma(\hat{y}_{ui} - \hat{y}_{uj})$ where $(u, i)$ is a positive pair and $(u, j)$ is a negative pair. It directly optimizes pairwise ranking: positive items should score higher than negative items. Binary cross-entropy (BCE) treats each interaction as an independent binary classification: is this user-item pair positive (1) or negative (0)? For recommendation, BCE has two problems. First, implicit feedback is noisy: a non-interaction does not mean the user dislikes the item - they may simply not have been exposed to it. BCE treats all non-interactions as negatives, introducing massive label noise. BPR only requires relative ordering, which is more robust to this noise. Second, BCE optimizes absolute scores, but recommendation only cares about relative ranking - it does not matter whether the top item scores 0.9 or 0.7, only that it scores higher than the 501st item. BPR directly optimizes this objective and converges faster for sparse implicit feedback data.

Temporal Dynamics - Handling a Changing Graph

Real recommendation graphs are dynamic. New users arrive, new items are added, interaction patterns shift with seasons and trends. A model trained on a static snapshot of the graph degrades as the graph evolves.

The Staleness Problem

LightGCN embeddings are precomputed offline and cached. If embeddings are refreshed weekly, any user who signs up on day 1 of the week has stale embeddings by day 7. For high-velocity graphs (e-commerce during sales, news recommendation, Twitter trends), weekly refresh is far too slow.

Solutions:

Daily full re-embedding: recompute all embeddings nightly. Acceptable for most use cases. Pinterest refreshes daily.
Streaming GNNs (ROLAND, 2022): maintain a persistent model that can be updated incrementally as new edges arrive. Each new interaction triggers a local embedding update for the affected user and item, without recomputing the entire graph.
Two-level embedding: keep a "base" embedding from full weekly re-training and an "incremental" embedding from recent interactions that is updated more frequently. Combine both at inference time.
Temporal graph attention (TGAT): augment graph attention with timestamps - the weight $\alpha_{vu}$ depends not just on features but on the time elapsed since the interaction. Recent interactions receive higher weight.

Graph Structure Evolution

The topology of the user-item graph also changes. Popular items accumulate edges rapidly, increasing their degree and potentially causing their embeddings to over-dominate aggregation. Monitor:

Degree distribution shift: if some items go from 100 to 100K interactions, the symmetric normalization $\frac{1}{\sqrt{d_u d_i}}$ provides some protection
New user cold-start rate: if your user base grows 5% per week, 5% of recommendations will hit the cold-start path
Item lifecycle: fashion items have short lifecycles, so temporal decay of interaction weights matters more in fashion than in music

Contrastive Learning for GNN Recommendation

A growing research direction augments BPR training with contrastive objectives to learn more robust embeddings. SimGCL (2022) adds uniform noise to graph embeddings to create augmented views and trains with InfoNCE contrastive loss alongside BPR:

$\mathcal{L}_{CL} = \sum_{u} -\log \frac{\exp(\text{sim}(z_u, z_u') / \tau)}{\sum_{v} \exp(\text{sim}(z_u, z_v') / \tau)}$

where $z_u$ and $z_u'$ are two augmented views of the same user embedding (created by adding different random noise), $\tau$ is temperature, and the sum is over all users in the batch.

The contrastive objective encourages each user's embedding to be consistent across augmentations (similar to itself) while being different from other users' embeddings. This reduces popularity bias - popular users dominate BPR loss, but contrastive loss treats all users equally.

SimGCL achieves state-of-the-art results on Gowalla, Yelp2018, and Amazon-Book benchmarks, improving LightGCN's NDCG@20 by 8-12% without any architectural changes - just better training.

Practical Evaluation Setup

Proper evaluation of a GNN-based recommender requires careful attention to data splitting:

def temporal_split(interactions, test_ratio=0.1, val_ratio=0.1):
    """
    Temporal split: use the most recent interactions for test/val.
    This simulates real deployment - the model never sees future interactions.
    Avoid random splits, which leak future data into training.
    """
    interactions = sorted(interactions, key=lambda x: x['timestamp'])
    n = len(interactions)
    n_test = int(n * test_ratio)
    n_val  = int(n * val_ratio)
    train = interactions[:n - n_test - n_val]
    val   = interactions[n - n_test - n_val : n - n_test]
    test  = interactions[n - n_test:]
    return train, val, test


def leave_one_out_split(user_interactions):
    """
    Leave-one-out: for each user, hold out the last interaction for test.
    Standard for papers but not recommended for production evaluation
    (masks temporal leakage).
    """
    train, test = {}, {}
    for user_id, items in user_interactions.items():
        if len(items) > 1:
            train[user_id] = items[:-1]
            test[user_id]  = [items[-1]]
    return train, test

:::tip Use temporal splits, not random splits Random splits are the most common evaluation mistake in recommender systems research. A random split of interactions will assign some future interactions to the training set and some past interactions to the test set - the model sees "future" data during training. This inflates all metrics and does not reflect real deployment. Always use temporal splits: train on all interactions before time $T$ , evaluate on interactions at time $T$ or later. :::

Summary

GNN-based recommendation is built on the insight that user-item interaction graphs encode multi-hop collaborative filtering signals that matrix factorization cannot explicitly leverage. The evolution from NGCF to LightGCN demonstrated a counterintuitive result: removing learned transformations and nonlinearities from the message passing layers consistently improves recommendation quality, because the task is about learning a good embedding space through graph propagation, not a complex functional mapping.

PinSage solved the engineering problem that academia ignored - how to train and serve a GNN on a graph with 3 billion nodes and 18 billion edges. The solutions (importance-based neighborhood sampling, MapReduce inference, curriculum training, content features for cold-start) became the blueprint for industrial GNN deployment.

In production, GNN embeddings serve as the backbone of two-stage retrieval systems: offline-computed embeddings are indexed in FAISS for fast ANN retrieval, followed by a lightweight online reranker that incorporates real-time context. The GNN never runs at serving time - only at embedding time, making latency requirements achievable at scale.

The next frontier in GNN recommendation is handling temporal dynamics - streaming GNN updates, temporal attention, and contrastive learning augmentation - to keep recommendation quality high as graphs evolve continuously at production scale.

Session-Based Recommendation with GNNs

Traditional collaborative filtering (LightGCN, PinSage) requires historical interactions for each user. Session-based recommendation handles a different scenario: a new or anonymous user produces a sequence of clicks in a single session, and you must recommend the next item based only on that session.

SR-GNN - Session Recommendation with GNNs

Shu Wu et al. (2019) modeled each session as a directed graph. In a session $[v_1, v_2, v_3, v_2, v_4]$ , items are nodes and transitions between consecutive items are directed edges. The edge from $v_2$ to $v_3$ gets weight 2 if the transition $v_2 \to v_3$ occurs twice in the session.

Message passing on this session graph propagates context:

$h_v^{(k+1)} = \text{GRU}\!\left(h_v^{(k)},\, \sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} h_u^{(k)}\right)$

After $K$ rounds, a soft-attention readout distinguishes the "current intent" (most recently clicked) from the "general preference" (average of all clicks). The recommendation score for each candidate item is the dot product with this session representation.

Why GNNs over RNNs for sessions: an RNN processes the sequence left-to-right and forgets distant items. A GNN on the session graph can directly propagate between any two items in the session, regardless of when they occurred. If a user clicks item $A$ at step 1, then $B, C, D$ , then $A$ again at step 5, the GNN naturally recognizes $A$ as high-importance through its high in-degree, whereas an RNN might underweight $A$ because of its early position.

Heterogeneous Graph Recommendation

Real recommendation systems have multiple node types and edge types, not just users and items. A movie platform has:

Node types: Users, Movies, Actors, Directors, Genres
Edge types: (User, watched, Movie), (Movie, starring, Actor), (Movie, directedBy, Director), (Movie, belongsTo, Genre)

Standard LightGCN cannot handle this - it assumes a homogeneous bipartite graph.

HAN - Heterogeneous Attention Network

Wang et al. (2019) introduced HAN, which performs message passing separately for each edge type (meta-path), then combines the results:

For a user, HAN computes:

Aggregation along "User-watches-Movie" edges → gets similar users via shared movies
Aggregation along "User-watches-Movie-Actor" meta-path → gets items with same actors as watched movies
Aggregation along "User-watches-Movie-Genre" meta-path → genre preferences

Each meta-path aggregation produces a separate embedding. A meta-path attention layer then weights which meta-paths are most relevant for the current user:

$h_u^{final} = \sum_{m} \beta_m \cdot h_u^{(meta-path\ m)}$

where $\beta_m$ are learned attention weights over meta-paths.

RGCN - Relational GCN

Schlichtkrüll et al. (2018) proposed RGCN, which uses separate weight matrices per relation type:

$h_v^{(k+1)} = \sigma\!\left(W_0^{(k)} h_v^{(k)} + \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{|\mathcal{N}_r(v)|} W_r^{(k)} h_u^{(k)}\right)$

Each relation type $r$ has its own weight matrix $W_r$ . Basis decomposition reduces parameters: $W_r = \sum_b a_{rb} V_b$ (each $W_r$ is a linear combination of $B$ shared basis matrices $V_b$ ). RGCN is the multi-relational MPNN equivalent.

Diversity and Serendipity in GNN Recommendation

A pure relevance-maximizing recommender will converge to showing users only items very similar to what they already liked - the "filter bubble" problem. GNN-based systems exacerbate this because graph propagation inherently clusters similar items together in embedding space.

Maximal Marginal Relevance (MMR) re-ranks the top-K retrieved items for diversity:

$\text{MMR}(i) = \lambda \cdot \text{relevance}(u, i) - (1-\lambda) \cdot \max_{j \in S} \text{sim}(i, j)$

where $S$ is the set of already-selected items and $\lambda$ balances relevance vs diversity.

Determinantal Point Processes (DPP) select a set of items that are simultaneously high-relevance and mutually diverse, modeled via the determinant of the kernel matrix:

$P(S) \propto \det(L_S)$

where $L_{ij} = \text{relevance}(u, i) \cdot k(i, j) \cdot \text{relevance}(u, j)$ and $k(i, j)$ is a similarity kernel. Maximizing the determinant selects diverse, high-relevance subsets. DPPs are used in production at Netflix and Google to avoid filter bubbles.

GNN embeddings naturally serve as the kernel for DPP: $k(i, j) = \exp(-\|e_i - e_j\|^2 / \sigma^2)$ where $e_i$ are GNN-computed item embeddings. Items that are close in GNN embedding space (high structural similarity from the interaction graph) are penalized by the DPP kernel, promoting diversity.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Embedding Space Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Limits of Matrix Factorization​

Historical Context​

Graph Construction from Implicit Feedback​

Bipartite Graph Construction​

Negative Sampling​

NGCF - Neural Graph Collaborative Filtering​

NGCF Message Function​

NGCF Update​

NGCF Final Prediction​

LightGCN - Simpler and Better​

LightGCN Propagation Rule​

LightGCN Final Embedding​

Why Does Removing Complexity Help?​

Matrix Form of LightGCN​

PinSage - Billion-Scale GNN at Pinterest​

Importance-Based Neighborhood Sampling​

PinSage Architecture​

MapReduce Pipeline​

Curriculum Training​

BPR Loss for GNN-Based Recommendation​

Full LightGCN Implementation in PyTorch Geometric​

Cold-Start with GNNs​

Comparison: MF vs NGCF vs LightGCN vs PinSage​

Production Deployment - Two-Stage Retrieval​

Knowledge Graph-Enhanced Recommendation​

YouTube Resources​

Common Pitfalls​

Interview Q&A​

Temporal Dynamics - Handling a Changing Graph​

The Staleness Problem​

Graph Structure Evolution​

Contrastive Learning for GNN Recommendation​

Practical Evaluation Setup​

Summary​

Session-Based Recommendation with GNNs​

SR-GNN - Session Recommendation with GNNs​

Heterogeneous Graph Recommendation​

HAN - Heterogeneous Attention Network​

RGCN - Relational GCN​

Diversity and Serendipity in GNN Recommendation​