Why Graphs for ML
The Real Interview Moment
You are interviewing for an ML engineer role at a fintech company. The interviewer describes a fraud detection problem: each transaction has 47 features - amount, merchant category, device type, location, time. You have trained a gradient boosted tree that catches 80% of fraud with 0.1% false positives.
"Good," they say. "Now. We have a new signal. Fraudulent accounts often share devices, phone numbers, or shipping addresses with other fraudulent accounts - forming rings. How would you use that?"
You pause. This is relational information. The tree model has no concept of connections between accounts. You would need to join on device ID, aggregate features from the linked accounts, engineer "number of flagged accounts sharing this device" - laboriously, statically, and without capturing multi-hop patterns. If a fraudulent account is separated from another by two intermediate clean-looking accounts, your hand-engineered feature misses it entirely.
Graph neural networks solve this directly. A GNN operating on the account-device-address graph propagates fraud signals through connections automatically. An account linked (through a device) to (through an IP address) to a known fraudster gets a contaminated embedding within two message-passing steps. No feature engineering required. No manually specified hop depth. The model learns how far to look and what to aggregate.
This is the core promise of graph machine learning: move beyond the independent-rows assumption of tabular ML and operate directly on the structure of real-world relational systems.
Why This Exists - The Failure of Tabular ML on Relational Data
Tabular machine learning is built on a foundational assumption: each row is an independent and identically distributed (i.i.d.) sample drawn from the same distribution. The features describe what an entity is, not how it relates to other entities.
This assumption holds for many classical problems - predicting house prices given square footage and location, classifying emails given word frequencies, forecasting demand given historical sales. In these cases, the relationship between rows is irrelevant (or encoded indirectly through shared categorical variables).
But for a vast class of real-world problems, the critical information lives not in individual entities but in the connections between them. A drug molecule is not just a bag of atoms - it is a specific arrangement of atoms connected by bonds of specific types. A paper in a citation network is not just a bag of words - its authority depends on what other papers cite it and what those papers say. A user in a social network is not just a demographic profile - their purchasing behavior is shaped by their friends' purchases.
When relational information matters, tabular ML forces you to hand-engineer relational features: "number of friends who purchased item X," "average rating from users within 2 hops," "number of shared devices with known fraudsters." This engineering is manual, static, and limited to whatever hops and aggregations you thought to include. You must decide the structure of the feature before training, without any signal from the data about which relational patterns are actually predictive.
Graph ML learns these patterns automatically from the structure of the graph itself.
Graph Structure Formalism
A graph consists of:
- Nodes - entities (users, atoms, papers, accounts). is the number of nodes.
- Edges - relationships (friendship, bond, citation, transaction). is the number of edges.
- Node feature matrix - each row is the feature vector of node .
- Edge feature matrix - optional features on each edge.
- Graph-level label - for graph classification tasks.
Molecular graph example (aspirin):
O
‖
C - O - H (ester group)
|
C === C (aromatic ring)
/ \ / \
C C=C C
\ | \ /
C=C (aromatic)
Nodes: Carbon (C), Oxygen (O), Hydrogen (H)
Node features: atomic number, charge, hybridization, aromaticity flag
Edges: chemical bonds
Edge features: bond type (single/double/aromatic), bond length (Å), bond angle
The Adjacency Matrix
The adjacency matrix encodes connectivity:
For an undirected graph, is symmetric: .
Example - triangle graph (nodes 0, 1, 2, all connected):
The degree of node is - the number of edges incident to node .
The degree matrix is diagonal: .
The Graph Laplacian
The unnormalized graph Laplacian is:
For the triangle graph:
The Laplacian has elegant properties:
- Symmetric and positive semidefinite (all eigenvalues )
- Smallest eigenvalue is 0 (with eigenvector )
- Number of zero eigenvalues equals the number of connected components
- Quadratic form: - measures signal variation across edges
The normalized Laplacian stabilizes eigenvalues to :
This is the foundation of spectral graph theory: the eigendecomposition defines the "graph Fourier transform," analogous to the DFT for regular signals. Convolution in the graph frequency domain is multiplication by .
import numpy as np
def compute_laplacian(A, normalized=True):
"""Compute graph Laplacian from adjacency matrix."""
n = A.shape[0]
degrees = A.sum(axis=1) # d_i = sum of row i
D = np.diag(degrees) # degree matrix
L = D - A # unnormalized Laplacian
if not normalized:
return L
# Normalized Laplacian: L_sym = D^{-1/2} L D^{-1/2}
d_inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(degrees), 0.0)
D_inv_sqrt = np.diag(d_inv_sqrt)
L_sym = D_inv_sqrt @ L @ D_inv_sqrt
return L_sym
# Triangle graph
A = np.array([[0,1,1],[1,0,1],[1,1,0]], dtype=float)
L = compute_laplacian(A, normalized=False)
L_sym = compute_laplacian(A, normalized=True)
# Eigenvalues
eigenvalues = np.linalg.eigvalsh(L_sym)
print("Eigenvalues of normalized Laplacian:", eigenvalues)
# [0.0, 1.0, 1.5] (always in [0, 2])
Graph Types
Real-world graphs come in many flavors. Choosing the right GNN depends on understanding which type you are dealing with.
Directed vs Undirected
Undirected: edges have no direction. If , then . Friendships, molecular bonds, protein interactions.
Directed: edges have a direction. does not imply . Twitter follows (A follows B ≠ B follows A), citations (paper A cites paper B).
The adjacency matrix of a directed graph is not symmetric. Message passing must distinguish in-edges from out-edges.
Bipartite Graphs
Two disjoint node sets and ; edges only between sets, never within. User-item interaction graphs (recommendations), document-word graphs (topic modeling), drug-target networks.
Bipartite structure is exploited by specialized architectures. Many recommendation GNNs (LightGCN, NGCF) treat the user-item graph as bipartite.
Heterogeneous Graphs
Multiple node types and edge types. A knowledge graph has entities (people, places, concepts) connected by different relation types (born_in, works_at, part_of). Academic networks have papers, authors, venues, with different edge types (authored, published_at, cites).
A heterogeneous graph is defined as where maps nodes to types and maps edges to relation types.
Specialized architectures (HAN, RGCN, HGT) handle heterogeneous graphs by learning separate transformations per relation type.
Dynamic Graphs
Edges appear and disappear over time. A transaction network changes with every new transaction. A social network changes as friendships form and dissolve.
Dynamic GNNs (TGAT, TGN) incorporate temporal information - when was the edge created, how long ago, what is the recency weighting.
Knowledge Graphs
Directed graphs of (entity, relation, entity) triples: (Einstein, born_in, Germany), (Germany, located_in, Europe). Relation types have semantic meaning. Models like TransE, DistMult, RotatE, and GNN-based approaches learn entity embeddings that predict missing triples (knowledge graph completion).
Why CNNs Cannot Work on Graphs
CNNs revolutionized image processing by exploiting the grid structure of pixels. The key properties CNNs require:
- Fixed, regular topology - every interior pixel has exactly 8 neighbors in the same relative arrangement
- Spatial locality - nearby pixels are semantically related
- Translational equivariance - a feature detector valid at position (10, 10) is valid at (50, 50)
- Canonical ordering - the 3×3 patch around any pixel has a fixed, consistent layout
Graphs violate all four:
Grid (image): Graph:
■ ■ ■ ■ ■ A --- B
■ ■ ■ ■ ■ \ / \
■ ■ ■ ■ ■ C D
■ ■ ■ ■ ■ |
■ ■ ■ ■ ■ E
Every pixel: 8 neighbors. Degrees: A:2, B:3, C:2, D:2, E:1
Fixed 3×3 kernel applies. No kernel size fits all nodes.
Same at every location. No positional reference frame.
Variable neighborhood size: A CNN kernel has a fixed footprint (3×3 = 9 inputs). A graph node might have 1 neighbor or 10,000 neighbors. You cannot design a single kernel that handles this.
No canonical ordering: The 8 pixels around a pixel have a natural order (top-left, top, top-right, ...). The neighbors of a graph node have no natural order. Any permutation of the neighbor list must produce the same output - operations must be permutation invariant.
No shared spatial structure: CNN translational equivariance assumes the same visual pattern (edge, curve, eye) appears the same way wherever it appears in the image. In a graph, two nodes at the "same position" in different subgraphs may have completely different structural roles. The concept of "position" does not generalize.
Why this matters architecturally: Every major GNN operation replaces ordered convolution with permutation-invariant aggregation over the neighbor set:
Sum, mean, and max are all permutation invariant. Concatenation followed by MLP is not (unless you sort, which introduces arbitrary bias).
Graph Properties and Statistics
Understanding your graph's structure before training a GNN is as important as exploring your dataset before fitting any ML model.
Degree Distribution
The degree of node is - the number of neighbors. Real-world graphs often follow a power-law degree distribution:
This means a few "hub" nodes have enormous degree while most nodes have very few connections. Social networks, the web, and protein interaction networks all exhibit this. It has profound implications for GNNs: hub nodes will dominate naive aggregation, and degree normalization becomes essential.
Clustering Coefficient
The clustering coefficient of node measures how many of its neighbors are also connected to each other:
means all neighbors form a clique. means no two neighbors are connected. High clustering implies strong community structure - a useful inductive bias.
Graph Diameter
The diameter is the longest shortest path between any two nodes:
where is the shortest path length between and . The diameter upper-bounds the number of GNN layers needed to let any two nodes communicate. For Cora (diameter ~20), you might think you need 20 layers - but over-smoothing makes this impossible. In practice, 2–3 layers suffice because the graph has short effective diameters due to clustering.
Homophily Ratio
The homophily ratio measures how often connected nodes share the same label:
: connected nodes tend to have the same class (citation networks). : connected nodes tend to have different classes (some dating networks). High homophily makes GCN work well. Low homophily (heterophily) requires architectures like GAT or H2GCN that can distinguish own from neighbor features.
import networkx as nx
import numpy as np
def graph_statistics(G):
"""Compute key graph statistics for EDA."""
n = G.number_of_nodes()
m = G.number_of_edges()
degrees = [d for _, d in G.degree()]
print(f"Nodes: {n:,}")
print(f"Edges: {m:,}")
print(f"Density: {2*m / (n*(n-1)):.6f}")
print(f"Mean degree: {np.mean(degrees):.2f}")
print(f"Median degree: {np.median(degrees):.1f}")
print(f"Max degree: {max(degrees)}")
print(f"Avg clustering coeff: {nx.average_clustering(G):.4f}")
# Connected components
components = list(nx.connected_components(G))
print(f"Connected components: {len(components)}")
largest = max(components, key=len)
print(f"Largest component: {len(largest)} nodes ({100*len(largest)/n:.1f}%)")
# Diameter (only feasible on small graphs or largest component)
if n < 5000:
subgraph = G.subgraph(largest)
print(f"Diameter: {nx.diameter(subgraph)}")
return degrees
# Example with Cora loaded via NetworkX
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import to_networkx
dataset = Planetoid(root='/tmp/cora', name='Cora')
data = dataset[0]
G = to_networkx(data, to_undirected=True)
graph_statistics(G)
Real-World Graph Datasets
Cora - Citation Network
- Nodes: 2,708 scientific papers
- Edges: 5,429 citation links (undirected)
- Node features: 1,433-dimensional bag-of-words (binary - word presence)
- Labels: 7 topics (Neural Networks, Rule Learning, Case Based, etc.)
- Task: node classification - predict paper topic
- Split: 140 train / 500 val / 1000 test (the classic Kipf split)
- Homophily: ~0.81 (high - papers tend to cite papers in the same topic)
Cora is the MNIST of graph ML. Every GNN paper reports accuracy on Cora. It is small enough to train instantly but structured enough to reveal meaningful differences between methods.
OGB Benchmarks - Open Graph Benchmark
The Open Graph Benchmark (Hu et al., 2020) provides standardized, realistic benchmarks at real-world scale:
| Dataset | Task | Nodes | Edges | Features |
|---|---|---|---|---|
| ogbn-arxiv | Node classification | 169K papers | 1.2M citations | 128-dim (Word2Vec) |
| ogbn-products | Node classification | 2.4M products | 61M co-purchases | 100-dim |
| ogbn-proteins | Node classification | 132K proteins | 39M protein interactions | 8-dim (edge features) |
| ogbl-collab | Link prediction | 235K authors | 1.3M collaborations | 128-dim |
| ogbg-molhiv | Graph classification | 1K avg atoms/mol | - | 9-dim atom features |
OGB replaced Cora as the standard benchmark because Cora is too small to differentiate modern methods. ogbn-arxiv and ogbn-products test scalability seriously.
Reddit - Social Network Posts
- Nodes: 232,965 Reddit posts
- Edges: 114,615,892 (commenter connects post to another post they comment on)
- Node features: 602-dimensional (GloVe embedding + engagement stats)
- Labels: 41 subreddit communities
- Task: node classification
- Scale: too large for full-batch training - requires mini-batch with neighbor sampling
Reddit is the standard benchmark for testing scalable GNN training methods. GraphSAGE paper reported 93.9% accuracy. Full-batch GCN is infeasible here.
Yelp - Review Network
- Nodes: 716,847 users
- Edges: based on shared review behavior
- Task: multi-label classification of user types
- Challenge: severe class imbalance, large scale
Used to benchmark scalable methods (GraphSAINT, Cluster-GCN).
ML Tasks on Graphs
GNNs solve three fundamental task types, each requiring different model heads:
Node-Level Tasks
Predict a property of each node. The GNN computes a node embedding for each node, then passes it through a linear classifier:
Loss: cross-entropy on labeled nodes only (semi-supervised setting).
Examples: paper topic classification (Cora), user type prediction (Yelp), fraud detection, protein function prediction.
Edge/Link-Level Tasks
Predict whether an edge should exist between two nodes. The GNN computes node embeddings, then scores each node pair:
or
Training: typically with contrastive loss - maximize score for existing edges, minimize for random non-edges.
Examples: friend recommendation, drug-target interaction prediction, knowledge graph completion, citation network link prediction.
Graph-Level Tasks
Predict a property of an entire graph. The GNN computes node embeddings, then pools them into a single graph embedding:
READOUT is typically sum, mean, or a more expressive differentiable pooling (DiffPool, SAGPool).
Examples: molecular toxicity prediction, drug solubility, chemical property prediction, graph-level classification.
The Neighborhood Aggregation Insight
If CNNs cannot work, what can? The key insight behind all GNNs:
A node's representation should be computed from its own features and the representations of its neighbors. Recursively.
This is called neighborhood aggregation or message passing. At layer 0, every node starts with its raw features. At each layer, every node aggregates messages from its neighbors and updates its own representation:
After layers, encodes a summary of the -hop neighborhood around .
The crucial property: the aggregation function must be permutation invariant - the result cannot depend on the arbitrary ordering of the neighbor list. Sum, mean, and max are all permutation invariant.
Different GNN architectures differ in how they define this aggregation:
- GCN: weighted average with symmetric degree normalization
- GAT: learned attention weights (different neighbors get different importance)
- GraphSAGE: sample + aggregate (mean, max-pool, or LSTM)
- GIN: sum aggregation with MLP (maximally expressive)
The Weisfeiler-Lehman Test and GNN Expressivity
How expressive can a GNN be? This question has a precise answer rooted in a 1968 algorithm: the Weisfeiler-Lehman (WL) graph isomorphism test.
The WL Test
The WL test determines whether two graphs are non-isomorphic (structurally different):
- Assign every node the same initial label (or use node features as initial labels)
- At each iteration: update each node's label by hashing its current label together with the multiset of its neighbors' labels
- If the label distributions of two graphs ever differ, they are not isomorphic
- If labels stabilize without differing, the test is inconclusive (may be isomorphic)
def wl_iteration(labels, adj_list):
"""One step of the Weisfeiler-Lehman test."""
new_labels = {}
for node in labels:
# Collect own label + sorted neighbor labels
neighbor_labels = sorted([labels[n] for n in adj_list[node]])
# Hash into a new label
signature = (labels[node], tuple(neighbor_labels))
new_labels[node] = hash(signature)
return new_labels
# Two-node chain vs isolated nodes
chain = {0: [1], 1: [0]} # 0-1
isolated = {0: [], 1: []} # 0 1
labels_chain = {0: 0, 1: 0}
labels_isolated = {0: 0, 1: 0}
# After 1 WL iteration:
# Chain: node 0 gets label hash(0, (0,)), node 1 gets hash(0, (0,))
# Isolated: node 0 gets hash(0, ()), node 1 gets hash(0, ())
# Different! WL correctly identifies them as non-isomorphic.
WL and GNN Expressivity
Xu et al. (2019) proved a fundamental theorem: any GNN using sum aggregation is at most as expressive as the WL test. Specifically:
- If two nodes have the same WL label at iteration , any -layer GNN assigns them the same embedding
- GNNs that use sum aggregation + injective update function achieve WL-level expressivity
- GNNs that use mean or max aggregation can be strictly less expressive than WL
Why does expressivity matter?
The WL test fails to distinguish certain pairs of non-isomorphic graphs - for example, regular graphs where every node has the same degree. Any GNN that is limited to WL expressivity will assign the same embedding to nodes in these graphs, even if they have structurally different positions.
Practical implications:
- Mean aggregation (GCN) cannot distinguish whether a node has 2 neighbors with feature [1,0] vs 1 neighbor with feature [1,0] + 1 neighbor with feature [0,1] - they produce the same mean
- Sum aggregation (GIN) can distinguish these cases
- For tasks requiring structural discrimination (counting triangles, detecting ring structures in molecules), GIN outperforms GCN
The GIN Architecture
Graph Isomorphism Network (Xu et al., 2019) achieves WL-level expressivity via:
where is either a learned scalar or fixed to 0. The sum (not mean or max) over neighbors, combined with an injective MLP update, makes this maximally expressive within the WL class.
from torch_geometric.nn import GINConv
import torch.nn as nn
import torch.nn.functional as F
class GIN(nn.Module):
def __init__(self, in_channels, hidden_channels, out_channels, num_layers=5):
super().__init__()
self.convs = nn.ModuleList()
self.bns = nn.ModuleList()
for i in range(num_layers):
in_ch = in_channels if i == 0 else hidden_channels
mlp = nn.Sequential(
nn.Linear(in_ch, hidden_channels),
nn.BatchNorm1d(hidden_channels),
nn.ReLU(),
nn.Linear(hidden_channels, hidden_channels),
)
self.convs.append(GINConv(mlp, train_eps=True))
self.bns.append(nn.BatchNorm1d(hidden_channels))
self.classifier = nn.Linear(hidden_channels * num_layers, out_channels)
def forward(self, x, edge_index, batch):
from torch_geometric.nn import global_add_pool
import torch
# Collect embeddings from each layer (jumping knowledge style)
all_embeddings = []
for conv, bn in zip(self.convs, self.bns):
x = F.relu(bn(conv(x, edge_index)))
# Graph-level pooling (for graph classification)
graph_emb = global_add_pool(x, batch)
all_embeddings.append(graph_emb)
# Concatenate all layers
out = torch.cat(all_embeddings, dim=1)
return self.classifier(out)
NetworkX: Building and Exploring Graphs
Before GNNs, understand the graph with NetworkX:
import networkx as nx
import matplotlib.pyplot as plt
# Build a graph from scratch
G = nx.Graph()
G.add_nodes_from([0, 1, 2, 3, 4])
G.add_edges_from([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
# Node attributes
nx.set_node_attributes(G, {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}, name="label")
nx.set_node_attributes(G, {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}, name="community")
# Edge weights
G[0][1]['weight'] = 0.8
G[1][3]['weight'] = 0.3
# Basic statistics
print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())
print("Degree sequence:", sorted([d for _, d in G.degree()], reverse=True))
print("Is connected:", nx.is_connected(G))
print("Average clustering:", nx.average_clustering(G))
print("Diameter:", nx.diameter(G))
print("Avg shortest path:", nx.average_shortest_path_length(G))
# Centrality measures
degree_centrality = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85)
print("\nPageRank scores:")
for node, score in sorted(pagerank.items(), key=lambda x: -x[1]):
print(f" Node {node}: {score:.4f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Draw graph
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, ax=axes[0], with_labels=True, node_color='lightblue',
node_size=800, edge_color='gray', font_size=12)
axes[0].set_title("Graph Structure")
# Degree distribution
degrees = [d for _, d in G.degree()]
axes[1].hist(degrees, bins=max(degrees)+1, color='steelblue', edgecolor='white')
axes[1].set_xlabel("Degree")
axes[1].set_ylabel("Count")
axes[1].set_title("Degree Distribution")
plt.tight_layout()
plt.savefig("graph_analysis.png", dpi=150)
Converting Between NetworkX and PyG
from torch_geometric.utils import from_networkx, to_networkx
import torch
# NetworkX → PyG Data
data = from_networkx(G, group_node_attrs=['community'])
print(data) # Data(edge_index=[2, 10], community=[5, 1])
# Add node features manually
data.x = torch.randn(5, 16) # 5 nodes, 16 features
data.y = torch.tensor([0, 0, 1, 1, 1]) # node labels
# PyG Data → NetworkX (for analysis)
from torch_geometric.datasets import Planetoid
cora = Planetoid(root='/tmp/cora', name='Cora')[0]
G_cora = to_networkx(cora, to_undirected=True)
print(f"Cora: {G_cora.number_of_nodes()} nodes, {G_cora.number_of_edges()} edges")
Scale Challenges: Billion-Node Graphs
Understanding graphs at small scale is very different from deploying GNNs on the graphs that matter in production.
Meta's Social Graph
- Nodes: ~3 billion users + ~10 billion items
- Edges: hundreds of billions of social connections
- Task: content ranking, friend recommendations, abuse detection
Full adjacency matrix: impossible to store ( bytes). Even the edge list at 64-bit precision requires dozens of petabytes. Training a GNN with full-graph forward passes is completely infeasible.
Solutions used in practice:
- Mini-batch training with neighborhood sampling (GraphSAGE / PinSage approach)
- Cluster-based training (partition graph into clusters, train within clusters)
- Distributed graph engines (Facebook's Gandalf, Amazon's DGL distributed)
- Feature-server architectures that serve node features on demand during training
LinkedIn's Economic Graph
LinkedIn's "Economic Graph" connects people, companies, jobs, schools, and skills. GNNs on this graph power job recommendations, people you may know, and skills endorsement ranking. LinkedIn uses GraphSAGE-style inductive learning so that new users (millions per day) get embeddings immediately without retraining.
Memory Layout Considerations
For a graph with nodes, edges, and features:
| Component | Storage |
|---|---|
| Node features (, float32) | = 1 GB |
| Edge index (, int64) | = 160 MB |
| Node embeddings (, float32) | 1 GB |
| Adjacency matrix (dense, float32) | = 4 TB - infeasible |
Sparse storage (CSR or COO) is mandatory for all practical graph ML.
Production Applications Overview
| Domain | Graph setup | Task | GNN type | Scale |
|---|---|---|---|---|
| Drug discovery | Atom-bond molecular graphs | Molecular property prediction | MPNN, SchNet | Millions of molecules |
| Fraud detection | Account-device-IP-phone graphs | Node classification | GraphSAGE, GAT | Billions of nodes (Alipay, PayPal) |
| Recommendation | User-item interaction graphs | Link prediction | PinSage, LightGCN | Pinterest: 3B nodes, 18B edges |
| Knowledge graphs | Entity-relation triples | KG completion | R-GCN, CompGCN | Wikidata: 80M items |
| Traffic forecasting | Road network graphs | Graph regression | DCRNN, STGCN | City-scale road networks |
| Protein structure | Residue interaction networks | Function prediction | GVP-GNN, EquiGNN | UniProt: 220M sequences |
YouTube Resources
| Video | Channel | Why Watch |
|---|---|---|
| Graph Neural Networks, a review of methods and applications | Stanford CS224W | Jure Leskovec's overview - the definitive introduction |
| CS224W: Machine Learning with Graphs (Lecture 1) | Stanford Online | Full course lecture with examples and motivation |
| An Introduction to Graph Neural Networks | Aleksa Gordic | Animated explainer with intuition-first approach |
| Graph Representation Learning - William Hamilton | Montreal AI & Neuroscience | Author of GraphSAGE explains the field |
| Why Graph Neural Networks? - DeepMind | DeepMind | DeepMind's motivation from scientific computing angle |
Production Engineering Notes
:::note Graph EDA Is Non-Negotiable Always compute basic graph statistics before choosing a GNN architecture. Degree distribution, homophily ratio, clustering coefficient, and connected component analysis all inform architecture and hyperparameter choices. A highly heterophilic graph (homophily < 0.3) needs a different approach than a homophilic citation network. :::
:::warning Memory Scales as O(n²) for Dense Representations The adjacency matrix for even a modest 100K-node graph requires 40 GB as float32. Always use sparse representations (edge_index in PyG). If someone asks you to construct an adjacency matrix for a large graph, immediately question whether a sparse alternative works instead. :::
:::danger Never Ignore Graph Diameter When Choosing Depth A common mistake: setting GNN depth to "match the diameter" so all nodes can communicate. With diameter 20, you think you need 20 layers. In practice, over-smoothing kills performance after 3–4 layers. The optimal depth is almost always 2–3 regardless of diameter. Use global pooling, jumping knowledge, or other mechanisms if you need long-range information. :::
Interview Q&A
Q1: What is the fundamental limitation of applying CNNs to graph-structured data?
CNNs require a fixed, regular topology with a canonical ordering of neighbors. Graphs have variable-degree nodes with no natural ordering of neighbors. A 3×3 CNN kernel cannot apply to a node with 47 neighbors, and even if you padded it, the ordering of those 47 neighbors is arbitrary - permuting them would give different outputs, which is wrong. Additionally, CNNs exploit translational equivariance - the same filter pattern is valid everywhere in an image - but graphs have no such spatial invariance. GNNs solve this by using permutation-invariant aggregation (sum, mean, max) over the set of neighbors rather than ordered convolutions.
Q2: Define the graph Laplacian . Why is it important for GNNs?
The graph Laplacian where is the diagonal degree matrix () and is the adjacency matrix. The Laplacian has three key properties: it is symmetric positive semidefinite (all eigenvalues ), its smallest eigenvalue is 0 (with eigenvector ), and its quadratic form measures signal variation across edges. The normalized Laplacian defines the graph Fourier transform via eigendecomposition . Spectral GNNs (ChebNet, GCN) derive their operations from polynomial approximations of filters in this Fourier basis. GCN's symmetric normalization is directly the first-order approximation of spectral convolution on this Laplacian.
Q3: Explain the Weisfeiler-Lehman test. What does it tell us about GNN expressivity?
The WL test checks graph non-isomorphism by iteratively relabeling each node with a hash of its own label and the multiset of its neighbors' labels. If two graphs' label distributions ever differ, they are non-isomorphic. Xu et al. (2019) proved that any GNN with a permutation-invariant aggregation is at most as powerful as the 1-WL test - if the WL test cannot distinguish two nodes, neither can the GNN. Furthermore, GNNs using sum aggregation with injective update functions (GIN) are exactly as powerful as WL. Mean and max aggregation are strictly weaker - they cannot distinguish, for example, a node with two neighbors of feature [1] from a node with one neighbor of feature [1] (mean gives 1 in both cases). This matters for tasks requiring counting of structural patterns like triangles, cycles, or motifs.
Q4: Name three real-world graph datasets and explain what makes each suitable for benchmarking different aspects of GNNs.
Cora (2708 nodes, 5429 edges): small citation network, high homophily (), fast to train. Tests basic GNN correctness and semi-supervised learning with very few labels (140 for 7 classes). ogbn-arxiv (169K nodes, 1.2M edges): medium-scale citation network, tests scalability beyond small graphs, uses realistic Arxiv paper data. Reddit (232K nodes, 114M edges): large-scale social network requiring mini-batch training - tests sampling-based scalability. You cannot run full-batch GCN on Reddit; it forces you to use NeighborLoader or Cluster-GCN. Together these three test different dimensions: correctness, medium-scale generalization, and production-scale training.
Q5: What is homophily and why does it affect GNN architecture choice?
Homophily ratio measures how often connected nodes share the same class label. High homophily (): neighbors tend to be in the same class - citation networks (same-topic papers cite each other), social networks (birds of a feather flock together). Low homophily / heterophily (): neighbors tend to differ - dating networks (opposite attract), certain bipartite graphs. GCN and most simple GNNs assume homophily implicitly: they average neighbor features, which makes sense if neighbors have similar labels. On heterophilic graphs, averaging neighbor features corrupts the node's own representation with opposing-class features. Specialized architectures (H2GCN, GPR-GNN, FAGCN) address heterophily by combining the node's own features with neighbor features at different scales or with signed attention.
Q6: Describe the three ML task types on graphs and the corresponding model head design.
Node classification: GNN computes embedding for each node; head is a linear layer ; loss is cross-entropy on labeled nodes; training uses all edges for message passing but loss on labeled subset only (semi-supervised). Link prediction: GNN computes node embeddings; head scores each candidate edge pair: or ; trained with binary cross-entropy or contrastive loss with negative sampling. Graph classification: GNN computes per-node embeddings, then READOUT (sum/mean/attention pooling) produces a graph-level vector ; head is ; trained with standard cross-entropy on labeled graphs. The READOUT choice matters: sum is maximally expressive (GIN), mean loses count information, learned pooling (DiffPool) can capture hierarchical structure.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Graph Explorer demo on the EngineersOfAI Playground - no code required.
:::
