Skip to main content

Why Graphs for ML

The Real Interview Moment

You are interviewing for an ML engineer role at a fintech company. The interviewer describes a fraud detection problem: each transaction has 47 features - amount, merchant category, device type, location, time. You have trained a gradient boosted tree that catches 80% of fraud with 0.1% false positives.

"Good," they say. "Now. We have a new signal. Fraudulent accounts often share devices, phone numbers, or shipping addresses with other fraudulent accounts - forming rings. How would you use that?"

You pause. This is relational information. The tree model has no concept of connections between accounts. You would need to join on device ID, aggregate features from the linked accounts, engineer "number of flagged accounts sharing this device" - laboriously, statically, and without capturing multi-hop patterns. If a fraudulent account is separated from another by two intermediate clean-looking accounts, your hand-engineered feature misses it entirely.

Graph neural networks solve this directly. A GNN operating on the account-device-address graph propagates fraud signals through connections automatically. An account linked (through a device) to (through an IP address) to a known fraudster gets a contaminated embedding within two message-passing steps. No feature engineering required. No manually specified hop depth. The model learns how far to look and what to aggregate.

This is the core promise of graph machine learning: move beyond the independent-rows assumption of tabular ML and operate directly on the structure of real-world relational systems.

Why This Exists - The Failure of Tabular ML on Relational Data

Tabular machine learning is built on a foundational assumption: each row is an independent and identically distributed (i.i.d.) sample drawn from the same distribution. The features describe what an entity is, not how it relates to other entities.

This assumption holds for many classical problems - predicting house prices given square footage and location, classifying emails given word frequencies, forecasting demand given historical sales. In these cases, the relationship between rows is irrelevant (or encoded indirectly through shared categorical variables).

But for a vast class of real-world problems, the critical information lives not in individual entities but in the connections between them. A drug molecule is not just a bag of atoms - it is a specific arrangement of atoms connected by bonds of specific types. A paper in a citation network is not just a bag of words - its authority depends on what other papers cite it and what those papers say. A user in a social network is not just a demographic profile - their purchasing behavior is shaped by their friends' purchases.

When relational information matters, tabular ML forces you to hand-engineer relational features: "number of friends who purchased item X," "average rating from users within 2 hops," "number of shared devices with known fraudsters." This engineering is manual, static, and limited to whatever hops and aggregations you thought to include. You must decide the structure of the feature before training, without any signal from the data about which relational patterns are actually predictive.

Graph ML learns these patterns automatically from the structure of the graph itself.

Graph Structure Formalism

A graph G=(V,E,X)G = (V, E, X) consists of:

  • Nodes V={v1,v2,,vn}V = \{v_1, v_2, \ldots, v_n\} - entities (users, atoms, papers, accounts). n=Vn = |V| is the number of nodes.
  • Edges EV×VE \subseteq V \times V - relationships (friendship, bond, citation, transaction). m=Em = |E| is the number of edges.
  • Node feature matrix XRn×dX \in \mathbb{R}^{n \times d} - each row xvRd\mathbf{x}_v \in \mathbb{R}^d is the feature vector of node vv.
  • Edge feature matrix XERm×deX_E \in \mathbb{R}^{m \times d_e} - optional features on each edge.
  • Graph-level label yGy_G - for graph classification tasks.
Molecular graph example (aspirin):

O

C - O - H (ester group)
|
C === C (aromatic ring)
/ \ / \
C C=C C
\ | \ /
C=C (aromatic)

Nodes: Carbon (C), Oxygen (O), Hydrogen (H)
Node features: atomic number, charge, hybridization, aromaticity flag
Edges: chemical bonds
Edge features: bond type (single/double/aromatic), bond length (Å), bond angle

The Adjacency Matrix

The adjacency matrix A{0,1}n×nA \in \{0,1\}^{n \times n} encodes connectivity:

Aij={1if edge (vi,vj)E0otherwiseA_{ij} = \begin{cases} 1 & \text{if edge } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}

For an undirected graph, AA is symmetric: Aij=AjiA_{ij} = A_{ji}.

Example - triangle graph (nodes 0, 1, 2, all connected):

A=(011101110)A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}

The degree of node ii is di=jAijd_i = \sum_j A_{ij} - the number of edges incident to node ii.

The degree matrix DD is diagonal: Dii=diD_{ii} = d_i.

The Graph Laplacian

The unnormalized graph Laplacian is:

L=DAL = D - A

For the triangle graph:

L=(211121112)L = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}

The Laplacian has elegant properties:

  • Symmetric and positive semidefinite (all eigenvalues 0\geq 0)
  • Smallest eigenvalue is 0 (with eigenvector 1/n\mathbf{1}/\sqrt{n})
  • Number of zero eigenvalues equals the number of connected components
  • Quadratic form: xLx=(i,j)E(xixj)2\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2 - measures signal variation across edges

The normalized Laplacian stabilizes eigenvalues to [0,2][0, 2]:

L=D1/2LD1/2=ID1/2AD1/2\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}

This is the foundation of spectral graph theory: the eigendecomposition L=UΛU\mathcal{L} = U\Lambda U^\top defines the "graph Fourier transform," analogous to the DFT for regular signals. Convolution in the graph frequency domain is multiplication by g(Λ)g(\Lambda).

import numpy as np

def compute_laplacian(A, normalized=True):
"""Compute graph Laplacian from adjacency matrix."""
n = A.shape[0]
degrees = A.sum(axis=1) # d_i = sum of row i
D = np.diag(degrees) # degree matrix
L = D - A # unnormalized Laplacian

if not normalized:
return L

# Normalized Laplacian: L_sym = D^{-1/2} L D^{-1/2}
d_inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(degrees), 0.0)
D_inv_sqrt = np.diag(d_inv_sqrt)
L_sym = D_inv_sqrt @ L @ D_inv_sqrt
return L_sym

# Triangle graph
A = np.array([[0,1,1],[1,0,1],[1,1,0]], dtype=float)
L = compute_laplacian(A, normalized=False)
L_sym = compute_laplacian(A, normalized=True)

# Eigenvalues
eigenvalues = np.linalg.eigvalsh(L_sym)
print("Eigenvalues of normalized Laplacian:", eigenvalues)
# [0.0, 1.0, 1.5] (always in [0, 2])

Graph Types

Real-world graphs come in many flavors. Choosing the right GNN depends on understanding which type you are dealing with.

Directed vs Undirected

Undirected: edges have no direction. If (u,v)E(u, v) \in E, then (v,u)E(v, u) \in E. Friendships, molecular bonds, protein interactions.

Directed: edges have a direction. (u,v)E(u, v) \in E does not imply (v,u)E(v, u) \in E. Twitter follows (A follows B ≠ B follows A), citations (paper A cites paper B).

The adjacency matrix of a directed graph is not symmetric. Message passing must distinguish in-edges from out-edges.

Bipartite Graphs

Two disjoint node sets UU and VV; edges only between sets, never within. User-item interaction graphs (recommendations), document-word graphs (topic modeling), drug-target networks.

G=(U,V,E),EU×VG = (U, V, E), \quad E \subseteq U \times V

Bipartite structure is exploited by specialized architectures. Many recommendation GNNs (LightGCN, NGCF) treat the user-item graph as bipartite.

Heterogeneous Graphs

Multiple node types and edge types. A knowledge graph has entities (people, places, concepts) connected by different relation types (born_in, works_at, part_of). Academic networks have papers, authors, venues, with different edge types (authored, published_at, cites).

A heterogeneous graph is defined as G=(V,E,ϕ,ψ)G = (V, E, \phi, \psi) where ϕ:VA\phi: V \to \mathcal{A} maps nodes to types and ψ:ER\psi: E \to \mathcal{R} maps edges to relation types.

Specialized architectures (HAN, RGCN, HGT) handle heterogeneous graphs by learning separate transformations per relation type.

Dynamic Graphs

Edges appear and disappear over time. A transaction network changes with every new transaction. A social network changes as friendships form and dissolve.

Dynamic GNNs (TGAT, TGN) incorporate temporal information - when was the edge created, how long ago, what is the recency weighting.

Knowledge Graphs

Directed graphs of (entity, relation, entity) triples: (Einstein, born_in, Germany), (Germany, located_in, Europe). Relation types have semantic meaning. Models like TransE, DistMult, RotatE, and GNN-based approaches learn entity embeddings that predict missing triples (knowledge graph completion).

Why CNNs Cannot Work on Graphs

CNNs revolutionized image processing by exploiting the grid structure of pixels. The key properties CNNs require:

  1. Fixed, regular topology - every interior pixel has exactly 8 neighbors in the same relative arrangement
  2. Spatial locality - nearby pixels are semantically related
  3. Translational equivariance - a feature detector valid at position (10, 10) is valid at (50, 50)
  4. Canonical ordering - the 3×3 patch around any pixel has a fixed, consistent layout

Graphs violate all four:

Grid (image): Graph:

■ ■ ■ ■ ■ A --- B
■ ■ ■ ■ ■ \ / \
■ ■ ■ ■ ■ C D
■ ■ ■ ■ ■ |
■ ■ ■ ■ ■ E

Every pixel: 8 neighbors. Degrees: A:2, B:3, C:2, D:2, E:1
Fixed 3×3 kernel applies. No kernel size fits all nodes.
Same at every location. No positional reference frame.

Variable neighborhood size: A CNN kernel has a fixed footprint (3×3 = 9 inputs). A graph node might have 1 neighbor or 10,000 neighbors. You cannot design a single kernel that handles this.

No canonical ordering: The 8 pixels around a pixel have a natural order (top-left, top, top-right, ...). The neighbors of a graph node have no natural order. Any permutation of the neighbor list must produce the same output - operations must be permutation invariant.

No shared spatial structure: CNN translational equivariance assumes the same visual pattern (edge, curve, eye) appears the same way wherever it appears in the image. In a graph, two nodes at the "same position" in different subgraphs may have completely different structural roles. The concept of "position" does not generalize.

Why this matters architecturally: Every major GNN operation replaces ordered convolution with permutation-invariant aggregation over the neighbor set:

hv(k)=σ ⁣(WAGGREGATE ⁣({hu(k1):uN(v)}))h_v^{(k)} = \sigma\!\left(W \cdot \text{AGGREGATE}\!\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right)

Sum, mean, and max are all permutation invariant. Concatenation followed by MLP is not (unless you sort, which introduces arbitrary bias).

Graph Properties and Statistics

Understanding your graph's structure before training a GNN is as important as exploring your dataset before fitting any ML model.

Degree Distribution

The degree of node vv is dv=N(v)d_v = |\mathcal{N}(v)| - the number of neighbors. Real-world graphs often follow a power-law degree distribution:

P(d)dγ,γ23P(d) \propto d^{-\gamma}, \quad \gamma \approx 2\text{–}3

This means a few "hub" nodes have enormous degree while most nodes have very few connections. Social networks, the web, and protein interaction networks all exhibit this. It has profound implications for GNNs: hub nodes will dominate naive aggregation, and degree normalization becomes essential.

Clustering Coefficient

The clustering coefficient of node vv measures how many of its neighbors are also connected to each other:

Cv={(u,w)E:u,wN(v)}dv(dv1)/2C_v = \frac{|\{(u,w) \in E : u, w \in \mathcal{N}(v)\}|}{d_v(d_v - 1)/2}

Cv=1C_v = 1 means all neighbors form a clique. Cv=0C_v = 0 means no two neighbors are connected. High clustering implies strong community structure - a useful inductive bias.

Graph Diameter

The diameter is the longest shortest path between any two nodes:

diam(G)=maxu,vVd(u,v)\text{diam}(G) = \max_{u,v \in V} d(u, v)

where d(u,v)d(u, v) is the shortest path length between uu and vv. The diameter upper-bounds the number of GNN layers needed to let any two nodes communicate. For Cora (diameter ~20), you might think you need 20 layers - but over-smoothing makes this impossible. In practice, 2–3 layers suffice because the graph has short effective diameters due to clustering.

Homophily Ratio

The homophily ratio measures how often connected nodes share the same label:

h={(u,v)E:yu=yv}Eh = \frac{|\{(u,v) \in E : y_u = y_v\}|}{|E|}

h1h \approx 1: connected nodes tend to have the same class (citation networks). h0h \approx 0: connected nodes tend to have different classes (some dating networks). High homophily makes GCN work well. Low homophily (heterophily) requires architectures like GAT or H2GCN that can distinguish own from neighbor features.

import networkx as nx
import numpy as np

def graph_statistics(G):
"""Compute key graph statistics for EDA."""
n = G.number_of_nodes()
m = G.number_of_edges()
degrees = [d for _, d in G.degree()]

print(f"Nodes: {n:,}")
print(f"Edges: {m:,}")
print(f"Density: {2*m / (n*(n-1)):.6f}")
print(f"Mean degree: {np.mean(degrees):.2f}")
print(f"Median degree: {np.median(degrees):.1f}")
print(f"Max degree: {max(degrees)}")
print(f"Avg clustering coeff: {nx.average_clustering(G):.4f}")

# Connected components
components = list(nx.connected_components(G))
print(f"Connected components: {len(components)}")
largest = max(components, key=len)
print(f"Largest component: {len(largest)} nodes ({100*len(largest)/n:.1f}%)")

# Diameter (only feasible on small graphs or largest component)
if n < 5000:
subgraph = G.subgraph(largest)
print(f"Diameter: {nx.diameter(subgraph)}")

return degrees

# Example with Cora loaded via NetworkX
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import to_networkx

dataset = Planetoid(root='/tmp/cora', name='Cora')
data = dataset[0]
G = to_networkx(data, to_undirected=True)
graph_statistics(G)

Real-World Graph Datasets

Cora - Citation Network

  • Nodes: 2,708 scientific papers
  • Edges: 5,429 citation links (undirected)
  • Node features: 1,433-dimensional bag-of-words (binary - word presence)
  • Labels: 7 topics (Neural Networks, Rule Learning, Case Based, etc.)
  • Task: node classification - predict paper topic
  • Split: 140 train / 500 val / 1000 test (the classic Kipf split)
  • Homophily: ~0.81 (high - papers tend to cite papers in the same topic)

Cora is the MNIST of graph ML. Every GNN paper reports accuracy on Cora. It is small enough to train instantly but structured enough to reveal meaningful differences between methods.

OGB Benchmarks - Open Graph Benchmark

The Open Graph Benchmark (Hu et al., 2020) provides standardized, realistic benchmarks at real-world scale:

DatasetTaskNodesEdgesFeatures
ogbn-arxivNode classification169K papers1.2M citations128-dim (Word2Vec)
ogbn-productsNode classification2.4M products61M co-purchases100-dim
ogbn-proteinsNode classification132K proteins39M protein interactions8-dim (edge features)
ogbl-collabLink prediction235K authors1.3M collaborations128-dim
ogbg-molhivGraph classification1K avg atoms/mol-9-dim atom features

OGB replaced Cora as the standard benchmark because Cora is too small to differentiate modern methods. ogbn-arxiv and ogbn-products test scalability seriously.

Reddit - Social Network Posts

  • Nodes: 232,965 Reddit posts
  • Edges: 114,615,892 (commenter connects post to another post they comment on)
  • Node features: 602-dimensional (GloVe embedding + engagement stats)
  • Labels: 41 subreddit communities
  • Task: node classification
  • Scale: too large for full-batch training - requires mini-batch with neighbor sampling

Reddit is the standard benchmark for testing scalable GNN training methods. GraphSAGE paper reported 93.9% accuracy. Full-batch GCN is infeasible here.

Yelp - Review Network

  • Nodes: 716,847 users
  • Edges: based on shared review behavior
  • Task: multi-label classification of user types
  • Challenge: severe class imbalance, large scale

Used to benchmark scalable methods (GraphSAINT, Cluster-GCN).

ML Tasks on Graphs

GNNs solve three fundamental task types, each requiring different model heads:

Node-Level Tasks

Predict a property of each node. The GNN computes a node embedding hvRdh_v \in \mathbb{R}^d for each node, then passes it through a linear classifier:

y^v=softmax(Wclshv+b)\hat{y}_v = \text{softmax}(W_{\text{cls}} h_v + b)

Loss: cross-entropy on labeled nodes only (semi-supervised setting).

Examples: paper topic classification (Cora), user type prediction (Yelp), fraud detection, protein function prediction.

Predict whether an edge should exist between two nodes. The GNN computes node embeddings, then scores each node pair:

y^uv=σ(huhv)\hat{y}_{uv} = \sigma(h_u^\top h_v) or σ(MLP([huhv]))\sigma(\text{MLP}([h_u \| h_v]))

Training: typically with contrastive loss - maximize score for existing edges, minimize for random non-edges.

Examples: friend recommendation, drug-target interaction prediction, knowledge graph completion, citation network link prediction.

Graph-Level Tasks

Predict a property of an entire graph. The GNN computes node embeddings, then pools them into a single graph embedding:

hG=READOUT({hv:vV})h_G = \text{READOUT}(\{h_v : v \in V\})

READOUT is typically sum, mean, or a more expressive differentiable pooling (DiffPool, SAGPool).

Examples: molecular toxicity prediction, drug solubility, chemical property prediction, graph-level classification.

The Neighborhood Aggregation Insight

If CNNs cannot work, what can? The key insight behind all GNNs:

A node's representation should be computed from its own features and the representations of its neighbors. Recursively.

This is called neighborhood aggregation or message passing. At layer 0, every node starts with its raw features. At each layer, every node aggregates messages from its neighbors and updates its own representation:

hv(0)=xv(raw features)h_v^{(0)} = x_v \quad \text{(raw features)}

hv(k)=σ ⁣(W(k)f ⁣(hv(k1),{hu(k1):uN(v)}))h_v^{(k)} = \sigma\!\left(W^{(k)} \cdot f\!\left(h_v^{(k-1)},\, \left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right)

After kk layers, hv(k)h_v^{(k)} encodes a summary of the kk-hop neighborhood around vv.

The crucial property: the aggregation function ff must be permutation invariant - the result cannot depend on the arbitrary ordering of the neighbor list. Sum, mean, and max are all permutation invariant.

Different GNN architectures differ in how they define this aggregation:

  • GCN: weighted average with symmetric degree normalization
  • GAT: learned attention weights (different neighbors get different importance)
  • GraphSAGE: sample + aggregate (mean, max-pool, or LSTM)
  • GIN: sum aggregation with MLP (maximally expressive)

The Weisfeiler-Lehman Test and GNN Expressivity

How expressive can a GNN be? This question has a precise answer rooted in a 1968 algorithm: the Weisfeiler-Lehman (WL) graph isomorphism test.

The WL Test

The WL test determines whether two graphs are non-isomorphic (structurally different):

  1. Assign every node the same initial label (or use node features as initial labels)
  2. At each iteration: update each node's label by hashing its current label together with the multiset of its neighbors' labels
  3. If the label distributions of two graphs ever differ, they are not isomorphic
  4. If labels stabilize without differing, the test is inconclusive (may be isomorphic)
def wl_iteration(labels, adj_list):
"""One step of the Weisfeiler-Lehman test."""
new_labels = {}
for node in labels:
# Collect own label + sorted neighbor labels
neighbor_labels = sorted([labels[n] for n in adj_list[node]])
# Hash into a new label
signature = (labels[node], tuple(neighbor_labels))
new_labels[node] = hash(signature)
return new_labels

# Two-node chain vs isolated nodes
chain = {0: [1], 1: [0]} # 0-1
isolated = {0: [], 1: []} # 0 1

labels_chain = {0: 0, 1: 0}
labels_isolated = {0: 0, 1: 0}

# After 1 WL iteration:
# Chain: node 0 gets label hash(0, (0,)), node 1 gets hash(0, (0,))
# Isolated: node 0 gets hash(0, ()), node 1 gets hash(0, ())
# Different! WL correctly identifies them as non-isomorphic.

WL and GNN Expressivity

Xu et al. (2019) proved a fundamental theorem: any GNN using sum aggregation is at most as expressive as the WL test. Specifically:

  • If two nodes have the same WL label at iteration kk, any kk-layer GNN assigns them the same embedding
  • GNNs that use sum aggregation + injective update function achieve WL-level expressivity
  • GNNs that use mean or max aggregation can be strictly less expressive than WL

Why does expressivity matter?

The WL test fails to distinguish certain pairs of non-isomorphic graphs - for example, regular graphs where every node has the same degree. Any GNN that is limited to WL expressivity will assign the same embedding to nodes in these graphs, even if they have structurally different positions.

Practical implications:

  • Mean aggregation (GCN) cannot distinguish whether a node has 2 neighbors with feature [1,0] vs 1 neighbor with feature [1,0] + 1 neighbor with feature [0,1] - they produce the same mean
  • Sum aggregation (GIN) can distinguish these cases
  • For tasks requiring structural discrimination (counting triangles, detecting ring structures in molecules), GIN outperforms GCN

The GIN Architecture

Graph Isomorphism Network (Xu et al., 2019) achieves WL-level expressivity via:

hv(k)=MLP(k) ⁣((1+ε(k))hv(k1)+uN(v)hu(k1))h_v^{(k)} = \text{MLP}^{(k)}\!\left(\left(1 + \varepsilon^{(k)}\right) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\right)

where ε\varepsilon is either a learned scalar or fixed to 0. The sum (not mean or max) over neighbors, combined with an injective MLP update, makes this maximally expressive within the WL class.

from torch_geometric.nn import GINConv
import torch.nn as nn
import torch.nn.functional as F

class GIN(nn.Module):
def __init__(self, in_channels, hidden_channels, out_channels, num_layers=5):
super().__init__()
self.convs = nn.ModuleList()
self.bns = nn.ModuleList()

for i in range(num_layers):
in_ch = in_channels if i == 0 else hidden_channels
mlp = nn.Sequential(
nn.Linear(in_ch, hidden_channels),
nn.BatchNorm1d(hidden_channels),
nn.ReLU(),
nn.Linear(hidden_channels, hidden_channels),
)
self.convs.append(GINConv(mlp, train_eps=True))
self.bns.append(nn.BatchNorm1d(hidden_channels))

self.classifier = nn.Linear(hidden_channels * num_layers, out_channels)

def forward(self, x, edge_index, batch):
from torch_geometric.nn import global_add_pool
import torch

# Collect embeddings from each layer (jumping knowledge style)
all_embeddings = []
for conv, bn in zip(self.convs, self.bns):
x = F.relu(bn(conv(x, edge_index)))
# Graph-level pooling (for graph classification)
graph_emb = global_add_pool(x, batch)
all_embeddings.append(graph_emb)

# Concatenate all layers
out = torch.cat(all_embeddings, dim=1)
return self.classifier(out)

NetworkX: Building and Exploring Graphs

Before GNNs, understand the graph with NetworkX:

import networkx as nx
import matplotlib.pyplot as plt

# Build a graph from scratch
G = nx.Graph()
G.add_nodes_from([0, 1, 2, 3, 4])
G.add_edges_from([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])

# Node attributes
nx.set_node_attributes(G, {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}, name="label")
nx.set_node_attributes(G, {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}, name="community")

# Edge weights
G[0][1]['weight'] = 0.8
G[1][3]['weight'] = 0.3

# Basic statistics
print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())
print("Degree sequence:", sorted([d for _, d in G.degree()], reverse=True))
print("Is connected:", nx.is_connected(G))
print("Average clustering:", nx.average_clustering(G))
print("Diameter:", nx.diameter(G))
print("Avg shortest path:", nx.average_shortest_path_length(G))

# Centrality measures
degree_centrality = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85)

print("\nPageRank scores:")
for node, score in sorted(pagerank.items(), key=lambda x: -x[1]):
print(f" Node {node}: {score:.4f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Draw graph
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, ax=axes[0], with_labels=True, node_color='lightblue',
node_size=800, edge_color='gray', font_size=12)
axes[0].set_title("Graph Structure")

# Degree distribution
degrees = [d for _, d in G.degree()]
axes[1].hist(degrees, bins=max(degrees)+1, color='steelblue', edgecolor='white')
axes[1].set_xlabel("Degree")
axes[1].set_ylabel("Count")
axes[1].set_title("Degree Distribution")

plt.tight_layout()
plt.savefig("graph_analysis.png", dpi=150)

Converting Between NetworkX and PyG

from torch_geometric.utils import from_networkx, to_networkx
import torch

# NetworkX → PyG Data
data = from_networkx(G, group_node_attrs=['community'])
print(data) # Data(edge_index=[2, 10], community=[5, 1])

# Add node features manually
data.x = torch.randn(5, 16) # 5 nodes, 16 features
data.y = torch.tensor([0, 0, 1, 1, 1]) # node labels

# PyG Data → NetworkX (for analysis)
from torch_geometric.datasets import Planetoid
cora = Planetoid(root='/tmp/cora', name='Cora')[0]
G_cora = to_networkx(cora, to_undirected=True)
print(f"Cora: {G_cora.number_of_nodes()} nodes, {G_cora.number_of_edges()} edges")

Scale Challenges: Billion-Node Graphs

Understanding graphs at small scale is very different from deploying GNNs on the graphs that matter in production.

Meta's Social Graph

  • Nodes: ~3 billion users + ~10 billion items
  • Edges: hundreds of billions of social connections
  • Task: content ranking, friend recommendations, abuse detection

Full adjacency matrix: impossible to store (102110^{21} bytes). Even the edge list at 64-bit precision requires dozens of petabytes. Training a GNN with full-graph forward passes is completely infeasible.

Solutions used in practice:

  • Mini-batch training with neighborhood sampling (GraphSAGE / PinSage approach)
  • Cluster-based training (partition graph into clusters, train within clusters)
  • Distributed graph engines (Facebook's Gandalf, Amazon's DGL distributed)
  • Feature-server architectures that serve node features on demand during training

LinkedIn's Economic Graph

LinkedIn's "Economic Graph" connects people, companies, jobs, schools, and skills. GNNs on this graph power job recommendations, people you may know, and skills endorsement ranking. LinkedIn uses GraphSAGE-style inductive learning so that new users (millions per day) get embeddings immediately without retraining.

Memory Layout Considerations

For a graph with n=106n = 10^6 nodes, m=107m = 10^7 edges, and d=256d = 256 features:

ComponentStorage
Node features (n×dn \times d, float32)106×256×410^6 \times 256 \times 4 = 1 GB
Edge index (2×m2 \times m, int64)2×107×82 \times 10^7 \times 8 = 160 MB
Node embeddings (n×dn \times d, float32)1 GB
Adjacency matrix (dense, float32)1012×410^{12} \times 4 = 4 TB - infeasible

Sparse storage (CSR or COO) is mandatory for all practical graph ML.

Production Applications Overview

DomainGraph setupTaskGNN typeScale
Drug discoveryAtom-bond molecular graphsMolecular property predictionMPNN, SchNetMillions of molecules
Fraud detectionAccount-device-IP-phone graphsNode classificationGraphSAGE, GATBillions of nodes (Alipay, PayPal)
RecommendationUser-item interaction graphsLink predictionPinSage, LightGCNPinterest: 3B nodes, 18B edges
Knowledge graphsEntity-relation triplesKG completionR-GCN, CompGCNWikidata: 80M items
Traffic forecastingRoad network graphsGraph regressionDCRNN, STGCNCity-scale road networks
Protein structureResidue interaction networksFunction predictionGVP-GNN, EquiGNNUniProt: 220M sequences

YouTube Resources

VideoChannelWhy Watch
Graph Neural Networks, a review of methods and applicationsStanford CS224WJure Leskovec's overview - the definitive introduction
CS224W: Machine Learning with Graphs (Lecture 1)Stanford OnlineFull course lecture with examples and motivation
An Introduction to Graph Neural NetworksAleksa GordicAnimated explainer with intuition-first approach
Graph Representation Learning - William HamiltonMontreal AI & NeuroscienceAuthor of GraphSAGE explains the field
Why Graph Neural Networks? - DeepMindDeepMindDeepMind's motivation from scientific computing angle

Production Engineering Notes

:::note Graph EDA Is Non-Negotiable Always compute basic graph statistics before choosing a GNN architecture. Degree distribution, homophily ratio, clustering coefficient, and connected component analysis all inform architecture and hyperparameter choices. A highly heterophilic graph (homophily < 0.3) needs a different approach than a homophilic citation network. :::

:::warning Memory Scales as O(n²) for Dense Representations The adjacency matrix for even a modest 100K-node graph requires 40 GB as float32. Always use sparse representations (edge_index in PyG). If someone asks you to construct an adjacency matrix for a large graph, immediately question whether a sparse alternative works instead. :::

:::danger Never Ignore Graph Diameter When Choosing Depth A common mistake: setting GNN depth to "match the diameter" so all nodes can communicate. With diameter 20, you think you need 20 layers. In practice, over-smoothing kills performance after 3–4 layers. The optimal depth is almost always 2–3 regardless of diameter. Use global pooling, jumping knowledge, or other mechanisms if you need long-range information. :::

Interview Q&A

Q1: What is the fundamental limitation of applying CNNs to graph-structured data?

CNNs require a fixed, regular topology with a canonical ordering of neighbors. Graphs have variable-degree nodes with no natural ordering of neighbors. A 3×3 CNN kernel cannot apply to a node with 47 neighbors, and even if you padded it, the ordering of those 47 neighbors is arbitrary - permuting them would give different outputs, which is wrong. Additionally, CNNs exploit translational equivariance - the same filter pattern is valid everywhere in an image - but graphs have no such spatial invariance. GNNs solve this by using permutation-invariant aggregation (sum, mean, max) over the set of neighbors rather than ordered convolutions.

Q2: Define the graph Laplacian L=DAL = D - A. Why is it important for GNNs?

The graph Laplacian L=DAL = D - A where DD is the diagonal degree matrix (Dii=jAijD_{ii} = \sum_j A_{ij}) and AA is the adjacency matrix. The Laplacian has three key properties: it is symmetric positive semidefinite (all eigenvalues 0\geq 0), its smallest eigenvalue is 0 (with eigenvector 1\mathbf{1}), and its quadratic form xLx=(i,j)E(xixj)2x^\top L x = \sum_{(i,j)\in E}(x_i - x_j)^2 measures signal variation across edges. The normalized Laplacian L=D1/2LD1/2\mathcal{L} = D^{-1/2}LD^{-1/2} defines the graph Fourier transform via eigendecomposition L=UΛU\mathcal{L} = U\Lambda U^\top. Spectral GNNs (ChebNet, GCN) derive their operations from polynomial approximations of filters in this Fourier basis. GCN's symmetric normalization D~1/2A~D~1/2\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is directly the first-order approximation of spectral convolution on this Laplacian.

Q3: Explain the Weisfeiler-Lehman test. What does it tell us about GNN expressivity?

The WL test checks graph non-isomorphism by iteratively relabeling each node with a hash of its own label and the multiset of its neighbors' labels. If two graphs' label distributions ever differ, they are non-isomorphic. Xu et al. (2019) proved that any GNN with a permutation-invariant aggregation is at most as powerful as the 1-WL test - if the WL test cannot distinguish two nodes, neither can the GNN. Furthermore, GNNs using sum aggregation with injective update functions (GIN) are exactly as powerful as WL. Mean and max aggregation are strictly weaker - they cannot distinguish, for example, a node with two neighbors of feature [1] from a node with one neighbor of feature [1] (mean gives 1 in both cases). This matters for tasks requiring counting of structural patterns like triangles, cycles, or motifs.

Q4: Name three real-world graph datasets and explain what makes each suitable for benchmarking different aspects of GNNs.

Cora (2708 nodes, 5429 edges): small citation network, high homophily (h0.81h \approx 0.81), fast to train. Tests basic GNN correctness and semi-supervised learning with very few labels (140 for 7 classes). ogbn-arxiv (169K nodes, 1.2M edges): medium-scale citation network, tests scalability beyond small graphs, uses realistic Arxiv paper data. Reddit (232K nodes, 114M edges): large-scale social network requiring mini-batch training - tests sampling-based scalability. You cannot run full-batch GCN on Reddit; it forces you to use NeighborLoader or Cluster-GCN. Together these three test different dimensions: correctness, medium-scale generalization, and production-scale training.

Q5: What is homophily and why does it affect GNN architecture choice?

Homophily ratio h={(u,v)E:yu=yv}/Eh = |\{(u,v) \in E : y_u = y_v\}| / |E| measures how often connected nodes share the same class label. High homophily (h1h \approx 1): neighbors tend to be in the same class - citation networks (same-topic papers cite each other), social networks (birds of a feather flock together). Low homophily / heterophily (h0h \approx 0): neighbors tend to differ - dating networks (opposite attract), certain bipartite graphs. GCN and most simple GNNs assume homophily implicitly: they average neighbor features, which makes sense if neighbors have similar labels. On heterophilic graphs, averaging neighbor features corrupts the node's own representation with opposing-class features. Specialized architectures (H2GCN, GPR-GNN, FAGCN) address heterophily by combining the node's own features with neighbor features at different scales or with signed attention.

Q6: Describe the three ML task types on graphs and the corresponding model head design.

Node classification: GNN computes embedding hvRdh_v \in \mathbb{R}^d for each node; head is a linear layer y^v=softmax(Whv)\hat{y}_v = \text{softmax}(Wh_v); loss is cross-entropy on labeled nodes; training uses all edges for message passing but loss on labeled subset only (semi-supervised). Link prediction: GNN computes node embeddings; head scores each candidate edge pair: y^uv=σ(huhv)\hat{y}_{uv} = \sigma(h_u^\top h_v) or σ(MLP([huhv]))\sigma(\text{MLP}([h_u \| h_v])); trained with binary cross-entropy or contrastive loss with negative sampling. Graph classification: GNN computes per-node embeddings, then READOUT (sum/mean/attention pooling) produces a graph-level vector hGh_G; head is y^G=softmax(WhG)\hat{y}_G = \text{softmax}(Wh_G); trained with standard cross-entropy on labeled graphs. The READOUT choice matters: sum is maximally expressive (GIN), mean loses count information, learned pooling (DiffPool) can capture hierarchical structure.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Graph Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.