What is graph machine learning?

When tabular data fails - graph formalism, adjacency matrix, Laplacian, graph types, real-world datasets, the Weisfeiler-Lehman test, and why CNNs cannot handle graph-structured data.

How does relational data work in practice?

Why Graphs for ML covers graph machine learning, relational data, node classification from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/graph-neural-networks/why-graphs-for-ml

What is the difference between graph machine learning and node classification?

See the full breakdown at https://engineersofai.com/docs/ml/graph-neural-networks/why-graphs-for-ml

Why Graphs for ML

The Real Interview Moment

You are interviewing for an ML engineer role at a fintech company. The interviewer describes a fraud detection problem: each transaction has 47 features - amount, merchant category, device type, location, time. You have trained a gradient boosted tree that catches 80% of fraud with 0.1% false positives.

"Good," they say. "Now. We have a new signal. Fraudulent accounts often share devices, phone numbers, or shipping addresses with other fraudulent accounts - forming rings. How would you use that?"

You pause. This is relational information. The tree model has no concept of connections between accounts. You would need to join on device ID, aggregate features from the linked accounts, engineer "number of flagged accounts sharing this device" - laboriously, statically, and without capturing multi-hop patterns. If a fraudulent account is separated from another by two intermediate clean-looking accounts, your hand-engineered feature misses it entirely.

Graph neural networks solve this directly. A GNN operating on the account-device-address graph propagates fraud signals through connections automatically. An account linked (through a device) to (through an IP address) to a known fraudster gets a contaminated embedding within two message-passing steps. No feature engineering required. No manually specified hop depth. The model learns how far to look and what to aggregate.

This is the core promise of graph machine learning: move beyond the independent-rows assumption of tabular ML and operate directly on the structure of real-world relational systems.

Why This Exists - The Failure of Tabular ML on Relational Data

Tabular machine learning is built on a foundational assumption: each row is an independent and identically distributed (i.i.d.) sample drawn from the same distribution. The features describe what an entity is, not how it relates to other entities.

This assumption holds for many classical problems - predicting house prices given square footage and location, classifying emails given word frequencies, forecasting demand given historical sales. In these cases, the relationship between rows is irrelevant (or encoded indirectly through shared categorical variables).

But for a vast class of real-world problems, the critical information lives not in individual entities but in the connections between them. A drug molecule is not just a bag of atoms - it is a specific arrangement of atoms connected by bonds of specific types. A paper in a citation network is not just a bag of words - its authority depends on what other papers cite it and what those papers say. A user in a social network is not just a demographic profile - their purchasing behavior is shaped by their friends' purchases.

When relational information matters, tabular ML forces you to hand-engineer relational features: "number of friends who purchased item X," "average rating from users within 2 hops," "number of shared devices with known fraudsters." This engineering is manual, static, and limited to whatever hops and aggregations you thought to include. You must decide the structure of the feature before training, without any signal from the data about which relational patterns are actually predictive.

Graph ML learns these patterns automatically from the structure of the graph itself.

Graph Structure Formalism

A graph $G = (V, E, X)$ consists of:

Nodes $V = \{v_1, v_2, \ldots, v_n\}$ - entities (users, atoms, papers, accounts). $n = |V|$ is the number of nodes.
Edges $E \subseteq V \times V$ - relationships (friendship, bond, citation, transaction). $m = |E|$ is the number of edges.
Node feature matrix $X \in \mathbb{R}^{n \times d}$ - each row $\mathbf{x}_v \in \mathbb{R}^d$ is the feature vector of node $v$ .
Edge feature matrix $X_E \in \mathbb{R}^{m \times d_e}$ - optional features on each edge.
Graph-level label $y_G$ - for graph classification tasks.

Molecular graph example (aspirin):

  O
  ‖
  C - O - H        (ester group)
  |
  C === C           (aromatic ring)
 / \   / \
C   C=C   C
 \  |   \ /
  C=C   (aromatic)

Nodes: Carbon (C), Oxygen (O), Hydrogen (H)
Node features: atomic number, charge, hybridization, aromaticity flag
Edges: chemical bonds
Edge features: bond type (single/double/aromatic), bond length (Å), bond angle

The Adjacency Matrix

The adjacency matrix $A \in \{0,1\}^{n \times n}$ encodes connectivity:

$A_{ij} = \begin{cases} 1 & \text{if edge } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}$

For an undirected graph, $A$ is symmetric: $A_{ij} = A_{ji}$ .

Example - triangle graph (nodes 0, 1, 2, all connected):

$A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$

The degree of node $i$ is $d_i = \sum_j A_{ij}$ - the number of edges incident to node $i$ .

The degree matrix $D$ is diagonal: $D_{ii} = d_i$ .

The Graph Laplacian

The unnormalized graph Laplacian is:

$L = D - A$

For the triangle graph:

$L = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}$

The Laplacian has elegant properties:

Symmetric and positive semidefinite (all eigenvalues $\geq 0$ )
Smallest eigenvalue is 0 (with eigenvector $\mathbf{1}/\sqrt{n}$ )
Number of zero eigenvalues equals the number of connected components
Quadratic form: $\mathbf{x}^\top L \mathbf{x} = \sum_{(i,j)\in E}(x_i - x_j)^2$ - measures signal variation across edges

The normalized Laplacian stabilizes eigenvalues to $[0, 2]$ :

$\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$

This is the foundation of spectral graph theory: the eigendecomposition $\mathcal{L} = U\Lambda U^\top$ defines the "graph Fourier transform," analogous to the DFT for regular signals. Convolution in the graph frequency domain is multiplication by $g(\Lambda)$ .

import numpy as np

def compute_laplacian(A, normalized=True):
    """Compute graph Laplacian from adjacency matrix."""
    n = A.shape[0]
    degrees = A.sum(axis=1)            # d_i = sum of row i
    D = np.diag(degrees)               # degree matrix
    L = D - A                          # unnormalized Laplacian

    if not normalized:
        return L

    # Normalized Laplacian: L_sym = D^{-1/2} L D^{-1/2}
    d_inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(degrees), 0.0)
    D_inv_sqrt = np.diag(d_inv_sqrt)
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt
    return L_sym

# Triangle graph
A = np.array([[0,1,1],[1,0,1],[1,1,0]], dtype=float)
L = compute_laplacian(A, normalized=False)
L_sym = compute_laplacian(A, normalized=True)

# Eigenvalues
eigenvalues = np.linalg.eigvalsh(L_sym)
print("Eigenvalues of normalized Laplacian:", eigenvalues)
# [0.0, 1.0, 1.5]  (always in [0, 2])

Graph Types

Real-world graphs come in many flavors. Choosing the right GNN depends on understanding which type you are dealing with.

Directed vs Undirected

Undirected: edges have no direction. If $(u, v) \in E$ , then $(v, u) \in E$ . Friendships, molecular bonds, protein interactions.

Directed: edges have a direction. $(u, v) \in E$ does not imply $(v, u) \in E$ . Twitter follows (A follows B ≠ B follows A), citations (paper A cites paper B).

The adjacency matrix of a directed graph is not symmetric. Message passing must distinguish in-edges from out-edges.

Bipartite Graphs

Two disjoint node sets $U$ and $V$ ; edges only between sets, never within. User-item interaction graphs (recommendations), document-word graphs (topic modeling), drug-target networks.

$G = (U, V, E), \quad E \subseteq U \times V$

Bipartite structure is exploited by specialized architectures. Many recommendation GNNs (LightGCN, NGCF) treat the user-item graph as bipartite.

Heterogeneous Graphs

Multiple node types and edge types. A knowledge graph has entities (people, places, concepts) connected by different relation types (born_in, works_at, part_of). Academic networks have papers, authors, venues, with different edge types (authored, published_at, cites).

A heterogeneous graph is defined as $G = (V, E, \phi, \psi)$ where $\phi: V \to \mathcal{A}$ maps nodes to types and $\psi: E \to \mathcal{R}$ maps edges to relation types.

Specialized architectures (HAN, RGCN, HGT) handle heterogeneous graphs by learning separate transformations per relation type.

Dynamic Graphs

Edges appear and disappear over time. A transaction network changes with every new transaction. A social network changes as friendships form and dissolve.

Dynamic GNNs (TGAT, TGN) incorporate temporal information - when was the edge created, how long ago, what is the recency weighting.

Knowledge Graphs

Directed graphs of (entity, relation, entity) triples: (Einstein, born_in, Germany), (Germany, located_in, Europe). Relation types have semantic meaning. Models like TransE, DistMult, RotatE, and GNN-based approaches learn entity embeddings that predict missing triples (knowledge graph completion).

Why CNNs Cannot Work on Graphs

CNNs revolutionized image processing by exploiting the grid structure of pixels. The key properties CNNs require:

Fixed, regular topology - every interior pixel has exactly 8 neighbors in the same relative arrangement
Spatial locality - nearby pixels are semantically related
Translational equivariance - a feature detector valid at position (10, 10) is valid at (50, 50)
Canonical ordering - the 3×3 patch around any pixel has a fixed, consistent layout

Graphs violate all four:

Grid (image):                Graph:

■ ■ ■ ■ ■               A --- B
■ ■ ■ ■ ■                \   / \
■ ■ ■ ■ ■                 C     D
■ ■ ■ ■ ■                       |
■ ■ ■ ■ ■                       E

Every pixel: 8 neighbors.     Degrees: A:2, B:3, C:2, D:2, E:1
Fixed 3×3 kernel applies.     No kernel size fits all nodes.
Same at every location.       No positional reference frame.

Variable neighborhood size: A CNN kernel has a fixed footprint (3×3 = 9 inputs). A graph node might have 1 neighbor or 10,000 neighbors. You cannot design a single kernel that handles this.

No canonical ordering: The 8 pixels around a pixel have a natural order (top-left, top, top-right, ...). The neighbors of a graph node have no natural order. Any permutation of the neighbor list must produce the same output - operations must be permutation invariant.

No shared spatial structure: CNN translational equivariance assumes the same visual pattern (edge, curve, eye) appears the same way wherever it appears in the image. In a graph, two nodes at the "same position" in different subgraphs may have completely different structural roles. The concept of "position" does not generalize.

Why this matters architecturally: Every major GNN operation replaces ordered convolution with permutation-invariant aggregation over the neighbor set:

$h_v^{(k)} = \sigma\!\left(W \cdot \text{AGGREGATE}\!\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right)$

Sum, mean, and max are all permutation invariant. Concatenation followed by MLP is not (unless you sort, which introduces arbitrary bias).

Graph Properties and Statistics

Understanding your graph's structure before training a GNN is as important as exploring your dataset before fitting any ML model.

Degree Distribution

The degree of node $v$ is $d_v = |\mathcal{N}(v)|$ - the number of neighbors. Real-world graphs often follow a power-law degree distribution:

$P(d) \propto d^{-\gamma}, \quad \gamma \approx 2\text{–}3$

This means a few "hub" nodes have enormous degree while most nodes have very few connections. Social networks, the web, and protein interaction networks all exhibit this. It has profound implications for GNNs: hub nodes will dominate naive aggregation, and degree normalization becomes essential.

Clustering Coefficient

The clustering coefficient of node $v$ measures how many of its neighbors are also connected to each other:

$C_v = \frac{|\{(u,w) \in E : u, w \in \mathcal{N}(v)\}|}{d_v(d_v - 1)/2}$

$C_v = 1$ means all neighbors form a clique. $C_v = 0$ means no two neighbors are connected. High clustering implies strong community structure - a useful inductive bias.

Graph Diameter

The diameter is the longest shortest path between any two nodes:

$\text{diam}(G) = \max_{u,v \in V} d(u, v)$

where $d(u, v)$ is the shortest path length between $u$ and $v$ . The diameter upper-bounds the number of GNN layers needed to let any two nodes communicate. For Cora (diameter ~20), you might think you need 20 layers - but over-smoothing makes this impossible. In practice, 2–3 layers suffice because the graph has short effective diameters due to clustering.

Homophily Ratio

The homophily ratio measures how often connected nodes share the same label:

$h = \frac{|\{(u,v) \in E : y_u = y_v\}|}{|E|}$

$h \approx 1$ : connected nodes tend to have the same class (citation networks). $h \approx 0$ : connected nodes tend to have different classes (some dating networks). High homophily makes GCN work well. Low homophily (heterophily) requires architectures like GAT or H2GCN that can distinguish own from neighbor features.

import networkx as nx
import numpy as np

def graph_statistics(G):
    """Compute key graph statistics for EDA."""
    n = G.number_of_nodes()
    m = G.number_of_edges()
    degrees = [d for _, d in G.degree()]

    print(f"Nodes: {n:,}")
    print(f"Edges: {m:,}")
    print(f"Density: {2*m / (n*(n-1)):.6f}")
    print(f"Mean degree: {np.mean(degrees):.2f}")
    print(f"Median degree: {np.median(degrees):.1f}")
    print(f"Max degree: {max(degrees)}")
    print(f"Avg clustering coeff: {nx.average_clustering(G):.4f}")

    # Connected components
    components = list(nx.connected_components(G))
    print(f"Connected components: {len(components)}")
    largest = max(components, key=len)
    print(f"Largest component: {len(largest)} nodes ({100*len(largest)/n:.1f}%)")

    # Diameter (only feasible on small graphs or largest component)
    if n < 5000:
        subgraph = G.subgraph(largest)
        print(f"Diameter: {nx.diameter(subgraph)}")

    return degrees

# Example with Cora loaded via NetworkX
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import to_networkx

dataset = Planetoid(root='/tmp/cora', name='Cora')
data = dataset[0]
G = to_networkx(data, to_undirected=True)
graph_statistics(G)

Real-World Graph Datasets

Cora - Citation Network

Nodes: 2,708 scientific papers
Edges: 5,429 citation links (undirected)
Node features: 1,433-dimensional bag-of-words (binary - word presence)
Labels: 7 topics (Neural Networks, Rule Learning, Case Based, etc.)
Task: node classification - predict paper topic
Split: 140 train / 500 val / 1000 test (the classic Kipf split)
Homophily: ~0.81 (high - papers tend to cite papers in the same topic)

Cora is the MNIST of graph ML. Every GNN paper reports accuracy on Cora. It is small enough to train instantly but structured enough to reveal meaningful differences between methods.

OGB Benchmarks - Open Graph Benchmark

The Open Graph Benchmark (Hu et al., 2020) provides standardized, realistic benchmarks at real-world scale:

Dataset	Task	Nodes	Edges	Features
ogbn-arxiv	Node classification	169K papers	1.2M citations	128-dim (Word2Vec)
ogbn-products	Node classification	2.4M products	61M co-purchases	100-dim
ogbn-proteins	Node classification	132K proteins	39M protein interactions	8-dim (edge features)
ogbl-collab	Link prediction	235K authors	1.3M collaborations	128-dim
ogbg-molhiv	Graph classification	1K avg atoms/mol	-	9-dim atom features

OGB replaced Cora as the standard benchmark because Cora is too small to differentiate modern methods. ogbn-arxiv and ogbn-products test scalability seriously.

Nodes: 232,965 Reddit posts
Edges: 114,615,892 (commenter connects post to another post they comment on)
Node features: 602-dimensional (GloVe embedding + engagement stats)
Labels: 41 subreddit communities
Task: node classification
Scale: too large for full-batch training - requires mini-batch with neighbor sampling

Reddit is the standard benchmark for testing scalable GNN training methods. GraphSAGE paper reported 93.9% accuracy. Full-batch GCN is infeasible here.

Yelp - Review Network

Nodes: 716,847 users
Edges: based on shared review behavior
Task: multi-label classification of user types
Challenge: severe class imbalance, large scale

Used to benchmark scalable methods (GraphSAINT, Cluster-GCN).

ML Tasks on Graphs

GNNs solve three fundamental task types, each requiring different model heads:

Node-Level Tasks

Predict a property of each node. The GNN computes a node embedding $h_v \in \mathbb{R}^d$ for each node, then passes it through a linear classifier:

$\hat{y}_v = \text{softmax}(W_{\text{cls}} h_v + b)$

Loss: cross-entropy on labeled nodes only (semi-supervised setting).

Examples: paper topic classification (Cora), user type prediction (Yelp), fraud detection, protein function prediction.

Edge/Link-Level Tasks

Predict whether an edge should exist between two nodes. The GNN computes node embeddings, then scores each node pair:

$\hat{y}_{uv} = \sigma(h_u^\top h_v)$ or $\sigma(\text{MLP}([h_u \| h_v]))$

Training: typically with contrastive loss - maximize score for existing edges, minimize for random non-edges.

Examples: friend recommendation, drug-target interaction prediction, knowledge graph completion, citation network link prediction.

Graph-Level Tasks

Predict a property of an entire graph. The GNN computes node embeddings, then pools them into a single graph embedding:

$h_G = \text{READOUT}(\{h_v : v \in V\})$

READOUT is typically sum, mean, or a more expressive differentiable pooling (DiffPool, SAGPool).

Examples: molecular toxicity prediction, drug solubility, chemical property prediction, graph-level classification.

The Neighborhood Aggregation Insight

If CNNs cannot work, what can? The key insight behind all GNNs:

A node's representation should be computed from its own features and the representations of its neighbors. Recursively.

This is called neighborhood aggregation or message passing. At layer 0, every node starts with its raw features. At each layer, every node aggregates messages from its neighbors and updates its own representation:

$h_v^{(0)} = x_v \quad \text{(raw features)}$

$h_v^{(k)} = \sigma\!\left(W^{(k)} \cdot f\!\left(h_v^{(k-1)},\, \left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right)$

After $k$ layers, $h_v^{(k)}$ encodes a summary of the $k$ -hop neighborhood around $v$ .

The crucial property: the aggregation function $f$ must be permutation invariant - the result cannot depend on the arbitrary ordering of the neighbor list. Sum, mean, and max are all permutation invariant.

Different GNN architectures differ in how they define this aggregation:

GCN: weighted average with symmetric degree normalization
GAT: learned attention weights (different neighbors get different importance)
GraphSAGE: sample + aggregate (mean, max-pool, or LSTM)
GIN: sum aggregation with MLP (maximally expressive)

The Weisfeiler-Lehman Test and GNN Expressivity

How expressive can a GNN be? This question has a precise answer rooted in a 1968 algorithm: the Weisfeiler-Lehman (WL) graph isomorphism test.

The WL Test

The WL test determines whether two graphs are non-isomorphic (structurally different):

Assign every node the same initial label (or use node features as initial labels)
At each iteration: update each node's label by hashing its current label together with the multiset of its neighbors' labels
If the label distributions of two graphs ever differ, they are not isomorphic
If labels stabilize without differing, the test is inconclusive (may be isomorphic)

def wl_iteration(labels, adj_list):
    """One step of the Weisfeiler-Lehman test."""
    new_labels = {}
    for node in labels:
        # Collect own label + sorted neighbor labels
        neighbor_labels = sorted([labels[n] for n in adj_list[node]])
        # Hash into a new label
        signature = (labels[node], tuple(neighbor_labels))
        new_labels[node] = hash(signature)
    return new_labels

# Two-node chain vs isolated nodes
chain = {0: [1], 1: [0]}      # 0-1
isolated = {0: [], 1: []}     # 0  1

labels_chain = {0: 0, 1: 0}
labels_isolated = {0: 0, 1: 0}

# After 1 WL iteration:
# Chain: node 0 gets label hash(0, (0,)), node 1 gets hash(0, (0,))
# Isolated: node 0 gets hash(0, ()), node 1 gets hash(0, ())
# Different! WL correctly identifies them as non-isomorphic.

WL and GNN Expressivity

Xu et al. (2019) proved a fundamental theorem: any GNN using sum aggregation is at most as expressive as the WL test. Specifically:

If two nodes have the same WL label at iteration $k$ , any $k$ -layer GNN assigns them the same embedding
GNNs that use sum aggregation + injective update function achieve WL-level expressivity
GNNs that use mean or max aggregation can be strictly less expressive than WL

Why does expressivity matter?

The WL test fails to distinguish certain pairs of non-isomorphic graphs - for example, regular graphs where every node has the same degree. Any GNN that is limited to WL expressivity will assign the same embedding to nodes in these graphs, even if they have structurally different positions.

Practical implications:

Mean aggregation (GCN) cannot distinguish whether a node has 2 neighbors with feature [1,0] vs 1 neighbor with feature [1,0] + 1 neighbor with feature [0,1] - they produce the same mean
Sum aggregation (GIN) can distinguish these cases
For tasks requiring structural discrimination (counting triangles, detecting ring structures in molecules), GIN outperforms GCN

The GIN Architecture

Graph Isomorphism Network (Xu et al., 2019) achieves WL-level expressivity via:

$h_v^{(k)} = \text{MLP}^{(k)}\!\left(\left(1 + \varepsilon^{(k)}\right) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\right)$

where $\varepsilon$ is either a learned scalar or fixed to 0. The sum (not mean or max) over neighbors, combined with an injective MLP update, makes this maximally expressive within the WL class.

from torch_geometric.nn import GINConv
import torch.nn as nn
import torch.nn.functional as F

class GIN(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=5):
        super().__init__()
        self.convs = nn.ModuleList()
        self.bns = nn.ModuleList()

        for i in range(num_layers):
            in_ch = in_channels if i == 0 else hidden_channels
            mlp = nn.Sequential(
                nn.Linear(in_ch, hidden_channels),
                nn.BatchNorm1d(hidden_channels),
                nn.ReLU(),
                nn.Linear(hidden_channels, hidden_channels),
            )
            self.convs.append(GINConv(mlp, train_eps=True))
            self.bns.append(nn.BatchNorm1d(hidden_channels))

        self.classifier = nn.Linear(hidden_channels * num_layers, out_channels)

    def forward(self, x, edge_index, batch):
        from torch_geometric.nn import global_add_pool
        import torch

        # Collect embeddings from each layer (jumping knowledge style)
        all_embeddings = []
        for conv, bn in zip(self.convs, self.bns):
            x = F.relu(bn(conv(x, edge_index)))
            # Graph-level pooling (for graph classification)
            graph_emb = global_add_pool(x, batch)
            all_embeddings.append(graph_emb)

        # Concatenate all layers
        out = torch.cat(all_embeddings, dim=1)
        return self.classifier(out)

NetworkX: Building and Exploring Graphs

Before GNNs, understand the graph with NetworkX:

import networkx as nx
import matplotlib.pyplot as plt

# Build a graph from scratch
G = nx.Graph()
G.add_nodes_from([0, 1, 2, 3, 4])
G.add_edges_from([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])

# Node attributes
nx.set_node_attributes(G, {0: "A", 1: "B", 2: "C", 3: "D", 4: "E"}, name="label")
nx.set_node_attributes(G, {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}, name="community")

# Edge weights
G[0][1]['weight'] = 0.8
G[1][3]['weight'] = 0.3

# Basic statistics
print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())
print("Degree sequence:", sorted([d for _, d in G.degree()], reverse=True))
print("Is connected:", nx.is_connected(G))
print("Average clustering:", nx.average_clustering(G))
print("Diameter:", nx.diameter(G))
print("Avg shortest path:", nx.average_shortest_path_length(G))

# Centrality measures
degree_centrality = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85)

print("\nPageRank scores:")
for node, score in sorted(pagerank.items(), key=lambda x: -x[1]):
    print(f"  Node {node}: {score:.4f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Draw graph
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, ax=axes[0], with_labels=True, node_color='lightblue',
        node_size=800, edge_color='gray', font_size=12)
axes[0].set_title("Graph Structure")

# Degree distribution
degrees = [d for _, d in G.degree()]
axes[1].hist(degrees, bins=max(degrees)+1, color='steelblue', edgecolor='white')
axes[1].set_xlabel("Degree")
axes[1].set_ylabel("Count")
axes[1].set_title("Degree Distribution")

plt.tight_layout()
plt.savefig("graph_analysis.png", dpi=150)

Converting Between NetworkX and PyG

from torch_geometric.utils import from_networkx, to_networkx
import torch

# NetworkX → PyG Data
data = from_networkx(G, group_node_attrs=['community'])
print(data)  # Data(edge_index=[2, 10], community=[5, 1])

# Add node features manually
data.x = torch.randn(5, 16)  # 5 nodes, 16 features
data.y = torch.tensor([0, 0, 1, 1, 1])  # node labels

# PyG Data → NetworkX (for analysis)
from torch_geometric.datasets import Planetoid
cora = Planetoid(root='/tmp/cora', name='Cora')[0]
G_cora = to_networkx(cora, to_undirected=True)
print(f"Cora: {G_cora.number_of_nodes()} nodes, {G_cora.number_of_edges()} edges")

Scale Challenges: Billion-Node Graphs

Understanding graphs at small scale is very different from deploying GNNs on the graphs that matter in production.

Nodes: ~3 billion users + ~10 billion items
Edges: hundreds of billions of social connections
Task: content ranking, friend recommendations, abuse detection

Full adjacency matrix: impossible to store ( $10^{21}$ bytes). Even the edge list at 64-bit precision requires dozens of petabytes. Training a GNN with full-graph forward passes is completely infeasible.

Solutions used in practice:

Mini-batch training with neighborhood sampling (GraphSAGE / PinSage approach)
Cluster-based training (partition graph into clusters, train within clusters)
Distributed graph engines (Facebook's Gandalf, Amazon's DGL distributed)
Feature-server architectures that serve node features on demand during training

LinkedIn's Economic Graph

LinkedIn's "Economic Graph" connects people, companies, jobs, schools, and skills. GNNs on this graph power job recommendations, people you may know, and skills endorsement ranking. LinkedIn uses GraphSAGE-style inductive learning so that new users (millions per day) get embeddings immediately without retraining.

Memory Layout Considerations

For a graph with $n = 10^6$ nodes, $m = 10^7$ edges, and $d = 256$ features:

Component	Storage
Node features ( $n \times d$ , float32)	$10^6 \times 256 \times 4$ = 1 GB
Edge index ( $2 \times m$ , int64)	$2 \times 10^7 \times 8$ = 160 MB
Node embeddings ( $n \times d$ , float32)	1 GB
Adjacency matrix (dense, float32)	$10^{12} \times 4$ = 4 TB - infeasible

Sparse storage (CSR or COO) is mandatory for all practical graph ML.

Production Applications Overview

Domain	Graph setup	Task	GNN type	Scale
Drug discovery	Atom-bond molecular graphs	Molecular property prediction	MPNN, SchNet	Millions of molecules
Fraud detection	Account-device-IP-phone graphs	Node classification	GraphSAGE, GAT	Billions of nodes (Alipay, PayPal)
Recommendation	User-item interaction graphs	Link prediction	PinSage, LightGCN	Pinterest: 3B nodes, 18B edges
Knowledge graphs	Entity-relation triples	KG completion	R-GCN, CompGCN	Wikidata: 80M items
Traffic forecasting	Road network graphs	Graph regression	DCRNN, STGCN	City-scale road networks
Protein structure	Residue interaction networks	Function prediction	GVP-GNN, EquiGNN	UniProt: 220M sequences

YouTube Resources

Video	Channel	Why Watch
Graph Neural Networks, a review of methods and applications	Stanford CS224W	Jure Leskovec's overview - the definitive introduction
CS224W: Machine Learning with Graphs (Lecture 1)	Stanford Online	Full course lecture with examples and motivation
An Introduction to Graph Neural Networks	Aleksa Gordic	Animated explainer with intuition-first approach
Graph Representation Learning - William Hamilton	Montreal AI & Neuroscience	Author of GraphSAGE explains the field
Why Graph Neural Networks? - DeepMind	DeepMind	DeepMind's motivation from scientific computing angle

Production Engineering Notes

:::note Graph EDA Is Non-Negotiable Always compute basic graph statistics before choosing a GNN architecture. Degree distribution, homophily ratio, clustering coefficient, and connected component analysis all inform architecture and hyperparameter choices. A highly heterophilic graph (homophily < 0.3) needs a different approach than a homophilic citation network. :::

:::warning Memory Scales as O(n²) for Dense Representations The adjacency matrix for even a modest 100K-node graph requires 40 GB as float32. Always use sparse representations (edge_index in PyG). If someone asks you to construct an adjacency matrix for a large graph, immediately question whether a sparse alternative works instead. :::

:::danger Never Ignore Graph Diameter When Choosing Depth A common mistake: setting GNN depth to "match the diameter" so all nodes can communicate. With diameter 20, you think you need 20 layers. In practice, over-smoothing kills performance after 3–4 layers. The optimal depth is almost always 2–3 regardless of diameter. Use global pooling, jumping knowledge, or other mechanisms if you need long-range information. :::

Interview Q&A

Q1: What is the fundamental limitation of applying CNNs to graph-structured data?

CNNs require a fixed, regular topology with a canonical ordering of neighbors. Graphs have variable-degree nodes with no natural ordering of neighbors. A 3×3 CNN kernel cannot apply to a node with 47 neighbors, and even if you padded it, the ordering of those 47 neighbors is arbitrary - permuting them would give different outputs, which is wrong. Additionally, CNNs exploit translational equivariance - the same filter pattern is valid everywhere in an image - but graphs have no such spatial invariance. GNNs solve this by using permutation-invariant aggregation (sum, mean, max) over the set of neighbors rather than ordered convolutions.

Q2: Define the graph Laplacian $L = D - A$ . Why is it important for GNNs?

The graph Laplacian $L = D - A$ where $D$ is the diagonal degree matrix ( $D_{ii} = \sum_j A_{ij}$ ) and $A$ is the adjacency matrix. The Laplacian has three key properties: it is symmetric positive semidefinite (all eigenvalues $\geq 0$ ), its smallest eigenvalue is 0 (with eigenvector $\mathbf{1}$ ), and its quadratic form $x^\top L x = \sum_{(i,j)\in E}(x_i - x_j)^2$ measures signal variation across edges. The normalized Laplacian $\mathcal{L} = D^{-1/2}LD^{-1/2}$ defines the graph Fourier transform via eigendecomposition $\mathcal{L} = U\Lambda U^\top$ . Spectral GNNs (ChebNet, GCN) derive their operations from polynomial approximations of filters in this Fourier basis. GCN's symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is directly the first-order approximation of spectral convolution on this Laplacian.

Q3: Explain the Weisfeiler-Lehman test. What does it tell us about GNN expressivity?

The WL test checks graph non-isomorphism by iteratively relabeling each node with a hash of its own label and the multiset of its neighbors' labels. If two graphs' label distributions ever differ, they are non-isomorphic. Xu et al. (2019) proved that any GNN with a permutation-invariant aggregation is at most as powerful as the 1-WL test - if the WL test cannot distinguish two nodes, neither can the GNN. Furthermore, GNNs using sum aggregation with injective update functions (GIN) are exactly as powerful as WL. Mean and max aggregation are strictly weaker - they cannot distinguish, for example, a node with two neighbors of feature [1] from a node with one neighbor of feature [1] (mean gives 1 in both cases). This matters for tasks requiring counting of structural patterns like triangles, cycles, or motifs.

Q4: Name three real-world graph datasets and explain what makes each suitable for benchmarking different aspects of GNNs.

Cora (2708 nodes, 5429 edges): small citation network, high homophily ( $h \approx 0.81$ ), fast to train. Tests basic GNN correctness and semi-supervised learning with very few labels (140 for 7 classes). ogbn-arxiv (169K nodes, 1.2M edges): medium-scale citation network, tests scalability beyond small graphs, uses realistic Arxiv paper data. Reddit (232K nodes, 114M edges): large-scale social network requiring mini-batch training - tests sampling-based scalability. You cannot run full-batch GCN on Reddit; it forces you to use NeighborLoader or Cluster-GCN. Together these three test different dimensions: correctness, medium-scale generalization, and production-scale training.

Q5: What is homophily and why does it affect GNN architecture choice?

Homophily ratio $h = |\{(u,v) \in E : y_u = y_v\}| / |E|$ measures how often connected nodes share the same class label. High homophily ( $h \approx 1$ ): neighbors tend to be in the same class - citation networks (same-topic papers cite each other), social networks (birds of a feather flock together). Low homophily / heterophily ( $h \approx 0$ ): neighbors tend to differ - dating networks (opposite attract), certain bipartite graphs. GCN and most simple GNNs assume homophily implicitly: they average neighbor features, which makes sense if neighbors have similar labels. On heterophilic graphs, averaging neighbor features corrupts the node's own representation with opposing-class features. Specialized architectures (H2GCN, GPR-GNN, FAGCN) address heterophily by combining the node's own features with neighbor features at different scales or with signed attention.

Q6: Describe the three ML task types on graphs and the corresponding model head design.

Node classification: GNN computes embedding $h_v \in \mathbb{R}^d$ for each node; head is a linear layer $\hat{y}_v = \text{softmax}(Wh_v)$ ; loss is cross-entropy on labeled nodes; training uses all edges for message passing but loss on labeled subset only (semi-supervised). Link prediction: GNN computes node embeddings; head scores each candidate edge pair: $\hat{y}_{uv} = \sigma(h_u^\top h_v)$ or $\sigma(\text{MLP}([h_u \| h_v]))$ ; trained with binary cross-entropy or contrastive loss with negative sampling. Graph classification: GNN computes per-node embeddings, then READOUT (sum/mean/attention pooling) produces a graph-level vector $h_G$ ; head is $\hat{y}_G = \text{softmax}(Wh_G)$ ; trained with standard cross-entropy on labeled graphs. The READOUT choice matters: sum is maximally expressive (GIN), mean loses count information, learned pooling (DiffPool) can capture hierarchical structure.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Graph Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Failure of Tabular ML on Relational Data​

Graph Structure Formalism​

The Adjacency Matrix​

The Graph Laplacian​

Graph Types​

Directed vs Undirected​

Bipartite Graphs​

Heterogeneous Graphs​

Dynamic Graphs​

Knowledge Graphs​

Why CNNs Cannot Work on Graphs​

Graph Properties and Statistics​

Degree Distribution​

Clustering Coefficient​

Graph Diameter​

Homophily Ratio​

Real-World Graph Datasets​

Cora - Citation Network​

OGB Benchmarks - Open Graph Benchmark​

Reddit - Social Network Posts​

Yelp - Review Network​

ML Tasks on Graphs​

Node-Level Tasks​

Edge/Link-Level Tasks​

Graph-Level Tasks​

The Neighborhood Aggregation Insight​

The Weisfeiler-Lehman Test and GNN Expressivity​

The WL Test​

WL and GNN Expressivity​

The GIN Architecture​

NetworkX: Building and Exploring Graphs​

Converting Between NetworkX and PyG​

Scale Challenges: Billion-Node Graphs​

Meta's Social Graph​

LinkedIn's Economic Graph​

Memory Layout Considerations​

Production Applications Overview​

YouTube Resources​

Production Engineering Notes​

Interview Q&A​