Skip to main content

Graph Representations - Adjacency Matrix, Edge List, and ML Tradeoffs

Reading time: ~20 minutes | Level: Graph Theory → ML Engineering

Your GNN training run is using 40 GB of memory for a graph with only 100,000 nodes. The culprit: you stored the adjacency matrix as dense float32 - 105×105×410^5 \times 10^5 \times 4 bytes = 40 GB. The same graph stored as an edge list would take 3 MB.

Choosing the right graph representation is not a theoretical exercise. It directly determines whether your GNN fits in memory, whether batching is efficient, and which operations are fast versus prohibitively slow.

What You Will Learn

  • Adjacency matrix: dense and sparse, properties, algebraic uses
  • Adjacency list: the standard for graph traversal
  • Edge list (COO format): the standard for GNN frameworks
  • Incidence matrix: edges as first-class objects
  • Memory and compute trade-offs for each format
  • PyTorch Geometric's edge_index format
  • Batching multiple graphs for mini-batch GNN training

Part 1 - The Adjacency Matrix

Definition

For a graph G=(V,E)G = (V, E) with n=Vn = |V| vertices, the adjacency matrix A{0,1}n×nA \in \{0, 1\}^{n \times n} is:

Aij={1if (i,j)E0otherwiseA_{ij} = \begin{cases} 1 & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}

For undirected graphs: A=ATA = A^T (symmetric). For weighted graphs: Aij=w(i,j)A_{ij} = w(i,j).

Properties and ML connections

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G, dtype=np.float32)
print(f"Adjacency matrix shape: {A.shape}") # (34, 34)
print(f"Symmetric: {np.allclose(A, A.T)}") # True (undirected)
print(f"Non-zeros: {int(A.sum())}") # 2 × number of edges

# Property 1: Row sums = degree
degrees = A.sum(axis=1)
print(f"First 5 degrees: {degrees[:5]}")

# Property 2: A^k[i,j] counts the number of walks of length k from i to j
A_power_2 = A @ A
print(f"\nA²[0,0] = {A_power_2[0,0]:.0f} = number of walks of length 2 from node 0 to itself")
print(f"(This equals the degree of node 0: {int(degrees[0])})")

# Property 3: Eigenvalues encode global graph structure
eigenvalues = np.linalg.eigvalsh(A) # eigvalsh: symmetric, returns sorted eigenvalues
print(f"\nLargest eigenvalue (spectral radius): {eigenvalues[-1]:.4f}")
print(f"Second largest eigenvalue: {eigenvalues[-2]:.4f}")
# Gap between eigenvalues relates to community structure

# Property 4: Memory cost
print(f"\nDense adjacency memory: {A.nbytes / 1e3:.1f} KB") # Fine for n=34
# For n=10^6: 10^6 × 10^6 × 4 bytes = 4 TB - impossible!

When to use the adjacency matrix

Use CaseUse Matrix?Why
Small graphs (n<5000n < 5000)YesDense BLAS operations are fast
Spectral computations (eigenvalues)Yes (sparse)scipy.sparse.linalg.eigsh
Linear algebra operations (AkA^k, eAe^A)Yes (sparse)Matrix power is natural
GNN propagation on large graphsNoToo much memory; use edge_index
Graph traversal (BFS, DFS)NoAdjacency list is faster

Normalized adjacency matrix (GCN)

In Graph Convolutional Networks, the raw adjacency matrix is normalized to prevent vanishing/exploding features:

A^=D~1/2A~D~1/2,A~=A+I\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \tilde{A} = A + I

import scipy.sparse as sp
import numpy as np

def compute_normalized_adjacency(A: np.ndarray) -> np.ndarray:
"""
Compute symmetrically normalized adjacency matrix for GCN.
à = D^{-1/2} (A + I) D^{-1/2}
"""
A_hat = A + np.eye(A.shape[0]) # Add self-loops
D = np.diag(A_hat.sum(axis=1)) # Degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D) + 1e-10))
return D_inv_sqrt @ A_hat @ D_inv_sqrt

# For large graphs: use sparse matrices
def compute_normalized_adjacency_sparse(A_sparse: sp.spmatrix) -> sp.spmatrix:
"""Sparse version - the only feasible option for large graphs."""
n = A_sparse.shape[0]
A_hat = A_sparse + sp.eye(n, dtype=np.float32)
degrees = np.array(A_hat.sum(axis=1)).flatten()
d_inv_sqrt = np.power(degrees, -0.5)
d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
D_inv_sqrt = sp.diags(d_inv_sqrt)
return D_inv_sqrt @ A_hat @ D_inv_sqrt

Part 2 - Adjacency List

The adjacency list representation stores, for each vertex vv, the list of its neighbors:

adj[v]={u:(v,u)E}\text{adj}[v] = \{u : (v, u) \in E\}

# Python adjacency list: dict of sets or dict of lists
class AdjacencyList:
"""Memory-efficient adjacency list for sparse graphs."""

def __init__(self, directed: bool = False):
self.adj = {}
self.directed = directed

def add_node(self, v):
if v not in self.adj:
self.adj[v] = []

def add_edge(self, u, v, weight: float = 1.0):
self.add_node(u)
self.add_node(v)
self.adj[u].append((v, weight))
if not self.directed:
self.adj[v].append((u, weight))

def neighbors(self, v):
return [u for u, _ in self.adj.get(v, [])]

def degree(self, v):
return len(self.adj.get(v, []))

def memory_bytes(self) -> int:
"""Approximate memory usage."""
# Each edge entry: ~64 bytes (Python object overhead)
total_edges = sum(len(nbrs) for nbrs in self.adj.values())
return total_edges * 64 + len(self.adj) * 100 # rough estimate

# Example: social network
G_adj = AdjacencyList(directed=False)
G_adj.add_edge(0, 1)
G_adj.add_edge(0, 2)
G_adj.add_edge(1, 3)

print(f"Neighbors of 0: {G_adj.neighbors(0)}") # [1, 2]
print(f"Degree of 0: {G_adj.degree(0)}") # 2

Memory comparison

For a graph with nn nodes and mm edges:

FormatMemoryAccess NeighborAccess AijA_{ij}
Dense adjacency matrixO(n2)O(n^2)O(n)O(n) scanO(1)O(1)
Sparse adjacency matrix (CSR)O(n+m)O(n + m)O(deg(v))O(\deg(v))O(deg(v))O(\deg(v))
Adjacency list (Python)O(n+m)O(n + m) dict overheadO(1)O(1)O(deg(v))O(\deg(v))
Edge list (COO)O(m)O(m)O(m)O(m) scanO(m)O(m) scan
# Memory comparison for a graph with n=10^6, m=10^7
n = 1_000_000
m = 10_000_000

dense_bytes = n * n * 4 # float32
print(f"Dense matrix: {dense_bytes / 1e12:.1f} TB") # 4 TB - impossible!

# CSR sparse
csr_bytes = (m + n + 1) * 4 + m * 4 # indices + indptr + data
print(f"CSR sparse: {csr_bytes / 1e6:.1f} MB") # 88 MB - feasible

# Edge list (COO)
coo_bytes = 2 * m * 4 # row + col arrays
print(f"Edge list (COO): {coo_bytes / 1e6:.1f} MB") # 80 MB - feasible

Part 3 - Edge List and PyTorch Geometric's edge_index

The edge list (COO format) stores edges as pairs of node indices. This is the format used by virtually all GNN frameworks.

PyTorch Geometric edge_index

import torch
from torch_geometric.data import Data

# edge_index: shape (2, m) - m edges
# edge_index[0, :] = source nodes
# edge_index[1, :] = target nodes
# Convention: edge (u, v) means "u sends message to v"

# Triangle graph: 0-1-2-0
edge_index = torch.tensor([[0, 1, 1, 2, 2, 0],
[1, 0, 2, 1, 0, 2]], dtype=torch.long)
# Row 0: sources [0, 1, 1, 2, 2, 0]
# Row 1: targets [1, 0, 2, 1, 0, 2]

# Node features: 3 nodes, 4 features each
x = torch.randn(3, 4)

# Edge features: 6 edges, 2 features each (optional)
edge_attr = torch.randn(6, 2)

# PyG Data object
data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")
print(f"Node features: {data.x.shape}")
print(f"Edge features: {data.edge_attr.shape}")

# Convert between representations
import numpy as np
import networkx as nx

# edge_index → NetworkX graph
def to_networkx(edge_index: torch.Tensor, num_nodes: int) -> nx.Graph:
G = nx.Graph()
G.add_nodes_from(range(num_nodes))
edges = edge_index.t().tolist()
G.add_edges_from(edges)
return G

# NetworkX graph → edge_index
def from_networkx(G: nx.Graph) -> torch.Tensor:
edges = list(G.edges())
if len(edges) == 0:
return torch.zeros((2, 0), dtype=torch.long)
src = [e[0] for e in edges]
dst = [e[1] for e in edges]
# For undirected: add both directions
edge_index = torch.tensor([src + dst, dst + src], dtype=torch.long)
return edge_index

Adding reverse edges for undirected graphs

GNN message passing typically requires edges in both directions:

from torch_geometric.utils import to_undirected, add_self_loops, remove_self_loops

# Directed edge_index → undirected (add reverse edges)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]], dtype=torch.long)
edge_index_undirected = to_undirected(edge_index)
print(f"Directed edges: {edge_index.shape[1]}") # 3
print(f"Undirected edges: {edge_index_undirected.shape[1]}") # 6

# Add self-loops (needed for GCN's à = A + I)
edge_index_with_self_loops, _ = add_self_loops(edge_index_undirected, num_nodes=3)
print(f"With self-loops: {edge_index_with_self_loops.shape[1]}") # 9

Part 4 - The Incidence Matrix

The incidence matrix B{1,0,1}n×mB \in \{-1, 0, 1\}^{n \times m} (for directed graphs) or {0,1}n×m\{0, 1\}^{n \times m} (undirected):

For directed graph: Bij={1if edge j starts at vertex i+1if edge j ends at vertex i0otherwiseB_{ij} = \begin{cases} -1 & \text{if edge } j \text{ starts at vertex } i \\ +1 & \text{if edge } j \text{ ends at vertex } i \\ 0 & \text{otherwise} \end{cases}

import numpy as np

# Graph: 0→1, 0→2, 1→2
# 4 nodes (0,1,2), 3 edges (e0, e1, e2)
# B[node, edge]:
# e0 (0→1): B[0,0]=-1, B[1,0]=+1
# e1 (0→2): B[0,1]=-1, B[2,1]=+1
# e2 (1→2): B[1,2]=-1, B[2,2]=+1

B = np.array([[-1, -1, 0], # node 0: source of e0, e1
[ 1, 0, -1], # node 1: target of e0, source of e2
[ 0, 1, 1]], dtype=np.float32) # node 2: target of e1, e2

# Graph Laplacian: L = B @ B.T (for undirected graphs with ±1 incidence)
L = B @ B.T
print("Laplacian from incidence matrix:")
print(L)
# L[i,i] = degree(i), L[i,j] = -1 if (i,j) is an edge, 0 otherwise

ML use of incidence matrix: Used in electrical circuit models, message passing on edges (rather than nodes), and certain physics-informed neural networks.

Part 5 - Batching Graphs for Mini-Batch Training

A key challenge in GNN training: each graph can have a different number of nodes and edges. Batching requires combining multiple graphs into one.

The PyTorch Geometric batching trick

Multiple graphs are combined into a single disjoint union by offsetting node indices:

Graph 1: nodes [0, 1, 2], edges [[0,1],[1,2],[2,0]]
Graph 2: nodes [0, 1], edges [[0,1],[1,0]]
Graph 3: nodes [0, 1, 2, 3], edges [[0,1],[1,2],[2,3],[3,0]]

Batch: nodes [0,1,2, 3,4, 5,6,7,8]
edges (Graph 1): [[0,1],[1,2],[2,0]] ← no offset
edges (Graph 2): [[3,4],[4,3]] ← offset by 3
edges (Graph 3): [[5,6],[6,7],[7,8],[8,5]] ← offset by 5
from torch_geometric.data import Data, Batch
import torch

# Three separate graphs
graph1 = Data(
x=torch.randn(3, 4),
edge_index=torch.tensor([[0,1,1,2,2,0],[1,0,2,1,0,2]], dtype=torch.long),
y=torch.tensor([0]) # Graph-level label
)
graph2 = Data(
x=torch.randn(2, 4),
edge_index=torch.tensor([[0,1],[1,0]], dtype=torch.long),
y=torch.tensor([1])
)

# Batch them together
batch = Batch.from_data_list([graph1, graph2])
print(f"Batch nodes: {batch.num_nodes}") # 5 (3 + 2)
print(f"Batch edges: {batch.num_edges}") # 8 (6 + 2)
print(f"batch.batch: {batch.batch}") # tensor([0,0,0,1,1]) - node→graph mapping
print(f"batch.ptr: {batch.ptr}") # tensor([0,3,5]) - graph start indices

# batch.batch is essential for global pooling operations:
# mean_pool = scatter_mean(x, batch.batch, dim=0)
# This averages each graph's nodes into a single graph-level vector

Efficient neighborhood sampling for large graphs

For graphs with millions of nodes, training on the full graph at once is infeasible. Mini-batch training requires neighborhood sampling:

from torch_geometric.loader import NeighborLoader

# NeighborLoader samples a fixed number of neighbors per node per layer
# For a 2-layer GNN: [num_neighbors_layer1, num_neighbors_layer2]
loader = NeighborLoader(
data, # Full graph data object
num_neighbors=[10, 5], # Sample 10 neighbors in layer 1, 5 in layer 2
batch_size=1024, # 1024 seed nodes per batch
input_nodes=train_mask, # Only seed from training nodes
shuffle=True
)

for batch in loader:
# batch contains a subgraph with the sampled neighborhood
# batch.n_id: original node IDs of nodes in this subgraph
# batch.e_id: original edge IDs of edges in this subgraph
out = model(batch.x, batch.edge_index)
loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
# Only compute loss on seed nodes (first batch_size nodes)

Part 6 - Heterogeneous Graph Representations

For graphs with multiple node/edge types (knowledge graphs, recommendation systems):

from torch_geometric.data import HeteroData
import torch

# Heterogeneous graph: papers and authors
data = HeteroData()

# Node features for each type
data['paper'].x = torch.randn(100, 128) # 100 papers, 128 features
data['author'].x = torch.randn(50, 64) # 50 authors, 64 features

# Edge types: (source_type, relation, target_type)
data['author', 'writes', 'paper'].edge_index = torch.tensor([
[0, 1, 1, 2], # author indices
[0, 0, 1, 2] # paper indices
], dtype=torch.long)

data['paper', 'cites', 'paper'].edge_index = torch.tensor([
[0, 1, 2, 3], # source papers
[1, 2, 3, 4] # target papers
], dtype=torch.long)

print(data)
print(f"Node types: {data.node_types}") # ['paper', 'author']
print(f"Edge types: {data.edge_types}") # [('author','writes','paper'), ('paper','cites','paper')]

Interview Questions

Q1: Compare the adjacency matrix and edge list representations. When would you choose each for a GNN?

Adjacency matrix ARn×nA \in \mathbb{R}^{n \times n}:

  • Memory: O(n2)O(n^2) for dense, O(n+m)O(n + m) for sparse
  • Strength: Direct algebraic operations (AkA^k, eigenvalues, AXAX matrix multiply)
  • Weakness: O(n2)O(n^2) memory is infeasible for large graphs (1M nodes = 4 TB)

Edge list (COO format):

  • Memory: O(m)O(m) - only stores non-zero edges
  • Strength: Minimal memory, natural format for GNN message passing (iterate over edges), easy to construct
  • Weakness: Random access to AijA_{ij} requires O(m)O(m) scan; matrix algebra is less natural

For GNNs:

  • Edge list (PyG's edge_index): Standard for all practical GNN workloads. A GNN layer iterates over edges: "for each edge (u,v)(u,v): aggregate huh_u into hvh_v." This is O(m)O(m) - optimal. The edge list exactly indexes this loop.
  • Sparse CSR adjacency: Use when you need fast row access for neighborhood iteration (e.g., in BFS-based sampling) or when computing propagation as a matrix multiply A~H\tilde{A} H (e.g., for spectral GNNs).
  • Dense matrix: Use only for very small graphs (n<5000n < 5000) where BLAS optimization dominates, or for spectral computations that require eigendecomposition.

Key rule: For graph ML, match the representation to the operation. GNN message passing → edge_index. Spectral clustering → sparse CSR adjacency. Neighborhood sampling → CSR adjacency for O(deg)O(\deg) neighbor access.

Q2: What is the PyTorch Geometric `edge_index` format and why does GNN batching require node index offsetting?

edge_index is a (2, m) integer tensor where:

  • edge_index[0, k]: source node of edge kk
  • edge_index[1, k]: target node of edge kk

This is the COO sparse format applied to directed edges. It encodes all mm edges using 2m2m integers.

Why offsetting is needed for batching: Node indices in separate graphs are independent (both use integers 0,1,,ni10, 1, \ldots, n_i - 1). When combined into a batch, node 0 from graph 1 and node 0 from graph 2 are different nodes, but have the same index.

The solution: offset node indices in graph ii by j<inj\sum_{j < i} n_j (cumulative sum of graph sizes). This creates a single disjoint union graph where all node indices are globally unique. The batch tensor (mapping node → graph ID) is stored alongside to enable:

  1. Global pooling: Average node features within each graph using scatter_mean(x, batch, dim=0)
  2. Recovering per-graph quantities: Use batch.ptr to slice results back into per-graph tensors

This batching trick is why PyTorch Geometric can efficiently train on variable-size graph batches - the GNN sees a single large graph and uses the batch tensor to compute separate graph-level representations.

Q3: How does the normalized adjacency matrix à affect GNN feature propagation?

The GCN propagation rule is: H(l+1)=σ(A~H(l)W(l))H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)}) where A~=D^1/2(A+I)D^1/2\tilde{A} = \hat{D}^{-1/2} (A + I) \hat{D}^{-1/2}.

Without normalization (raw AA): Each node's new feature is the sum of its neighbors' features. High-degree nodes aggregate much more information - their features grow larger with each layer. Low-degree nodes receive less information - under-representation problem.

With symmetric normalization D1/2AD1/2D^{-1/2} A D^{-1/2}: Each edge (u,v)(u, v) contributes a weight of 1/deg(u)deg(v)1/\sqrt{\deg(u) \cdot \deg(v)}. High-degree nodes contribute less per neighbor and receive less per neighbor - feature magnitudes remain stable across layers.

Adding self-loops (A+I)(A + I): Ensures each node also retains its own features from the previous layer. Without self-loops, a node's feature at layer l+1l+1 only contains aggregated neighbor information - the node's own features are lost.

Spectral interpretation: The eigenvalues of A~\tilde{A} lie in [1,1][-1, 1]. The propagation H(l+1)=A~H(l)WH^{(l+1)} = \tilde{A} H^{(l)} W applies a spectral filter to the node features. Eigenvalues near 1 correspond to smooth (low-frequency) graph signals that are preserved; eigenvalues near 0 or -1 attenuate high-frequency signals. This is why many GNN layers of GCN-type produce increasingly smooth features - over-smoothing.

Q4: What are the tradeoffs between storing graphs in adjacency list vs CSR format in terms of memory and cache performance?

Python adjacency list (dict[int, list[int]]):

  • Each node is a Python object (dict key): ~50 bytes overhead
  • Each neighbor list is a Python list: ~56 bytes + ~8 bytes per element (pointer)
  • Total: O(n×50+m×8)O(n \times 50 + m \times 8) bytes plus Python object overhead
  • Cache performance: poor - following dict pointers causes cache misses

CSR (Compressed Sparse Row):

  • data: mm float32 values - 4 bytes each
  • indices: mm int32 values - 4 bytes each
  • indptr: (n+1)(n+1) int32 values - 4 bytes each
  • Total: (2m+n+1)×4(2m + n + 1) \times 4 bytes
  • Cache performance: good - data[indptr[i]:indptr[i+1]] is a contiguous memory block

Numerical example (n=106n = 10^6, m=107m = 10^7, average degree 10):

  • Python adj list: 50×106+8×107=50+80=130\approx 50 \times 10^6 + 8 \times 10^7 = 50 + 80 = 130 MB, poor cache behavior
  • CSR: (2×107+106)×4=84(2 \times 10^7 + 10^6) \times 4 = 84 MB, excellent cache behavior (contiguous)

For GNN training at scale, CSR (or edge_index in PyG which is essentially COO) is always preferred over Python dicts/lists. The contiguous memory layout allows the GPU to efficiently parallelize message passing over all edges simultaneously.

Q5: How would you represent a heterogeneous knowledge graph in a GNN framework?

A heterogeneous knowledge graph has:

  • Multiple node types (entities: people, organizations, concepts)
  • Multiple edge types (relations: works_at, founded_by, located_in)

Representation approaches:

1. PyTorch Geometric HeteroData:

data = HeteroData()
data['entity_type_A'].x = node_features_A # (n_A, d_A)
data['entity_type_B'].x = node_features_B # (n_B, d_B)
data['entity_type_A', 'rel_1', 'entity_type_B'].edge_index = ...
data['entity_type_B', 'rel_2', 'entity_type_A'].edge_index = ...

Each (source_type, relation, target_type) triple has its own edge_index. The model applies separate weight matrices per relation type.

2. Type-conditioned edge features: Encode relation type as an integer edge feature alongside a homogeneous edge_index. The GNN conditions message computation on the edge type.

3. Separate adjacency matrices per relation: ArA_{r} for each relation type rr. RGCN uses: hv(l+1)=ruNr(v)1cv,rWrhu(l)+W0hv(l)h_v^{(l+1)} = \sum_r \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r h_u^{(l)} + W_0 h_v^{(l)}

Memory consideration: A knowledge graph with 100 relation types and RGCN would need 100 separate weight matrices - O(d2×r)O(d^2 \times r) parameters. Practical implementations use basis decomposition or block-diagonal matrices to reduce this.

For link prediction in knowledge graphs (most common task), transductive methods (TransE, RotatE) often work better than GNNs because they directly model entity-relation-entity triples without needing to aggregate over neighborhoods.

Quick Reference

FormatMemoryNeighbor AccessAijA_{ij} AccessBest For
Dense matrixO(n2)O(n^2)O(n)O(n)O(1)O(1)Small graphs, spectral ops
Sparse CSRO(n+m)O(n + m)O(deg)O(\deg)O(deg)O(\deg)Matrix multiply, solvers
Adjacency list (Python)O(n+m)O(n + m)O(1)O(1)O(deg)O(\deg)Traversal (BFS/DFS)
Edge list (COO)O(m)O(m)O(m)O(m)O(m)O(m)GNN message passing
Incidence matrixO(nm)O(n \cdot m)O(m)O(m)-Edge-centric operations
# Quick conversion guide (PyTorch Geometric)
from torch_geometric.utils import (
to_dense_adj, # edge_index → dense adjacency matrix
dense_to_sparse, # dense matrix → edge_index + edge_weight
to_scipy_sparse_matrix, # edge_index → scipy sparse
from_scipy_sparse_matrix # scipy sparse → edge_index
)

Next: Lesson 03: Graph Algorithms →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Graph Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.