Graph Representations - Adjacency Matrix, Edge List, and ML Tradeoffs

Reading time: ~20 minutes | Level: Graph Theory → ML Engineering

Your GNN training run is using 40 GB of memory for a graph with only 100,000 nodes. The culprit: you stored the adjacency matrix as dense float32 - $10^5 \times 10^5 \times 4$ bytes = 40 GB. The same graph stored as an edge list would take 3 MB.

Choosing the right graph representation is not a theoretical exercise. It directly determines whether your GNN fits in memory, whether batching is efficient, and which operations are fast versus prohibitively slow.

What You Will Learn

Adjacency matrix: dense and sparse, properties, algebraic uses
Adjacency list: the standard for graph traversal
Edge list (COO format): the standard for GNN frameworks
Incidence matrix: edges as first-class objects
Memory and compute trade-offs for each format
PyTorch Geometric's edge_index format
Batching multiple graphs for mini-batch GNN training

Part 1 - The Adjacency Matrix

Definition

For a graph $G = (V, E)$ with $n = |V|$ vertices, the adjacency matrix $A \in \{0, 1\}^{n \times n}$ is:

$A_{ij} = \begin{cases} 1 & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$

For undirected graphs: $A = A^T$ (symmetric). For weighted graphs: $A_{ij} = w(i,j)$ .

Properties and ML connections

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G, dtype=np.float32)
print(f"Adjacency matrix shape: {A.shape}")   # (34, 34)
print(f"Symmetric: {np.allclose(A, A.T)}")    # True (undirected)
print(f"Non-zeros: {int(A.sum())}")            # 2 × number of edges

# Property 1: Row sums = degree
degrees = A.sum(axis=1)
print(f"First 5 degrees: {degrees[:5]}")

# Property 2: A^k[i,j] counts the number of walks of length k from i to j
A_power_2 = A @ A
print(f"\nA²[0,0] = {A_power_2[0,0]:.0f} = number of walks of length 2 from node 0 to itself")
print(f"(This equals the degree of node 0: {int(degrees[0])})")

# Property 3: Eigenvalues encode global graph structure
eigenvalues = np.linalg.eigvalsh(A)  # eigvalsh: symmetric, returns sorted eigenvalues
print(f"\nLargest eigenvalue (spectral radius): {eigenvalues[-1]:.4f}")
print(f"Second largest eigenvalue: {eigenvalues[-2]:.4f}")
# Gap between eigenvalues relates to community structure

# Property 4: Memory cost
print(f"\nDense adjacency memory: {A.nbytes / 1e3:.1f} KB")  # Fine for n=34
# For n=10^6: 10^6 × 10^6 × 4 bytes = 4 TB - impossible!

When to use the adjacency matrix

Use Case	Use Matrix?	Why
Small graphs ( $n < 5000$ )	Yes	Dense BLAS operations are fast
Spectral computations (eigenvalues)	Yes (sparse)	`scipy.sparse.linalg.eigsh`
Linear algebra operations ( $A^k$ , $e^A$ )	Yes (sparse)	Matrix power is natural
GNN propagation on large graphs	No	Too much memory; use edge_index
Graph traversal (BFS, DFS)	No	Adjacency list is faster

Normalized adjacency matrix (GCN)

In Graph Convolutional Networks, the raw adjacency matrix is normalized to prevent vanishing/exploding features:

$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \tilde{A} = A + I$

import scipy.sparse as sp
import numpy as np

def compute_normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """
    Compute symmetrically normalized adjacency matrix for GCN.
    Ã = D^{-1/2} (A + I) D^{-1/2}
    """
    A_hat = A + np.eye(A.shape[0])   # Add self-loops
    D = np.diag(A_hat.sum(axis=1))   # Degree matrix
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D) + 1e-10))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

# For large graphs: use sparse matrices
def compute_normalized_adjacency_sparse(A_sparse: sp.spmatrix) -> sp.spmatrix:
    """Sparse version - the only feasible option for large graphs."""
    n = A_sparse.shape[0]
    A_hat = A_sparse + sp.eye(n, dtype=np.float32)
    degrees = np.array(A_hat.sum(axis=1)).flatten()
    d_inv_sqrt = np.power(degrees, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    D_inv_sqrt = sp.diags(d_inv_sqrt)
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

Part 2 - Adjacency List

The adjacency list representation stores, for each vertex $v$ , the list of its neighbors:

$\text{adj}[v] = \{u : (v, u) \in E\}$

# Python adjacency list: dict of sets or dict of lists
class AdjacencyList:
    """Memory-efficient adjacency list for sparse graphs."""

    def __init__(self, directed: bool = False):
        self.adj = {}
        self.directed = directed

    def add_node(self, v):
        if v not in self.adj:
            self.adj[v] = []

    def add_edge(self, u, v, weight: float = 1.0):
        self.add_node(u)
        self.add_node(v)
        self.adj[u].append((v, weight))
        if not self.directed:
            self.adj[v].append((u, weight))

    def neighbors(self, v):
        return [u for u, _ in self.adj.get(v, [])]

    def degree(self, v):
        return len(self.adj.get(v, []))

    def memory_bytes(self) -> int:
        """Approximate memory usage."""
        # Each edge entry: ~64 bytes (Python object overhead)
        total_edges = sum(len(nbrs) for nbrs in self.adj.values())
        return total_edges * 64 + len(self.adj) * 100  # rough estimate

# Example: social network
G_adj = AdjacencyList(directed=False)
G_adj.add_edge(0, 1)
G_adj.add_edge(0, 2)
G_adj.add_edge(1, 3)

print(f"Neighbors of 0: {G_adj.neighbors(0)}")   # [1, 2]
print(f"Degree of 0: {G_adj.degree(0)}")          # 2

Memory comparison

For a graph with $n$ nodes and $m$ edges:

Format	Memory	Access Neighbor	Access $A_{ij}$
Dense adjacency matrix	$O(n^2)$	$O(n)$ scan	$O(1)$
Sparse adjacency matrix (CSR)	$O(n + m)$	$O(\deg(v))$	$O(\deg(v))$
Adjacency list (Python)	$O(n + m)$ dict overhead	$O(1)$	$O(\deg(v))$
Edge list (COO)	$O(m)$	$O(m)$ scan	$O(m)$ scan

# Memory comparison for a graph with n=10^6, m=10^7
n = 1_000_000
m = 10_000_000

dense_bytes = n * n * 4   # float32
print(f"Dense matrix: {dense_bytes / 1e12:.1f} TB")   # 4 TB - impossible!

# CSR sparse
csr_bytes = (m + n + 1) * 4 + m * 4  # indices + indptr + data
print(f"CSR sparse: {csr_bytes / 1e6:.1f} MB")        # 88 MB - feasible

# Edge list (COO)
coo_bytes = 2 * m * 4   # row + col arrays
print(f"Edge list (COO): {coo_bytes / 1e6:.1f} MB")   # 80 MB - feasible

Part 3 - Edge List and PyTorch Geometric's edge_index

The edge list (COO format) stores edges as pairs of node indices. This is the format used by virtually all GNN frameworks.

PyTorch Geometric edge_index

import torch
from torch_geometric.data import Data

# edge_index: shape (2, m) - m edges
# edge_index[0, :] = source nodes
# edge_index[1, :] = target nodes
# Convention: edge (u, v) means "u sends message to v"

# Triangle graph: 0-1-2-0
edge_index = torch.tensor([[0, 1, 1, 2, 2, 0],
                            [1, 0, 2, 1, 0, 2]], dtype=torch.long)
# Row 0: sources [0, 1, 1, 2, 2, 0]
# Row 1: targets [1, 0, 2, 1, 0, 2]

# Node features: 3 nodes, 4 features each
x = torch.randn(3, 4)

# Edge features: 6 edges, 2 features each (optional)
edge_attr = torch.randn(6, 2)

# PyG Data object
data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")
print(f"Node features: {data.x.shape}")
print(f"Edge features: {data.edge_attr.shape}")

# Convert between representations
import numpy as np
import networkx as nx

# edge_index → NetworkX graph
def to_networkx(edge_index: torch.Tensor, num_nodes: int) -> nx.Graph:
    G = nx.Graph()
    G.add_nodes_from(range(num_nodes))
    edges = edge_index.t().tolist()
    G.add_edges_from(edges)
    return G

# NetworkX graph → edge_index
def from_networkx(G: nx.Graph) -> torch.Tensor:
    edges = list(G.edges())
    if len(edges) == 0:
        return torch.zeros((2, 0), dtype=torch.long)
    src = [e[0] for e in edges]
    dst = [e[1] for e in edges]
    # For undirected: add both directions
    edge_index = torch.tensor([src + dst, dst + src], dtype=torch.long)
    return edge_index

Adding reverse edges for undirected graphs

GNN message passing typically requires edges in both directions:

from torch_geometric.utils import to_undirected, add_self_loops, remove_self_loops

# Directed edge_index → undirected (add reverse edges)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]], dtype=torch.long)
edge_index_undirected = to_undirected(edge_index)
print(f"Directed edges: {edge_index.shape[1]}")         # 3
print(f"Undirected edges: {edge_index_undirected.shape[1]}")  # 6

# Add self-loops (needed for GCN's Ã = A + I)
edge_index_with_self_loops, _ = add_self_loops(edge_index_undirected, num_nodes=3)
print(f"With self-loops: {edge_index_with_self_loops.shape[1]}")  # 9

Part 4 - The Incidence Matrix

The incidence matrix $B \in \{-1, 0, 1\}^{n \times m}$ (for directed graphs) or $\{0, 1\}^{n \times m}$ (undirected):

For directed graph: $B_{ij} = \begin{cases} -1 & \text{if edge } j \text{ starts at vertex } i \\ +1 & \text{if edge } j \text{ ends at vertex } i \\ 0 & \text{otherwise} \end{cases}$

import numpy as np

# Graph: 0→1, 0→2, 1→2
# 4 nodes (0,1,2), 3 edges (e0, e1, e2)
# B[node, edge]:
#   e0 (0→1): B[0,0]=-1, B[1,0]=+1
#   e1 (0→2): B[0,1]=-1, B[2,1]=+1
#   e2 (1→2): B[1,2]=-1, B[2,2]=+1

B = np.array([[-1, -1,  0],   # node 0: source of e0, e1
              [ 1,  0, -1],   # node 1: target of e0, source of e2
              [ 0,  1,  1]], dtype=np.float32)  # node 2: target of e1, e2

# Graph Laplacian: L = B @ B.T (for undirected graphs with ±1 incidence)
L = B @ B.T
print("Laplacian from incidence matrix:")
print(L)
# L[i,i] = degree(i), L[i,j] = -1 if (i,j) is an edge, 0 otherwise

ML use of incidence matrix: Used in electrical circuit models, message passing on edges (rather than nodes), and certain physics-informed neural networks.

Part 5 - Batching Graphs for Mini-Batch Training

A key challenge in GNN training: each graph can have a different number of nodes and edges. Batching requires combining multiple graphs into one.

The PyTorch Geometric batching trick

Multiple graphs are combined into a single disjoint union by offsetting node indices:

Graph 1: nodes [0, 1, 2], edges [[0,1],[1,2],[2,0]]
Graph 2: nodes [0, 1],    edges [[0,1],[1,0]]
Graph 3: nodes [0, 1, 2, 3], edges [[0,1],[1,2],[2,3],[3,0]]

Batch:   nodes [0,1,2, 3,4, 5,6,7,8]
         edges (Graph 1): [[0,1],[1,2],[2,0]]      ← no offset
         edges (Graph 2): [[3,4],[4,3]]              ← offset by 3
         edges (Graph 3): [[5,6],[6,7],[7,8],[8,5]] ← offset by 5

from torch_geometric.data import Data, Batch
import torch

# Three separate graphs
graph1 = Data(
    x=torch.randn(3, 4),
    edge_index=torch.tensor([[0,1,1,2,2,0],[1,0,2,1,0,2]], dtype=torch.long),
    y=torch.tensor([0])  # Graph-level label
)
graph2 = Data(
    x=torch.randn(2, 4),
    edge_index=torch.tensor([[0,1],[1,0]], dtype=torch.long),
    y=torch.tensor([1])
)

# Batch them together
batch = Batch.from_data_list([graph1, graph2])
print(f"Batch nodes: {batch.num_nodes}")    # 5 (3 + 2)
print(f"Batch edges: {batch.num_edges}")    # 8 (6 + 2)
print(f"batch.batch: {batch.batch}")        # tensor([0,0,0,1,1]) - node→graph mapping
print(f"batch.ptr: {batch.ptr}")            # tensor([0,3,5]) - graph start indices

# batch.batch is essential for global pooling operations:
# mean_pool = scatter_mean(x, batch.batch, dim=0)
# This averages each graph's nodes into a single graph-level vector

Efficient neighborhood sampling for large graphs

For graphs with millions of nodes, training on the full graph at once is infeasible. Mini-batch training requires neighborhood sampling:

from torch_geometric.loader import NeighborLoader

# NeighborLoader samples a fixed number of neighbors per node per layer
# For a 2-layer GNN: [num_neighbors_layer1, num_neighbors_layer2]
loader = NeighborLoader(
    data,                          # Full graph data object
    num_neighbors=[10, 5],         # Sample 10 neighbors in layer 1, 5 in layer 2
    batch_size=1024,               # 1024 seed nodes per batch
    input_nodes=train_mask,        # Only seed from training nodes
    shuffle=True
)

for batch in loader:
    # batch contains a subgraph with the sampled neighborhood
    # batch.n_id: original node IDs of nodes in this subgraph
    # batch.e_id: original edge IDs of edges in this subgraph
    out = model(batch.x, batch.edge_index)
    loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
    # Only compute loss on seed nodes (first batch_size nodes)

Part 6 - Heterogeneous Graph Representations

For graphs with multiple node/edge types (knowledge graphs, recommendation systems):

from torch_geometric.data import HeteroData
import torch

# Heterogeneous graph: papers and authors
data = HeteroData()

# Node features for each type
data['paper'].x = torch.randn(100, 128)    # 100 papers, 128 features
data['author'].x = torch.randn(50, 64)     # 50 authors, 64 features

# Edge types: (source_type, relation, target_type)
data['author', 'writes', 'paper'].edge_index = torch.tensor([
    [0, 1, 1, 2],  # author indices
    [0, 0, 1, 2]   # paper indices
], dtype=torch.long)

data['paper', 'cites', 'paper'].edge_index = torch.tensor([
    [0, 1, 2, 3],  # source papers
    [1, 2, 3, 4]   # target papers
], dtype=torch.long)

print(data)
print(f"Node types: {data.node_types}")    # ['paper', 'author']
print(f"Edge types: {data.edge_types}")    # [('author','writes','paper'), ('paper','cites','paper')]

Interview Questions

Q1: Compare the adjacency matrix and edge list representations. When would you choose each for a GNN?

Adjacency matrix $A \in \mathbb{R}^{n \times n}$ :

Memory: $O(n^2)$ for dense, $O(n + m)$ for sparse
Strength: Direct algebraic operations ( $A^k$ , eigenvalues, $AX$ matrix multiply)
Weakness: $O(n^2)$ memory is infeasible for large graphs (1M nodes = 4 TB)

Edge list (COO format):

Memory: $O(m)$ - only stores non-zero edges
Strength: Minimal memory, natural format for GNN message passing (iterate over edges), easy to construct
Weakness: Random access to $A_{ij}$ requires $O(m)$ scan; matrix algebra is less natural

For GNNs:

Edge list (PyG's edge_index): Standard for all practical GNN workloads. A GNN layer iterates over edges: "for each edge $(u,v)$ : aggregate $h_u$ into $h_v$ ." This is $O(m)$ - optimal. The edge list exactly indexes this loop.
Sparse CSR adjacency: Use when you need fast row access for neighborhood iteration (e.g., in BFS-based sampling) or when computing propagation as a matrix multiply $\tilde{A} H$ (e.g., for spectral GNNs).
Dense matrix: Use only for very small graphs ( $n < 5000$ ) where BLAS optimization dominates, or for spectral computations that require eigendecomposition.

Key rule: For graph ML, match the representation to the operation. GNN message passing → edge_index. Spectral clustering → sparse CSR adjacency. Neighborhood sampling → CSR adjacency for $O(\deg)$ neighbor access.

Q2: What is the PyTorch Geometric `edge_index` format and why does GNN batching require node index offsetting?

edge_index is a (2, m) integer tensor where:

edge_index[0, k]: source node of edge $k$
edge_index[1, k]: target node of edge $k$

This is the COO sparse format applied to directed edges. It encodes all $m$ edges using $2m$ integers.

Why offsetting is needed for batching: Node indices in separate graphs are independent (both use integers $0, 1, \ldots, n_i - 1$ ). When combined into a batch, node 0 from graph 1 and node 0 from graph 2 are different nodes, but have the same index.

The solution: offset node indices in graph $i$ by $\sum_{j < i} n_j$ (cumulative sum of graph sizes). This creates a single disjoint union graph where all node indices are globally unique. The batch tensor (mapping node → graph ID) is stored alongside to enable:

Global pooling: Average node features within each graph using scatter_mean(x, batch, dim=0)
Recovering per-graph quantities: Use batch.ptr to slice results back into per-graph tensors

This batching trick is why PyTorch Geometric can efficiently train on variable-size graph batches - the GNN sees a single large graph and uses the batch tensor to compute separate graph-level representations.

Q3: How does the normalized adjacency matrix Ã affect GNN feature propagation?

The GCN propagation rule is: $H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)})$ where $\tilde{A} = \hat{D}^{-1/2} (A + I) \hat{D}^{-1/2}$ .

Without normalization (raw $A$ ): Each node's new feature is the sum of its neighbors' features. High-degree nodes aggregate much more information - their features grow larger with each layer. Low-degree nodes receive less information - under-representation problem.

With symmetric normalization $D^{-1/2} A D^{-1/2}$ : Each edge $(u, v)$ contributes a weight of $1/\sqrt{\deg(u) \cdot \deg(v)}$ . High-degree nodes contribute less per neighbor and receive less per neighbor - feature magnitudes remain stable across layers.

Adding self-loops $(A + I)$ : Ensures each node also retains its own features from the previous layer. Without self-loops, a node's feature at layer $l+1$ only contains aggregated neighbor information - the node's own features are lost.

Spectral interpretation: The eigenvalues of $\tilde{A}$ lie in $[-1, 1]$ . The propagation $H^{(l+1)} = \tilde{A} H^{(l)} W$ applies a spectral filter to the node features. Eigenvalues near 1 correspond to smooth (low-frequency) graph signals that are preserved; eigenvalues near 0 or -1 attenuate high-frequency signals. This is why many GNN layers of GCN-type produce increasingly smooth features - over-smoothing.

Q4: What are the tradeoffs between storing graphs in adjacency list vs CSR format in terms of memory and cache performance?

Python adjacency list (dict[int, list[int]]):

Each node is a Python object (dict key): ~50 bytes overhead
Each neighbor list is a Python list: ~56 bytes + ~8 bytes per element (pointer)
Total: $O(n \times 50 + m \times 8)$ bytes plus Python object overhead
Cache performance: poor - following dict pointers causes cache misses

CSR (Compressed Sparse Row):

data: $m$ float32 values - 4 bytes each
indices: $m$ int32 values - 4 bytes each
indptr: $(n+1)$ int32 values - 4 bytes each
Total: $(2m + n + 1) \times 4$ bytes
Cache performance: good - data[indptr[i]:indptr[i+1]] is a contiguous memory block

Numerical example ( $n = 10^6$ , $m = 10^7$ , average degree 10):

Python adj list: $\approx 50 \times 10^6 + 8 \times 10^7 = 50 + 80 = 130$ MB, poor cache behavior
CSR: $(2 \times 10^7 + 10^6) \times 4 = 84$ MB, excellent cache behavior (contiguous)

For GNN training at scale, CSR (or edge_index in PyG which is essentially COO) is always preferred over Python dicts/lists. The contiguous memory layout allows the GPU to efficiently parallelize message passing over all edges simultaneously.

Q5: How would you represent a heterogeneous knowledge graph in a GNN framework?

A heterogeneous knowledge graph has:

Multiple node types (entities: people, organizations, concepts)
Multiple edge types (relations: works_at, founded_by, located_in)

Representation approaches:

1. PyTorch Geometric HeteroData:

data = HeteroData()
data['entity_type_A'].x = node_features_A          # (n_A, d_A)
data['entity_type_B'].x = node_features_B          # (n_B, d_B)
data['entity_type_A', 'rel_1', 'entity_type_B'].edge_index = ...
data['entity_type_B', 'rel_2', 'entity_type_A'].edge_index = ...

Each (source_type, relation, target_type) triple has its own edge_index. The model applies separate weight matrices per relation type.

2. Type-conditioned edge features: Encode relation type as an integer edge feature alongside a homogeneous edge_index. The GNN conditions message computation on the edge type.

3. Separate adjacency matrices per relation: $A_{r}$ for each relation type $r$ . RGCN uses: $h_v^{(l+1)} = \sum_r \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r h_u^{(l)} + W_0 h_v^{(l)}$

Memory consideration: A knowledge graph with 100 relation types and RGCN would need 100 separate weight matrices - $O(d^2 \times r)$ parameters. Practical implementations use basis decomposition or block-diagonal matrices to reduce this.

For link prediction in knowledge graphs (most common task), transductive methods (TransE, RotatE) often work better than GNNs because they directly model entity-relation-entity triples without needing to aggregate over neighborhoods.

Quick Reference

Format	Memory	Neighbor Access	$A_{ij}$ Access	Best For
Dense matrix	$O(n^2)$	$O(n)$	$O(1)$	Small graphs, spectral ops
Sparse CSR	$O(n + m)$	$O(\deg)$	$O(\deg)$	Matrix multiply, solvers
Adjacency list (Python)	$O(n + m)$	$O(1)$	$O(\deg)$	Traversal (BFS/DFS)
Edge list (COO)	$O(m)$	$O(m)$	$O(m)$	GNN message passing
Incidence matrix	$O(n \cdot m)$	$O(m)$	-	Edge-centric operations

# Quick conversion guide (PyTorch Geometric)
from torch_geometric.utils import (
    to_dense_adj,      # edge_index → dense adjacency matrix
    dense_to_sparse,   # dense matrix → edge_index + edge_weight
    to_scipy_sparse_matrix,  # edge_index → scipy sparse
    from_scipy_sparse_matrix # scipy sparse → edge_index
)

Next: Lesson 03: Graph Algorithms →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Graph Explorer demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - The Adjacency Matrix​

Definition​

Properties and ML connections​

When to use the adjacency matrix​

Normalized adjacency matrix (GCN)​

Part 2 - Adjacency List​

Memory comparison​

Part 3 - Edge List and PyTorch Geometric's edge_index​

PyTorch Geometric edge_index​

Adding reverse edges for undirected graphs​

Part 4 - The Incidence Matrix​

Part 5 - Batching Graphs for Mini-Batch Training​

The PyTorch Geometric batching trick​

Efficient neighborhood sampling for large graphs​

Part 6 - Heterogeneous Graph Representations​

Interview Questions​

Quick Reference​