Graph Algorithms - BFS, DFS, Dijkstra, PageRank, and ML Feature Engineering
Reading time: ~24 minutes | Level: Graph Theory → ML Engineering
You are building a fraud detection system for a financial network. Each transaction is a directed edge, each account is a node. You want features that capture: how central is this account in the transaction network? Are there short cycles suggesting money laundering? Does this account sit on many shortest paths between other accounts?
These questions are answered by graph algorithms. They are the bridge between raw graph structure and ML-usable features. In the GNN era, knowing graph algorithms also tells you what information your model can theoretically compute - and what it cannot.
What You Will Learn
- BFS and DFS: traversal, connected components, bipartite detection
- Dijkstra's algorithm: shortest paths in weighted graphs
- Topological sort: execution order for DAGs
- Minimum spanning tree (Kruskal, Prim)
- PageRank: random walk based node importance
- Graph feature engineering for tabular ML and GNN initialization
- Complexity analysis and practical Python implementations
Part 1 - Breadth-First Search (BFS)
Algorithm
BFS explores a graph level by level from a source node, using a queue:
BFS(G, source):
Initialize: visited = {source}, queue = [source], distance = {source: 0}
While queue is not empty:
v = queue.pop_front()
For each neighbor u of v:
If u not in visited:
visited.add(u)
distance[u] = distance[v] + 1
queue.append(u)
Return distances
Time complexity: Space complexity:
from collections import deque
import networkx as nx
import numpy as np
def bfs(adj: dict, source: int) -> dict:
"""
BFS from source node.
Returns shortest-path distances (unweighted) to all reachable nodes.
"""
visited = {source}
distance = {source: 0}
queue = deque([source])
while queue:
v = queue.popleft()
for u in adj.get(v, []):
if u not in visited:
visited.add(u)
distance[u] = distance[v] + 1
queue.append(u)
return distance
# ML application: compute closeness centrality efficiently
def closeness_centrality_from_bfs(adj: dict, n: int) -> dict:
"""Closeness centrality = n-1 / sum(distances from v to all others)."""
centrality = {}
nodes = list(adj.keys())
for v in nodes:
dists = bfs(adj, v)
if len(dists) > 1:
avg_dist = sum(dists.values()) / (len(dists) - 1)
centrality[v] = (len(dists) - 1) / (avg_dist * (n - 1)) if avg_dist > 0 else 0
else:
centrality[v] = 0
return centrality
# BFS for finding connected components
def connected_components_bfs(adj: dict) -> list:
"""Find all connected components using BFS."""
all_nodes = set(adj.keys())
visited = set()
components = []
for start in all_nodes:
if start not in visited:
# BFS from unvisited node
component = set()
queue = deque([start])
while queue:
v = queue.popleft()
if v not in visited:
visited.add(v)
component.add(v)
for u in adj.get(v, []):
if u not in visited:
queue.append(u)
components.append(component)
return components
BFS in GNNs: k-hop neighborhoods
BFS directly corresponds to GNN receptive fields - a -layer GNN aggregates exactly the -hop BFS neighborhood of each node:
import networkx as nx
from torch_geometric.utils import k_hop_subgraph
import torch
G = nx.karate_club_graph()
edge_index = torch.tensor(list(G.edges())).t().contiguous()
# Make undirected
edge_index = torch.cat([edge_index, edge_index.flip(0)], dim=1)
# 2-hop neighborhood of node 0 (what a 2-layer GNN "sees")
node_idx = 0
num_hops = 2
subset, sub_edge_index, mapping, edge_mask = k_hop_subgraph(
node_idx=node_idx,
num_hops=num_hops,
edge_index=edge_index,
num_nodes=G.number_of_nodes()
)
print(f"Nodes in {num_hops}-hop neighborhood of node 0: {subset.tolist()}")
print(f"Number of nodes: {len(subset)}")
# These are exactly the nodes a 2-layer GNN uses to compute node 0's representation
Part 2 - Depth-First Search (DFS)
DFS explores a graph by going as deep as possible before backtracking, using a stack (or recursion):
def dfs(adj: dict, source: int) -> list:
"""
Iterative DFS from source.
Returns nodes in DFS visit order.
"""
visited = set()
stack = [source]
order = []
while stack:
v = stack.pop()
if v not in visited:
visited.add(v)
order.append(v)
# Add neighbors to stack (reverse order for left-to-right DFS)
for u in reversed(list(adj.get(v, []))):
if u not in visited:
stack.append(u)
return order
def has_cycle_undirected(adj: dict) -> bool:
"""
Detect cycle in undirected graph using DFS.
A back edge (to a visited non-parent node) indicates a cycle.
"""
visited = set()
def dfs_cycle(v, parent):
visited.add(v)
for u in adj.get(v, []):
if u not in visited:
if dfs_cycle(u, v):
return True
elif u != parent:
return True # Back edge to non-parent = cycle
return False
for v in adj:
if v not in visited:
if dfs_cycle(v, -1):
return True
return False
# DFS for topological sort
def topological_sort_dfs(adj: dict) -> list:
"""
Topological sort using DFS (post-order traversal).
Only valid for DAGs.
"""
visited = set()
result = []
def dfs_topo(v):
visited.add(v)
for u in adj.get(v, []):
if u not in visited:
dfs_topo(u)
result.append(v) # Add to result AFTER processing all neighbors
for v in adj:
if v not in visited:
dfs_topo(v)
return result[::-1] # Reverse post-order = topological order
Part 3 - Dijkstra's Shortest Path Algorithm
For weighted graphs with non-negative weights, Dijkstra's algorithm finds shortest paths from a source to all other vertices.
Key idea: Greedy - always expand the unvisited node with the smallest known distance.
Time complexity: with a binary heap.
import heapq
import numpy as np
def dijkstra(adj_weighted: dict, source: int) -> dict:
"""
Dijkstra's shortest path algorithm.
adj_weighted[v] = [(neighbor, weight), ...] - adjacency list with weights.
Returns: dict of shortest distances from source to all reachable nodes.
"""
distances = {source: 0.0}
pq = [(0.0, source)] # (distance, node) priority queue
while pq:
dist_v, v = heapq.heappop(pq)
# Skip if we've already found a shorter path
if dist_v > distances.get(v, float('inf')):
continue
for u, weight in adj_weighted.get(v, []):
new_dist = dist_v + weight
if new_dist < distances.get(u, float('inf')):
distances[u] = new_dist
heapq.heappush(pq, (new_dist, u))
return distances
# Example: road network
road_network = {
'NYC': [('Boston', 215), ('DC', 225), ('Philly', 95)],
'Boston': [('NYC', 215), ('Providence', 50)],
'DC': [('NYC', 225), ('Philly', 140), ('Richmond', 110)],
'Philly': [('NYC', 95), ('DC', 140)],
'Providence': [('Boston', 50)],
'Richmond': [('DC', 110)]
}
distances_from_nyc = dijkstra(road_network, 'NYC')
for city, dist in sorted(distances_from_nyc.items(), key=lambda x: x[1]):
print(f"NYC → {city}: {dist} miles")
# scipy/networkx equivalents
import networkx as nx
G = nx.Graph()
for u, neighbors in road_network.items():
for v, w in neighbors:
G.add_edge(u, v, weight=w)
nx_distances = nx.single_source_dijkstra_path_length(G, 'NYC')
all_pairs = dict(nx.all_pairs_dijkstra_path_length(G))
Shortest path features for ML
All-pairs shortest path distances are powerful graph features:
import numpy as np
import networkx as nx
def compute_distance_matrix(G: nx.Graph) -> np.ndarray:
"""
Compute all-pairs shortest path distance matrix.
Useful for graph kernel methods and position encodings.
O(n * (n + m) * log n) time - feasible for n < 10^4.
"""
n = G.number_of_nodes()
nodes = sorted(G.nodes())
node_to_idx = {v: i for i, v in enumerate(nodes)}
dist_matrix = np.full((n, n), np.inf)
np.fill_diagonal(dist_matrix, 0)
for source in nodes:
dists = nx.single_source_shortest_path_length(G, source)
for target, d in dists.items():
dist_matrix[node_to_idx[source], node_to_idx[target]] = d
return dist_matrix
G = nx.karate_club_graph()
D = compute_distance_matrix(G)
print(f"Distance matrix shape: {D.shape}")
print(f"Diameter (max distance): {D[D < np.inf].max():.0f}")
print(f"Average distance: {D[D < np.inf].mean():.2f}")
# Shortest path features as node embeddings (naive, for baselines)
# Each node is represented by its distances to all other nodes
node_features_from_distances = D # (n, n) matrix - each row is a feature vector
Part 4 - Topological Sort
Topological sort orders nodes so that all edges go from earlier to later in the ordering. Only possible for DAGs.
Application in ML: Execution order for computation graphs, resolving feature dependencies, scheduling pipeline stages.
from collections import deque
def topological_sort_kahn(adj: dict) -> list:
"""
Kahn's algorithm for topological sort.
Uses BFS with in-degree tracking.
O(V + E) time.
"""
# Compute in-degrees
all_nodes = set(adj.keys())
for neighbors in adj.values():
all_nodes.update(neighbors)
in_degree = {v: 0 for v in all_nodes}
for v in adj:
for u in adj[v]:
in_degree[u] += 1
# Start with nodes that have no prerequisites
queue = deque([v for v in all_nodes if in_degree[v] == 0])
order = []
while queue:
v = queue.popleft()
order.append(v)
for u in adj.get(v, []):
in_degree[u] -= 1
if in_degree[u] == 0:
queue.append(u)
if len(order) < len(all_nodes):
raise ValueError("Graph has a cycle - topological sort impossible")
return order
# ML pipeline as a DAG
pipeline_dag = {
'raw_data': ['clean_data'],
'clean_data': ['features_A', 'features_B'],
'features_A': ['model_input'],
'features_B': ['model_input'],
'model_input': ['train_model'],
'train_model': ['evaluate'],
'evaluate': []
}
execution_order = topological_sort_kahn(pipeline_dag)
print(f"Execution order: {execution_order}")
# ['raw_data', 'clean_data', 'features_A', 'features_B', 'model_input', 'train_model', 'evaluate']
Part 5 - Minimum Spanning Tree
A minimum spanning tree (MST) of a weighted graph is a spanning tree (connected subgraph with edges) with minimum total edge weight.
Applications in ML: Clustering, dimensionality reduction initialization, data manifold approximation.
import numpy as np
import heapq
def prim_mst(adj_weighted: dict) -> list:
"""
Prim's algorithm for Minimum Spanning Tree.
Time: O((V + E) log V) with a heap.
Returns: list of edges (u, v, weight) in the MST.
"""
if not adj_weighted:
return []
start = next(iter(adj_weighted))
visited = {start}
mst_edges = []
# Heap of (weight, u, v) - u is in MST, v is candidate
candidates = [(w, start, v) for v, w in adj_weighted[start]]
heapq.heapify(candidates)
while candidates and len(visited) < len(adj_weighted):
weight, u, v = heapq.heappop(candidates)
if v in visited:
continue
visited.add(v)
mst_edges.append((u, v, weight))
for neighbor, w in adj_weighted.get(v, []):
if neighbor not in visited:
heapq.heappush(candidates, (w, v, neighbor))
return mst_edges
# NetworkX MST
import networkx as nx
G = nx.complete_graph(10)
for u, v in G.edges():
G[u][v]['weight'] = np.random.rand()
mst = nx.minimum_spanning_tree(G, algorithm='kruskal')
print(f"MST: {mst.number_of_edges()} edges, total weight: {sum(d['weight'] for u,v,d in mst.edges(data=True)):.4f}")
# MST for clustering: edges in MST connect nearest neighbors in a specific sense
# Cutting the heaviest edges in the MST gives clusters
Part 6 - PageRank
PageRank models a random surfer who follows links randomly with probability and teleports to a random node with probability .
Definition: For node :
This is the stationary distribution of a Markov chain on the graph.
import numpy as np
def pagerank(adj: dict, alpha: float = 0.85, tol: float = 1e-8, maxiter: int = 100) -> dict:
"""
Power iteration PageRank.
alpha: damping factor (probability of following a link)
"""
nodes = list(adj.keys())
n = len(nodes)
node_idx = {v: i for i, v in enumerate(nodes)}
# Build transition matrix
out_degrees = {v: len(adj.get(v, [])) for v in nodes}
# Initialize uniform
pr = {v: 1.0 / n for v in nodes}
for iteration in range(maxiter):
pr_new = {v: (1 - alpha) / n for v in nodes}
for u in nodes:
if out_degrees.get(u, 0) > 0:
contribution = alpha * pr[u] / out_degrees[u]
for v in adj.get(u, []):
pr_new[v] = pr_new.get(v, 0) + contribution
else:
# Dangling node: distribute uniformly (teleport)
for v in nodes:
pr_new[v] += alpha * pr[u] / n
# Check convergence
diff = sum(abs(pr_new[v] - pr[v]) for v in nodes)
pr = pr_new
if diff < tol:
print(f"PageRank converged in {iteration + 1} iterations")
break
# Normalize
total = sum(pr.values())
return {v: score / total for v, score in pr.items()}
# ML applications of PageRank
def personalized_pagerank(adj: dict, seed_nodes: list, alpha: float = 0.85) -> dict:
"""
Personalized PageRank (PPR): teleport to seed nodes instead of uniform random.
Used in: node recommendation, community detection, GNN pre-training.
PPR from seed nodes captures the "influence cone" of those nodes.
"""
nodes = list(adj.keys())
n = len(nodes)
# Teleportation distribution: uniform over seed nodes
seed_set = set(seed_nodes)
teleport = {v: (1.0 / len(seed_nodes)) if v in seed_set else 0.0
for v in nodes}
pr = {v: teleport[v] for v in nodes}
out_degrees = {v: len(adj.get(v, [])) for v in nodes}
for _ in range(100):
pr_new = {v: (1 - alpha) * teleport[v] for v in nodes}
for u in nodes:
if out_degrees.get(u, 0) > 0:
contribution = alpha * pr[u] / out_degrees[u]
for v in adj.get(u, []):
pr_new[v] = pr_new.get(v, 0) + contribution
pr = pr_new
return pr
PageRank as a GNN baseline
Personalized PageRank (PPR) is directly connected to GNNs. APPNP (Approximate PPR propagation) uses PPR to define the feature propagation scheme:
This is equivalent to infinite GNN layers with PPR-weighted influence, which avoids over-smoothing.
Part 7 - Graph Feature Engineering for ML
Before training a GNN, or when a GNN is overkill, compute structural features as inputs to gradient boosted trees or linear models:
import networkx as nx
import numpy as np
import pandas as pd
def extract_node_features(G: nx.Graph) -> pd.DataFrame:
"""
Extract a comprehensive set of node-level graph features.
Used as input features for XGBoost/LightGBM on graph datasets.
"""
features = {}
# Basic degree features
features['degree'] = dict(G.degree())
features['in_degree'] = dict(G.degree()) # Same for undirected
# Centrality measures
features['degree_centrality'] = nx.degree_centrality(G)
features['closeness_centrality'] = nx.closeness_centrality(G)
# Betweenness (expensive for large graphs - use approximation)
n = G.number_of_nodes()
if n <= 5000:
features['betweenness_centrality'] = nx.betweenness_centrality(G)
else:
# Approximate with sampling
features['betweenness_centrality'] = nx.betweenness_centrality(G, k=500)
# PageRank
features['pagerank'] = nx.pagerank(G, alpha=0.85)
# Local structure
features['clustering_coefficient'] = nx.clustering(G)
triangles = nx.triangles(G)
features['triangle_count'] = triangles
# Core number (k-core decomposition: remove low-degree nodes iteratively)
features['core_number'] = nx.core_number(G)
# Aggregate into DataFrame
nodes = sorted(G.nodes())
df = pd.DataFrame({
feat_name: [feat_dict[n] for n in nodes]
for feat_name, feat_dict in features.items()
}, index=nodes)
return df
# Example usage
G = nx.karate_club_graph()
node_features_df = extract_node_features(G)
print(node_features_df.head())
print(f"Feature matrix shape: {node_features_df.shape}")
Interview Questions
Q1: Compare BFS and DFS. What are the time/space complexities and when would you use each?
| Property | BFS | DFS |
|---|---|---|
| Data structure | Queue (FIFO) | Stack (LIFO) or recursion |
| Time complexity | ||
| Space complexity | - stores entire frontier | - only current path |
| Shortest paths | Yes (unweighted) | No |
| Cycle detection | Yes | Yes |
| Topological sort | Via Kahn's algorithm | Via post-order traversal |
Use BFS when:
- Finding shortest paths (unweighted graphs)
- Exploring nodes level by level (GNN -hop neighborhoods)
- Finding the closest node to a source
- Level-order traversal of trees
- Bipartite graph detection (2-color level by level)
Use DFS when:
- Topological sort of a DAG
- Finding strongly connected components (Tarjan's algorithm)
- Cycle detection
- Maze solving (find any path, not necessarily shortest)
- Generating all connected components
Space complexity matters for large graphs: BFS stores the entire frontier (up to nodes at the widest level). DFS only stores the current path from root to the deepest visited node (). For balanced trees with nodes, depth = but frontier = at the last level - DFS is much more space-efficient. For long chains, DFS depth = and may cause stack overflow - use iterative DFS.
Q2: Walk through Dijkstra's algorithm and explain why it fails for negative-weight edges.
Algorithm walkthrough for source = A on graph A-B(weight 1), A-C(weight 4), B-C(weight 2), B-D(weight 5), C-D(weight 1):
- Initialize: distances = {A:0, B:∞, C:∞, D:∞}, heap = [(0, A)]
- Pop (0, A): process A. Update neighbors: B→1, C→4. heap = [(1,B), (4,C)]
- Pop (1, B): process B. Update: C→min(4, 1+2)=3, D→6. heap = [(3,C), (4,C), (6,D)]
- Pop (3, C): process C. Update: D→min(6, 3+1)=4. heap = [(4,C_old), (4,D)]
- Pop (4, C_old): already visited, skip.
- Pop (4, D): process D. Done.
- Final: {A:0, B:1, C:3, D:4} ✓
Why it fails for negative edges: Dijkstra's greedy correctness relies on the property: once a node is popped from the heap, its distance is finalized. This holds because all edge weights are non-negative - adding more edges can only increase the path length.
With a negative edge : When we pop B with distance 1, we update C to . But if we had already finalized C at distance 4 (when we popped it earlier), that earlier finalization was wrong.
Fix for negative edges: Use Bellman-Ford ( time) - it runs relaxation passes over all edges, allowing negative-weight corrections to propagate. For graphs without negative cycles, Bellman-Ford always finds the correct shortest paths.
In ML: Edge weights in similarity graphs are often non-negative (distances, costs), so Dijkstra is commonly applicable. For graphs where edges represent "rewards" (reinforcement learning state graphs), negative edges may appear - use Bellman-Ford or ensure non-negativity by shifting weights.
Q3: What is PageRank and how is Personalized PageRank used in GNNs?
PageRank models a random surfer who:
- With probability (damping factor, typically 0.85): follows a random outgoing link
- With probability : teleports to a uniformly random node
The stationary distribution of this Markov chain is the PageRank vector :
It assigns higher scores to nodes that are linked to by other high-score nodes.
Personalized PageRank (PPR): Teleport to a specific seed set rather than uniformly. For seed node , the PPR vector gives the "influence" of on all other nodes - nodes nearby in the graph have higher PPR scores.
PPR in GNNs (APPNP):
PPR can be seen as an infinite-depth GNN with dampened propagation:
where is the normalized adjacency. APPNP separates feature transformation from propagation:
- Transform: (MLP, no propagation)
- Propagate: (PPR propagation, no parameters)
Advantages of PPR propagation:
- Avoids over-smoothing: the factor prevents the infinitely-propagated features from converging to a constant
- Long-range dependencies: PPR reaches nodes far away with appropriately decayed influence
- No learnable parameters in propagation: faster convergence, less overfitting
Q4: How do you extract graph features for a tabular ML model when a GNN would be overkill?
When graph structure is available but labeled data is scarce or a GNN is computationally expensive, structural graph features provide powerful signals for gradient boosted trees.
Feature categories:
Node-level structural features (computed once, used as columns):
- Degree (in/out), log-degree
- Clustering coefficient (local transitivity)
- Triangle count
- Core number (k-core decomposition: robust to noise)
- PageRank, Personalized PageRank from known seed nodes
- Eigenvector centrality (for smaller graphs)
Neighborhood aggregate features (first-order neighborhood statistics):
- Mean/max/min/std of neighbor degrees
- Fraction of neighbors with a specific attribute
- Number of common neighbors between nodes u and v (for edge prediction)
Topological features (global structure):
- Eccentricity, betweenness (expensive - sample for large graphs)
- Shortest path to important nodes (e.g., "hub" nodes)
- Number of paths of length k
Feature engineering pipeline:
# 1. Compute features
features_df = extract_node_features(G) # See Part 7 above
# 2. Aggregate neighborhood features
for n in G.nodes():
nbr_degrees = [G.degree(u) for u in G.neighbors(n)]
features_df.loc[n, 'mean_nbr_degree'] = np.mean(nbr_degrees) if nbr_degrees else 0
features_df.loc[n, 'max_nbr_degree'] = max(nbr_degrees) if nbr_degrees else 0
# 3. Train XGBoost on these features
import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(features_df[train_mask], labels[train_mask])
When to use GNN vs feature engineering:
- GNN if: labeled data > 1000 nodes, GPU available, graph structure is primary signal
- Feature engineering if: small labeled set, interpretability needed, quick prototyping, tabular model already in production
Q5: What is the k-core decomposition and why is it useful for identifying influential nodes?
k-core decomposition: The -core of a graph is the maximal subgraph where every node has at least neighbors within the subgraph. Every node is assigned its core number - the maximum such that it belongs to the -core.
Algorithm: Iteratively remove nodes with degree less than . Nodes removed in round have core number .
import networkx as nx
G = nx.karate_club_graph()
core_numbers = nx.core_number(G)
print(f"Core number distribution:")
from collections import Counter
print(Counter(core_numbers.values()))
# {4: 10, 3: 15, 2: 7, 1: 2} - 10 nodes in the 4-core
Why core number is useful:
-
Robustness: Unlike degree centrality (a single node can have high degree by connecting to periphery), core number measures embeddedness in a dense subgraph. High core-number nodes are robust influencers - removing a few of their connections doesn't reduce their influence.
-
Information spreading: In epidemic/influence spreading models, nodes with high core number are more likely to be in the "giant cascade" - they sit in densely connected regions that sustain propagation.
-
Fraud detection: Fraudulent accounts often form dense clusters (they mutually follow/transact to inflate metrics). High core numbers within a suspicious cluster indicate collusion.
-
Preprocessing for GNNs: Core-number-stratified sampling ensures training batches cover both core (dense) and periphery (sparse) nodes - important for learning models that generalize to both.
-
Community detection: The -core hierarchy gives a natural multi-resolution view of community structure - the densest communities appear in the highest- cores.
Quick Reference
| Algorithm | Time | Space | Use Case |
|---|---|---|---|
| BFS | Shortest path (unweighted), component discovery | ||
| DFS | Topological sort, cycle detection, SCC | ||
| Dijkstra | Shortest path (non-negative weights) | ||
| Bellman-Ford | Shortest path (negative weights) | ||
| Topological sort | DAG execution order | ||
| Kruskal MST | Minimum spanning tree | ||
| PageRank | Node importance, recommendation | ||
| Core decomposition | Influence, dense subgraph detection |
Next: Lesson 04: Spectral Graph Theory →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Graph Algorithms demo on the EngineersOfAI Playground - no code required.
:::
