Skip to main content

Spectral Graph Theory - Graph Laplacian, Eigenvalues, and Spectral Clustering

Reading time: ~26 minutes | Level: Graph Theory → GNN Theory

To understand why Graph Convolutional Networks work - not just how to call GCNConv - you need spectral graph theory. When a GCN computes A^HW\hat{A} H W, it is applying a learnable spectral filter to graph signals. Over-smoothing is a spectral phenomenon: deep GCNs amplify low-frequency (smooth) components and suppress high-frequency ones.

Spectral clustering, one of the most reliable clustering algorithms, is an eigenvalue problem on the graph Laplacian. Node2Vec's random walks implicitly approximate the spectral structure of the graph.

This lesson gives you the mathematical foundation to reason about all of this.

What You Will Learn

  • The graph Laplacian L=DAL = D - A and its properties
  • Spectral decomposition: L=UΛUTL = U \Lambda U^T
  • Graph signals and the graph Fourier transform
  • Eigenvalues as frequency: smooth vs high-frequency graph signals
  • The Fiedler vector and graph connectivity
  • Spectral clustering: normalized cuts and the algorithm
  • Chebyshev polynomial approximation: the bridge to GCNs

Part 1 - The Graph Laplacian

Definition

For a graph G=(V,E)G = (V, E) with adjacency matrix AA and degree matrix D=diag(d1,,dn)D = \text{diag}(d_1, \ldots, d_n):

L=DAL = D - A

Explicitly: Lij={diif i=j1if (i,j)E0otherwiseL_{ij} = \begin{cases} d_i & \text{if } i = j \\ -1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}

For weighted graphs: Lij=w(i,j)L_{ij} = -w(i,j) for (i,j)E(i,j) \in E and Lii=jw(i,j)L_{ii} = \sum_j w(i,j).

import numpy as np
import networkx as nx
import scipy.sparse as sp

G = nx.karate_club_graph()

# Compute Laplacian (various methods)
A = nx.to_numpy_array(G, dtype=np.float64)
D = np.diag(A.sum(axis=1))
L = D - A # Unnormalized Laplacian

# Verify with networkx
L_nx = np.array(nx.laplacian_matrix(G).todense(), dtype=np.float64)
print(f"L matches networkx: {np.allclose(L, L_nx)}") # True

# Properties
print(f"Symmetric: {np.allclose(L, L.T)}") # True
print(f"Row sums: {L.sum(axis=1)[:5]}") # All near 0
print(f"Trace = sum of degrees: {np.trace(L):.1f}, {sum(dict(G.degree()).values())}")

Key properties of the Laplacian

Property 1: LL is symmetric and positive semi-definite: xTLx=(i,j)E(xixj)20for all xRnx^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2 \geq 0 \quad \text{for all } x \in \mathbb{R}^n

This quadratic form measures the "smoothness" of signal xx on the graph - it is zero if and only if xx is constant on each connected component.

Property 2: L1=0L \mathbf{1} = \mathbf{0} - the constant vector is always in the null space.

Property 3: The number of zero eigenvalues of LL equals the number of connected components.

# Property 1: Quadratic form measures signal variation
x = np.random.randn(G.number_of_nodes())
smoothness = x.T @ L @ x
# This equals sum over edges of (x[i] - x[j])^2
manual = sum((x[i] - x[j])**2 for i, j in G.edges())
print(f"x^T L x = {smoothness:.4f}")
print(f"Σ(xi - xj)^2 = {manual:.4f}")
print(f"Equal: {np.isclose(smoothness, manual)}") # True

# Property 3: zero eigenvalues count connected components
eigenvalues = np.linalg.eigvalsh(L) # Sorted eigenvalues
n_components = np.sum(eigenvalues < 1e-8)
print(f"\nZero eigenvalues: {n_components}") # Should equal nx.number_connected_components(G)
print(f"Connected components: {nx.number_connected_components(G)}")

Part 2 - Spectral Decomposition

Eigendecomposition of the Laplacian

Since LL is symmetric PSD, it has a full set of real, non-negative eigenvalues:

L=UΛUTL = U \Lambda U^T

where:

  • Λ=diag(λ1,λ2,,λn)\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) with 0=λ1λ2λn0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n
  • U=[u1,u2,,un]U = [u_1, u_2, \ldots, u_n] are the orthonormal eigenvectors
import numpy as np

G = nx.karate_club_graph()
L = np.array(nx.laplacian_matrix(G).todense(), dtype=np.float64)

# Full eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(L) # eigh: symmetric, returns sorted values
# eigenvalues[0] ≈ 0 (Laplacian always has 0 eigenvalue)
# eigenvectors[:, i] = i-th eigenvector

print(f"Eigenvalues (sorted): {eigenvalues[:5]}") # [~0, λ₂, λ₃, λ₄, λ₅]
print(f"Smallest eigenvalue: {eigenvalues[0]:.2e}") # near 0
print(f"Fiedler value (λ₂): {eigenvalues[1]:.4f}") # algebraic connectivity
print(f"Largest eigenvalue: {eigenvalues[-1]:.4f}")

# Verify L = U Λ U^T
Lambda = np.diag(eigenvalues)
U = eigenvectors
L_reconstructed = U @ Lambda @ U.T
print(f"Reconstruction error: {np.max(np.abs(L - L_reconstructed)):.2e}") # near 0

The Fiedler vector (second eigenvector)

The Fiedler value λ2\lambda_2 (second smallest eigenvalue) is the algebraic connectivity of the graph:

  • λ2=0\lambda_2 = 0: graph is disconnected
  • Large λ2\lambda_2: well-connected graph (hard to disconnect)
  • Small positive λ2\lambda_2: graph is barely connected (has a bottleneck)

The corresponding Fiedler vector u2u_2 identifies the optimal graph bisection:

# Fiedler vector analysis
fiedler_value = eigenvalues[1]
fiedler_vector = eigenvectors[:, 1]

print(f"Fiedler value (algebraic connectivity): {fiedler_value:.4f}")
print(f"Fiedler vector (first 10 values): {fiedler_vector[:10]}")

# Graph bisection: sign of Fiedler vector determines partition
partition_A = np.where(fiedler_vector >= 0)[0]
partition_B = np.where(fiedler_vector < 0)[0]
print(f"\nPartition A ({len(partition_A)} nodes): {sorted(partition_A)[:10]}...")
print(f"Partition B ({len(partition_B)} nodes): {sorted(partition_B)[:10]}...")

# Count cut edges
cut_edges = sum(1 for i, j in G.edges()
if (fiedler_vector[i] >= 0) != (fiedler_vector[j] >= 0))
print(f"Cut edges: {cut_edges}")

Part 3 - Graph Signals and the Graph Fourier Transform

Graph signals

A graph signal is a function f:VRf: V \to \mathbb{R} - it assigns a real number to each node. Examples:

  • Temperature at each sensor node in an IoT network
  • Infection probability at each node in an epidemic model
  • Node feature value at each dimension in a GNN

The graph Fourier transform (GFT)

The standard Fourier transform decomposes a function into sinusoidal components (eigenfunctions of the Laplacian on the real line). The Graph Fourier Transform does the same using the eigenvectors of the graph Laplacian:

f^=UTf(transform)\hat{f} = U^T f \quad \text{(transform)} f=Uf^(inverse transform)f = U \hat{f} \quad \text{(inverse transform)}

The coefficient f^k=ukTf\hat{f}_k = u_k^T f measures how much the signal ff "aligns" with eigenvector uku_k.

import numpy as np

G = nx.karate_club_graph()
L = np.array(nx.laplacian_matrix(G).todense(), dtype=np.float64)
eigenvalues, U = np.linalg.eigh(L)

# Graph signal: club membership (the Zachary karate club split)
club_membership = np.array([1 if G.nodes[n]['club'] == 'Mr. Hi' else -1
for n in sorted(G.nodes())])

# Graph Fourier Transform
f_hat = U.T @ club_membership # (n,) vector of Fourier coefficients

# Low-frequency components (small eigenvalue = smooth signal)
print("Low-frequency Fourier coefficients (smooth components):")
for i in range(5):
print(f" λ={eigenvalues[i]:.4f}: |f̂|={abs(f_hat[i]):.4f}")

# Inverse transform (reconstruction)
f_reconstructed = U @ f_hat
print(f"\nReconstruction error: {np.max(np.abs(f_reconstructed - club_membership)):.2e}")

# Low-pass filtering: keep only low-frequency components
K = 5 # Keep only first K eigenvectors
f_lowpass = U[:, :K] @ f_hat[:K]
print(f"\nLow-pass filtered signal (first 10 values):")
print(f_lowpass[:10])
# Smooth version of the original signal

Frequency interpretation

The eigenvalue λk\lambda_k is the "frequency" of eigenvector uku_k:

  • Low frequency (λk0\lambda_k \approx 0): uku_k varies slowly across the graph - neighboring nodes have similar values. These capture community structure.
  • High frequency (λkλmax\lambda_k \approx \lambda_{\max}): uku_k alternates sign between connected nodes - maximum variation across edges. These capture noise and fine-grained structure.

This is the spectral foundation of over-smoothing: repeated application of a low-pass graph convolution (like GCN) attenuates high-frequency components and amplifies low-frequency ones, eventually making all node features identical within each connected component.

Part 4 - Normalized Laplacians

Symmetric normalized Laplacian

Lsym=D1/2LD1/2=ID1/2AD1/2L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}

Eigenvalues lie in [0,2][0, 2]. Used in spectral GNNs and ChebNet.

Random walk normalized Laplacian

Lrw=D1L=ID1AL_{\text{rw}} = D^{-1} L = I - D^{-1} A

Eigenvalues lie in [0,2][0, 2]. Connection: D1AD^{-1} A is the transition matrix of a random walk on the graph.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G, dtype=np.float64)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-10))
D_inv = np.diag(1.0 / (d + 1e-10))

I = np.eye(len(G.nodes()))

L_sym = I - D_inv_sqrt @ A @ D_inv_sqrt
L_rw = I - D_inv @ A

# Eigenvalues of symmetric normalized Laplacian
evals_sym = np.linalg.eigvalsh(L_sym)
print(f"L_sym eigenvalue range: [{evals_sym.min():.4f}, {evals_sym.max():.4f}]")
# Should be in [0, 2]

# The GCN normalized adjacency à = D̃^{-1/2} (A+I) D̃^{-1/2}
A_hat = A + I
D_hat = np.diag(A_hat.sum(axis=1))
D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_hat) + 1e-10))
A_norm = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt
# A_norm = I - L_sym_with_self_loops
# Eigenvalues of A_norm ∈ [-1, 1] - bounded, training stable
evals_A_norm = np.linalg.eigvalsh(A_norm)
print(f"GCN normalized adjacency eigenvalue range: [{evals_A_norm.min():.4f}, {evals_A_norm.max():.4f}]")

Part 5 - Spectral Clustering

Spectral clustering uses the eigenvectors of the Laplacian to embed nodes into a low-dimensional space where standard clustering (e.g., k-means) works well.

The algorithm (Normalized Spectral Clustering - Shi & Malik)

1. Compute the normalized Laplacian L_sym = D^{-1/2} L D^{-1/2}
2. Compute the k smallest eigenvectors U = [u_1, ..., u_k]
3. Form matrix Y: normalize each row of U to unit length
4. Run k-means on rows of Y to get k clusters
5. Assign each node to the cluster of its row
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def spectral_clustering(G: nx.Graph, n_clusters: int) -> np.ndarray:
"""
Normalized spectral clustering (Shi-Malik / Ng-Jordan-Weiss).
"""
n = G.number_of_nodes()

# Build sparse normalized Laplacian
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

L_sparse = sp.csr_matrix(nx.normalized_laplacian_matrix(G).astype(np.float64))

# Compute k smallest eigenvectors (excluding λ=0 for connected graphs)
# sigma=-0.01 shifts so we find eigenvalues near 0 efficiently
eigenvalues, eigenvectors = eigsh(L_sparse, k=n_clusters, which='SM')

# Sort by eigenvalue
idx = np.argsort(eigenvalues)
U = eigenvectors[:, idx] # (n, k) matrix of eigenvectors

# Normalize rows to unit length (Ng-Jordan-Weiss step)
Y = normalize(U, norm='l2', axis=1)

# K-means on the embedding
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
labels = kmeans.fit_predict(Y)

return labels

# Example: karate club graph - should recover the two factions
G = nx.karate_club_graph()
labels = spectral_clustering(G, n_clusters=2)

# Ground truth: club membership
true_labels = np.array([1 if G.nodes[n]['club'] == 'Mr. Hi' else 0
for n in sorted(G.nodes())])

# Compute accuracy (handle label permutation)
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(true_labels, labels)
print(f"Spectral clustering Adjusted Rand Index: {ari:.4f}")
# Should be high (~0.8+) for the karate club graph

print(f"Cluster 0: {np.where(labels==0)[0][:10]}...")
print(f"Cluster 1: {np.where(labels==1)[0][:10]}...")

Why spectral clustering works: normalized cuts

The normalized cut (NCut) objective minimizes:

NCut(A,B)=cut(A,B)vol(A)+cut(A,B)vol(B)\text{NCut}(A, B) = \frac{\text{cut}(A, B)}{\text{vol}(A)} + \frac{\text{cut}(A, B)}{\text{vol}(B)}

where cut(A,B)=iA,jBwij\text{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij} and vol(A)=iAdi\text{vol}(A) = \sum_{i \in A} d_i.

Finding the exact minimum is NP-hard. The spectral relaxation (replacing the binary partition vector with a continuous vector and solving the resulting generalized eigenvalue problem) gives the Fiedler vector - the 2nd eigenvector of LrwL_{\text{rw}}.

Part 6 - Connection to Graph Neural Networks

Spectral graph convolution

Classical signal processing: convolution in time domain = pointwise multiplication in frequency domain.

Graph convolution (in spectral domain): For filter gθg_\theta parameterized by θ\theta:

gθx=Ugθ(Λ)UTxg_\theta * x = U g_\theta(\Lambda) U^T x

where gθ(Λ)=diag(gθ(λ1),,gθ(λn))g_\theta(\Lambda) = \text{diag}(g_\theta(\lambda_1), \ldots, g_\theta(\lambda_n)) applies the filter to each frequency component.

Problem: Computing UU and UTxU^T x requires full eigendecomposition - O(n3)O(n^3). Intractable for large graphs.

ChebNet: Chebyshev polynomial approximation

ChebNet (Defferrard et al., 2016) approximates the spectral filter with Chebyshev polynomials:

gθ(Λ)k=0KθkTk(Λ~)g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})

where Λ~=2Λ/λmaxI\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I (eigenvalues scaled to [1,1][-1, 1]) and TkT_k is the kk-th Chebyshev polynomial.

Using the recurrence Tk(L~)x=2L~Tk1(L~)xTk2(L~)xT_k(\tilde{L}) x = 2\tilde{L} T_{k-1}(\tilde{L}) x - T_{k-2}(\tilde{L}) x:

  • No eigendecomposition needed!
  • KK sparse matrix-vector products with L~\tilde{L}
  • KK-localized in the graph (aggregates KK-hop neighborhoods)
import torch
import torch.nn as nn
from torch_geometric.nn import ChebConv, GCNConv

# GCN (Kipf & Welling 2017): simplification of ChebNet with K=1
# K=1 Chebyshev approximation gives the GCN propagation rule:
# Z = à H W where à = D̃^{-1/2} (A+I) D̃^{-1/2}

# GCN layer
class GCNLayer(nn.Module):
def __init__(self, in_channels: int, out_channels: int):
super().__init__()
self.linear = nn.Linear(in_channels, out_channels, bias=False)

def forward(self, x, A_norm):
# x: (n, in_channels)
# A_norm: (n, n) normalized adjacency with self-loops
return torch.relu(A_norm @ self.linear(x))

# PyTorch Geometric implementation
gcn_conv = GCNConv(in_channels=64, out_channels=32)
cheb_conv = ChebConv(in_channels=64, out_channels=32, K=3) # K=3 hops

Over-smoothing: a spectral perspective

With LL layers of GCN propagation, the feature matrix becomes:

H(L)=A^LH(0)W(0)W(1)W(L1)H^{(L)} = \hat{A}^L H^{(0)} W^{(0)} W^{(1)} \cdots W^{(L-1)}

The spectral effect: A^L\hat{A}^L has eigenvalues μiL\mu_i^L where μi\mu_i are the eigenvalues of A^[1,1]\hat{A} \in [-1, 1]. As LL \to \infty:

  • μi<1|\mu_i| < 1: eigenvalues decay to 0 - high-frequency components vanish
  • μi=1\mu_i = 1: only the constant (DC) component survives

All node features converge to the same value - over-smoothing. After ~5-10 GCN layers, nodes in the same connected component become indistinguishable.

import numpy as np

def simulate_over_smoothing(A_norm: np.ndarray, H: np.ndarray, n_layers: int) -> list:
"""
Simulate GCN feature propagation and measure over-smoothing.
Over-smoothing metric: mean pairwise feature distance.
"""
from scipy.spatial.distance import pdist

feature_diversity = [pdist(H).mean()]

H_current = H.copy()
for layer in range(n_layers):
H_current = A_norm @ H_current # No weights - pure propagation
diversity = pdist(H_current).mean()
feature_diversity.append(diversity)

return feature_diversity

# Run on karate club graph
G = nx.karate_club_graph()
A = nx.to_numpy_array(G, dtype=np.float64)
n = A.shape[0]
A_hat = A + np.eye(n)
D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt

H0 = np.random.randn(n, 8) # Random initial features
diversity = simulate_over_smoothing(A_norm, H0, n_layers=50)

print("Feature diversity (mean pairwise distance) vs layers:")
for layer in [0, 1, 2, 5, 10, 20, 50]:
if layer < len(diversity):
print(f" Layer {layer:3d}: diversity = {diversity[layer]:.6f}")
# Diversity converges to near 0 after ~20 layers

Interview Questions

Q1: What is the graph Laplacian and what do its eigenvalues represent?

The graph Laplacian is L=DAL = D - A where DD is the diagonal degree matrix and AA is the adjacency matrix.

Key properties:

  • Symmetric, positive semi-definite
  • Row sums are zero: L1=0L \mathbf{1} = \mathbf{0}
  • Quadratic form: xTLx=(i,j)E(xixj)2x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2 - measures signal smoothness

Eigenvalue interpretation: Since LL is PSD, eigenvalues satisfy 0=λ1λ2λn0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n.

  • λ1=0\lambda_1 = 0: always (constant vector is always in null space). The corresponding eigenvector u1=1/nu_1 = \mathbf{1}/\sqrt{n} is constant.
  • Number of zero eigenvalues = number of connected components. If GG is disconnected, multiple eigenvectors are constant on each component.
  • λ2\lambda_2 (Fiedler value) = algebraic connectivity. λ2=0\lambda_2 = 0 iff graph is disconnected. Larger λ2\lambda_2 = better connected, harder to bisect.
  • λmax\lambda_{\max}: bounded by 2×maxidi2 \times \max_i d_i for irregular graphs. For regular graphs, λmax2d\lambda_{\max} \leq 2d.

Frequency analogy: Eigenvector uku_k (corresponding to λk\lambda_k) is the kk-th "frequency mode" of the graph. Small λk\lambda_k → smooth mode (neighbors have similar values). Large λk\lambda_k → oscillating mode (neighbors have opposite values).

The graph Fourier transform f^=UTf\hat{f} = U^T f decomposes any graph signal into contributions from each frequency mode, exactly like the classical Fourier transform on R\mathbb{R}.

Q2: Why does spectral clustering work? What does minimizing the normalized cut have to do with eigenvalues?

Normalized cut (NCut) measures partition quality: NCut(A,B)=cut(A,B)vol(A)+cut(A,B)vol(B)\text{NCut}(A, B) = \frac{\text{cut}(A, B)}{\text{vol}(A)} + \frac{\text{cut}(A, B)}{\text{vol}(B)}

A good partition has few edges between clusters (small cut) and balanced cluster sizes (large volumes). Minimizing NCut finds tight communities with clean separations.

The eigenvalue connection: Let h{0,1}nh \in \{0, 1\}^n be the indicator vector for partition AA. Then: NCut(A,B)=hTLhhTDh\text{NCut}(A, B) = \frac{h^T L h}{h^T D h}

Minimizing this ratio over binary hh is the Rayleigh quotient problem - but with a binary constraint, it is NP-hard.

The spectral relaxation drops the binary constraint and minimizes over continuous hh: minhhTLhhTDhs.t.hTDh=1\min_{h} \frac{h^T L h}{h^T D h} \quad \text{s.t.} \quad h^T D h = 1

This is a generalized eigenvalue problem Lh=λDhL h = \lambda D h, equivalent to the standard eigenvalue problem D1/2LD1/2v=λvD^{-1/2} L D^{-1/2} v = \lambda v for LsymL_{\text{sym}}.

The solution is the Fiedler vector (2nd eigenvector of Lrw=D1LL_{\text{rw}} = D^{-1}L) - the continuous relaxation of the optimal bisection. Thresholding (sign or k-means) recovers a discrete partition.

For k>2k > 2 clusters: use kk eigenvectors of LsymL_{\text{sym}} (the kk smallest non-zero eigenvalues). Each eigenvector captures a different "split" of the graph - together they embed nodes into Rk\mathbb{R}^k where k-means can find kk clusters.

Q3: What is the graph Fourier transform and how does it connect to GNN convolutions?

Classical Fourier transform: Decomposes a function f:RRf: \mathbb{R} \to \mathbb{R} into sinusoids (eigenfunctions of the Laplacian d2/dx2-d^2/dx^2). High-frequency components correspond to rapid oscillations.

Graph Fourier Transform (GFT): Decomposes a graph signal f:VRf: V \to \mathbb{R} into eigenvectors of the graph Laplacian L=UΛUTL = U\Lambda U^T: f^=UTf,f=Uf^\hat{f} = U^T f, \quad f = U \hat{f}

Frequency = eigenvalue: small λ\lambda → smooth signal, large λ\lambda → rapidly alternating signal.

Graph convolution: In classical signal processing, convolution is pointwise multiplication in the Fourier domain. For graphs: gθf=Ugθ(Λ)UTfg_\theta * f = U g_\theta(\Lambda) U^T f

where gθ(Λ)=diag(gθ(λ1),,gθ(λn))g_\theta(\Lambda) = \text{diag}(g_\theta(\lambda_1), \ldots, g_\theta(\lambda_n)) is a learnable spectral filter.

The GCN simplification (Kipf & Welling, 2017):

  1. Approximate spectral filter by a polynomial: gθ(λ)θ0+θ1λg_\theta(\lambda) \approx \theta_0 + \theta_1 \lambda
  2. Use scaled Laplacian L~=2L/λmaxI\tilde{L} = 2L/\lambda_{\max} - I so eigenvalues in [1,1][-1,1]
  3. With λmax2\lambda_{\max} \approx 2 (empirical): gθ(λ)θ0θ1(IAnorm)g_\theta(\lambda) \approx \theta_0 - \theta_1 (I - A_{\text{norm}})
  4. Simplify to single parameter θ=θ0=θ1\theta = \theta_0 = -\theta_1: gθ(A^)=θA^g_\theta(\hat{A}) = \theta \hat{A}

This gives the GCN layer: H(l+1)=σ(A^H(l)W(l))H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)}) - the famous normalized adjacency propagation, derived from spectral graph convolution theory.

Q4: What is over-smoothing in GNNs and how does spectral theory explain it?

Over-smoothing in GNNs: as the number of layers increases, all node features converge to the same vector, making nodes indistinguishable. Classification accuracy typically peaks at 2-4 layers and degrades with more layers.

Spectral explanation: The GCN propagation H(L)=A^LH(0)WH^{(L)} = \hat{A}^L H^{(0)} \cdot W applies the matrix A^L\hat{A}^L to the initial features. A^\hat{A} has eigenvalues in (1,1](-1, 1].

After LL steps:

  • Eigenvalue μi\mu_iμiL\mu_i^L
  • For μi<1|\mu_i| < 1: μiL0\mu_i^L \to 0 exponentially - these eigenvector components are suppressed
  • For μi=1\mu_i = 1: μiL=1\mu_i^L = 1 - only the constant (lowest-frequency) component survives

As LL \to \infty, H(L)H^{(L)} converges to α11TH(0)W\alpha \mathbf{1} \mathbf{1}^T H^{(0)} \cdot W - all nodes have the same feature vector (the global mean, modulated by eigenvector of eigenvalue 1).

Solutions:

  1. JK-Net (Jumping Knowledge Networks): Concatenate features from all layers - each layer captures different frequency range.
  2. APPNP / PPR propagation: Uses (1α)(1-\alpha) damping to prevent eigenvalue-1 components from dominating.
  3. DropEdge: Randomly drops edges during training - reduces effective number of propagation steps.
  4. PairNorm: Normalizes features to maintain distance distribution between nodes.
  5. ResGCN: Skip connections prevent features from converging - gradient "highway" bypasses smoothing.
Q5: How does the Fiedler value quantify graph connectivity? What does it mean when λ₂ is small vs large?

The Fiedler value λ2\lambda_2 (second smallest eigenvalue of LL) is the algebraic connectivity of the graph.

Formal definition: From the variational characterization: λ2=minx1,x=1xTLx=minx1,x=1(i,j)E(xixj)2\lambda_2 = \min_{x \perp \mathbf{1}, \|x\|=1} x^T L x = \min_{x \perp \mathbf{1}, \|x\|=1} \sum_{(i,j) \in E} (x_i - x_j)^2

This is the minimum "variation" of any non-constant signal on the graph.

Interpretation:

  • λ2=0\lambda_2 = 0: Graph is disconnected (two components can have constant signals with no variation across the cut).
  • Small positive λ2\lambda_2 (λ21\lambda_2 \ll 1): Bottleneck graph. There is a cut set of few edges whose removal disconnects the graph. The Fiedler vector u2u_2 identifies this bottleneck. Examples: two cliques connected by a single edge, long chains.
  • Large λ2\lambda_2: Well-connected graph. Expander graph. Random walk mixes quickly (mixing time=O(λ21logn)\text{mixing time} = O(\lambda_2^{-1} \log n)). Hard to partition into two balanced halves.

Cheeger inequality (connecting λ2\lambda_2 to the graph's conductance ϕ\phi): ϕ22λ22ϕ\frac{\phi^2}{2} \leq \lambda_2 \leq 2\phi

where ϕ=minScut(S,VS)min(vol(S),vol(VS))\phi = \min_S \frac{\text{cut}(S, V \setminus S)}{\min(\text{vol}(S), \text{vol}(V \setminus S))} is the conductance.

ML applications:

  • Random walk mixing time: O(λ21logn)O(\lambda_2^{-1} \log n) - relevant for node2vec/DeepWalk-based GNN pretraining
  • GNN information propagation: small λ2\lambda_2 means information between distant communities travels slowly through GNN layers
  • Graph generation quality: checking that generated graphs have similar λ2\lambda_2 to real graphs

Quick Reference

QuantityFormulaInterpretation
Graph LaplacianL=DAL = D - ADiscrete second derivative on graph
Quadratic formxTLx=(i,j)(xixj)2x^T L x = \sum_{(i,j)} (x_i - x_j)^2Signal smoothness
Eigenvalues λ1λn\lambda_1 \leq \ldots \leq \lambda_nAlways λ1=0\lambda_1 = 0Frequency spectrum
Number of zero eigenvalues-Number of connected components
Fiedler value λ2\lambda_2Second smallest eigenvalueAlgebraic connectivity
Fiedler vector u2u_2Second eigenvectorOptimal bisection direction
Graph Fourier Transformf^=UTf\hat{f} = U^T fDecompose signal into modes
GCN propagationA^=D1/2(A+I)D1/2\hat{A} = D^{-1/2}(A+I)D^{-1/2}K=1 Chebyshev filter
Over-smoothingA^L\hat{A}^L \to constant as LL \to \inftyLow-pass filtering effect

Next: Lesson 05: Random Graphs and Network Models →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Spectral Graph Theory demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.