Sparse Matrix Methods - CSR/CSC Formats, Efficient Operations, and ML Sparsity
Reading time: ~22 minutes | Level: Numerical Methods → Systems Engineering
A BERT model with 512-token sequences has an attention mask of shape (512, 512). In a batch of 32, that is 32 attention masks - but in a typical batch, 30–50% of tokens are padding. Storing all 32 × 512 × 512 = 8.4 million mask values as dense float32 wastes memory, and applying dense attention to padded tokens wastes compute.
A social network graph has 1 billion users. The adjacency matrix would be - bytes in dense float32. But the average user has 150 friends - only 0.000000015% of the matrix is non-zero. Sparse formats store only those 150 values.
Sparsity is everywhere in ML. Understanding sparse matrix formats is the difference between systems that scale to billions of nodes and those that crash with out-of-memory errors.
What You Will Learn
- Why dense formats waste memory and compute for sparse data
- CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column) formats
- COO, LIL, DOK: when to use each format
- SciPy sparse matrix operations: construction, arithmetic, solvers
- Attention masks and causal masking in transformers
- Sparse graph adjacency matrices for GNNs
- Sparse embedding tables in recommendation systems
- Model pruning: inducing and leveraging weight sparsity
Part 1 - The Density of Real ML Matrices
A matrix with rows, columns, and non-zero entries has density .
| Matrix | Size | Non-zeros | Density | Dense Memory | Sparse Memory |
|---|---|---|---|---|---|
| Attention mask (512 tokens, 50% padding) | 512×512 | ~65,536 | ~25% | 1 MB | 0.25 MB |
| Document-term matrix (1M docs, 100K vocab) | 1M×100K | ~5×10⁸ | 0.5% | 400 GB | 2 GB |
| Social graph adjacency (1B users, avg 150) | 1B×1B | 1.5×10¹¹ | ~0% | 4 EB | 1.2 TB |
| Pruned neural network (90% sparsity) | varies | 10% nnz | 10% | 100 MB | 10 MB |
| Graph Laplacian (citation network) | 1M×1M | ~10M | 0.001% | 4 TB | 80 MB |
For density below ~5%, sparse formats win on both memory and compute.
Part 2 - Sparse Storage Formats
COO (Coordinate Format)
The simplest format: store three arrays - row, col, data.
import numpy as np
# Dense matrix:
# [[5, 0, 0, 0],
# [0, 8, 0, 3],
# [0, 0, 3, 0],
# [0, 6, 0, 0]]
row = np.array([0, 1, 1, 2, 3]) # row indices
col = np.array([0, 1, 3, 2, 1]) # column indices
data = np.array([5, 8, 3, 3, 6]) # values
# Memory: 3 × nnz × dtype_size = 3 × 5 × 4 = 60 bytes
# Dense equivalent: 4×4×4 = 64 bytes (barely smaller here, but scales)
COO advantages:
- Easy to construct incrementally (append non-zeros)
- Easy to convert to other formats
- Supports duplicate entries (values are summed) - useful for finite elements
COO disadvantages:
- Row/column slicing is slow ()
- Matrix-vector product requires sorting or conversion first
CSR (Compressed Sparse Row)
CSR is the most widely used format for matrix-vector products and row slicing.
Storage:
data[k]: value of the -th non-zeroindices[k]: column index of the -th non-zeroindptr[i]: position indata/indiceswhere row begins
import scipy.sparse as sp
import numpy as np
# Same matrix as above in CSR:
# Row 0: (0, 5) → data=[5], indices=[0]
# Row 1: (1, 8), (3, 3) → data=[8, 3], indices=[1, 3]
# Row 2: (2, 3) → data=[3], indices=[2]
# Row 3: (1, 6) → data=[6], indices=[1]
data = np.array([5, 8, 3, 3, 6], dtype=np.float32)
indices = np.array([0, 1, 3, 2, 1], dtype=np.int32)
indptr = np.array([0, 1, 3, 4, 5], dtype=np.int32)
# indptr[i] = position where row i starts
# indptr[0] = 0: row 0 starts at position 0
# indptr[1] = 1: row 1 starts at position 1
# indptr[2] = 3: row 2 starts at position 3
# indptr[3] = 4: row 3 starts at position 4
# indptr[4] = 5: end of data
A_csr = sp.csr_matrix((data, indices, indptr), shape=(4, 4))
# Memory analysis
print(f"nnz = {A_csr.nnz}")
print(f"data: {A_csr.data.nbytes} bytes")
print(f"indices: {A_csr.indices.nbytes} bytes")
print(f"indptr: {A_csr.indptr.nbytes} bytes")
total = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
print(f"Total CSR: {total} bytes")
print(f"Dense equivalent: {4*4*4} bytes")
Row access: data[indptr[i]:indptr[i+1]] and indices[indptr[i]:indptr[i+1]] - , constant time per non-zero.
Matrix-vector product :
For each row i:
y[i] = sum over k in [indptr[i], indptr[i+1]):
data[k] * x[indices[k]]
This is - only non-zero elements participate. Compare to dense .
CSC (Compressed Sparse Column)
Exactly like CSR but organized by columns instead of rows:
indptr[j]toindptr[j+1]: elements in columnindices[k]: row index of the -th non-zero
When to use CSC:
- Efficient column slicing
- Left-multiplication: (equivalent to )
- Solving with sparse factorization (UMFPACK, CHOLMOD use CSC internally)
# CSC from same matrix
A_csc = A_csr.tocsc()
print(f"CSC format: {A_csc}")
# Column 1: rows 1 and 3 have non-zeros
col1 = A_csc.getcol(1).toarray()
print(col1)
Other formats
import scipy.sparse as sp
# LIL (List of Lists): best for incremental construction
A_lil = sp.lil_matrix((1000, 1000), dtype=np.float32)
# Assign individual elements efficiently
for i in range(100):
A_lil[i, i] = 1.0
A_lil[i, i+1] = -0.5
# Convert to CSR for arithmetic
A_csr_from_lil = A_lil.tocsr()
# DOK (Dictionary of Keys): O(1) random access, good for construction
A_dok = sp.dok_matrix((1000, 1000), dtype=np.float32)
A_dok[5, 7] = 3.14 # O(1) insertion
# BSR (Block Sparse Row): for block-structured problems
# Common in FEM and some neural network weight matrices
Format selection guide
Constructing the matrix?
├── Incrementally (append entries one at a time)
│ └── Use LIL → convert to CSR when done
│
├── From (row, col, data) triplets
│ └── Use COO → tocsr() or tocsc()
│
└── Already know CSR structure
└── Build CSR directly (fastest construction)
Doing computations?
├── Matrix-vector multiply (Ax)
│ └── CSR (row-oriented - cache friendly for result)
│
├── Vector-matrix multiply (x^T A)
│ └── CSC (column-oriented - same as A^T x with CSR)
│
├── Sparse linear system solve
│ └── CSC (most sparse solvers use CSC internally)
│
└── Random access A[i, j]
└── CSR or CSC (O(log(nnz per row/col)))
Part 3 - Sparse Operations in SciPy
import scipy.sparse as sp
import numpy as np
# ── Construction from dense ────────────────────────────────────────────────
A_dense = np.array([[1, 0, 2], [0, 3, 0], [4, 0, 5]], dtype=np.float32)
A_sparse = sp.csr_matrix(A_dense)
# ── Basic operations ───────────────────────────────────────────────────────
# Matrix-vector product
x = np.array([1.0, 2.0, 3.0])
y = A_sparse @ x # Same as A_sparse.dot(x)
print(f"Sparse Ax: {y}")
# Matrix-matrix product (result may not be sparse!)
B_sparse = sp.csr_matrix(np.eye(3, dtype=np.float32))
C = A_sparse @ B_sparse # Returns sparse if result is sparse
print(f"Sparse A@B type: {type(C)}")
# ── Arithmetic ─────────────────────────────────────────────────────────────
A2 = A_sparse + A_sparse # Sparse addition
A3 = 2.5 * A_sparse # Scalar multiplication
# ── Slicing ───────────────────────────────────────────────────────────────
row_0 = A_sparse[0, :] # Row 0 - O(nnz in row)
col_1 = A_sparse[:, 1] # Column 1
# ── Conversion ────────────────────────────────────────────────────────────
A_csc = A_sparse.tocsc()
A_coo = A_sparse.tocoo()
A_back_dense = A_sparse.toarray() # Convert back to dense
# ── Sparse linear system ──────────────────────────────────────────────────
from scipy.sparse.linalg import spsolve
n = 1000
# Create sparse positive definite system (tridiagonal)
diag_main = 4 * np.ones(n)
diag_off = -1 * np.ones(n-1)
A_system = sp.diags([diag_off, diag_main, diag_off], [-1, 0, 1],
format='csr', dtype=np.float64)
b = np.random.randn(n)
x_solution = spsolve(A_system, b)
print(f"Residual: {np.linalg.norm(b - A_system @ x_solution):.2e}")
# ── Sparse eigenvalue computation ─────────────────────────────────────────
from scipy.sparse.linalg import eigsh
# Find 5 largest eigenvalues of large sparse SPD matrix
eigenvalues, eigenvectors = eigsh(A_system, k=5, which='LM') # Largest Magnitude
print(f"5 largest eigenvalues: {eigenvalues}")
Part 4 - Attention Masks in Transformers
The attention mask is one of the most common sparse structures in modern ML.
Dense vs sparse attention masks
import numpy as np
import scipy.sparse as sp
import torch
def create_causal_mask_dense(seq_len: int) -> np.ndarray:
"""
Dense causal (autoregressive) attention mask.
Upper triangular values are -inf (or 0 for boolean mask).
Memory: seq_len^2 × 4 bytes
"""
mask = np.full((seq_len, seq_len), -np.inf, dtype=np.float32)
mask = np.tril(mask + np.inf) # Upper tri = -inf, lower+diag = 0
return mask
def create_causal_mask_sparse(seq_len: int) -> sp.spmatrix:
"""
Sparse causal mask - only store the non-inf positions.
Memory: nnz × (4 + 4) bytes (data + column index)
"""
# Lower triangular part has nnz = seq_len*(seq_len+1)/2 entries
rows = []
cols = []
for i in range(seq_len):
for j in range(i + 1): # j <= i (lower triangular)
rows.append(i)
cols.append(j)
data = np.ones(len(rows), dtype=np.float32)
return sp.csr_matrix((data, (rows, cols)), shape=(seq_len, seq_len))
# Memory comparison
seq_len = 2048
dense_mask = create_causal_mask_dense(seq_len)
dense_bytes = dense_mask.nbytes
sparse_mask = create_causal_mask_sparse(seq_len)
sparse_bytes = sparse_mask.data.nbytes + sparse_mask.indices.nbytes + sparse_mask.indptr.nbytes
print(f"Dense causal mask ({seq_len}×{seq_len}): {dense_bytes / 1e6:.1f} MB")
print(f"Sparse causal mask: {sparse_bytes / 1e6:.1f} MB")
print(f"Memory ratio: {dense_bytes / sparse_bytes:.1f}x")
# For 2048 tokens: dense=16 MB, sparse=8 MB (about 2x for triangular mask)
# For 8192 tokens: dense=256 MB, sparse=128 MB
Sparse padding mask
def apply_sparse_padding_mask(
attention_scores: torch.Tensor, # (batch, heads, seq_len, seq_len)
padding_mask: torch.Tensor, # (batch, seq_len) - True = pad token
fill_value: float = -1e9
) -> torch.Tensor:
"""
Apply padding mask without materializing full (batch, heads, seq_len, seq_len) tensor.
"""
# Expand padding mask to match attention scores shape
# padding_mask: (batch, 1, 1, seq_len) after unsqueeze
expanded_mask = padding_mask[:, None, None, :] # broadcast over heads and query positions
attention_scores = attention_scores.masked_fill(expanded_mask, fill_value)
return attention_scores
# Flash Attention (implemented in PyTorch 2.x) handles sparsity automatically
# It processes attention in tiles and skips all-masked tiles
Part 5 - Sparse Graph Adjacency Matrices
For graph neural networks, the adjacency matrix and normalized Laplacian are stored in sparse formats.
import scipy.sparse as sp
import numpy as np
def build_graph_adjacency(edges: list, n_nodes: int) -> sp.csr_matrix:
"""
Build sparse adjacency matrix from edge list.
Typical for GNN preprocessing.
edges: list of (u, v) tuples (undirected)
"""
rows, cols, data = [], [], []
for u, v in edges:
rows.append(u); cols.append(v); data.append(1.0)
rows.append(v); cols.append(u); data.append(1.0) # symmetric
return sp.csr_matrix(
(data, (rows, cols)),
shape=(n_nodes, n_nodes),
dtype=np.float32
)
def normalize_adjacency_gcn(A: sp.spmatrix) -> sp.spmatrix:
"""
GCN normalization: Ã = D^{-1/2} (A + I) D^{-1/2}
Where D is the degree matrix of (A + I).
This is the propagation rule in Kipf & Welling (2017).
"""
# Add self-loops: Â = A + I
A_hat = A + sp.eye(A.shape[0], dtype=np.float32)
# Degree matrix D (as vector of diagonal entries)
degrees = np.array(A_hat.sum(axis=1)).flatten()
d_inv_sqrt = np.power(degrees, -0.5)
d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0 # Handle isolated nodes
# D^{-1/2} as sparse diagonal matrix
D_inv_sqrt = sp.diags(d_inv_sqrt)
# Symmetrically normalized: D^{-1/2} Â D^{-1/2}
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
return A_norm.tocsr()
def sparse_graph_conv(
A_norm: sp.csr_matrix,
H: np.ndarray, # (n_nodes, in_features)
W: np.ndarray # (in_features, out_features)
) -> np.ndarray:
"""
One GCN layer: H_new = σ(Ã H W)
Sparse A_norm × dense H - key operation in GNNs.
"""
# Sparse-dense matrix multiply: O(nnz × out_features)
AH = A_norm @ H # (n_nodes, in_features) - sparse × dense
return np.maximum(0, AH @ W) # ReLU activation
# Example: Cora citation network
# 2708 nodes, 5429 edges - typical benchmark
n_nodes = 2708
n_features = 1433
n_classes = 7
# Random graph for illustration (use real datasets in practice)
np.random.seed(42)
n_edges = 5429
edge_list = [(np.random.randint(0, n_nodes), np.random.randint(0, n_nodes))
for _ in range(n_edges)]
A = build_graph_adjacency(edge_list, n_nodes)
A_norm = normalize_adjacency_gcn(A)
print(f"Adjacency matrix: {A.shape}, {A.nnz} non-zeros, density={A.nnz/(n_nodes**2):.4f}")
print(f"Memory (CSR): {(A.data.nbytes + A.indices.nbytes + A.indptr.nbytes) / 1e3:.1f} KB")
print(f"Memory (dense equivalent): {n_nodes**2*4 / 1e6:.1f} MB")
# GCN forward pass
H = np.random.randn(n_nodes, n_features).astype(np.float32)
W1 = np.random.randn(n_features, 64).astype(np.float32) * 0.01
H1 = sparse_graph_conv(A_norm, H, W1)
print(f"After GCN layer 1: {H1.shape}") # (2708, 64)
Part 6 - Sparse Embeddings in Recommendation Systems
Recommendation systems have huge embedding tables with billions of items. Most interactions are sparse.
import numpy as np
import scipy.sparse as sp
class SparseEmbeddingTable:
"""
Memory-efficient sparse embedding lookup.
Used in collaborative filtering and recommendation systems.
"""
def __init__(self, n_items: int, n_users: int, embed_dim: int):
self.n_items = n_items
self.n_users = n_users
self.embed_dim = embed_dim
# User-item interaction matrix (sparse)
# For 1M users × 10M items × avg 100 interactions:
# Dense: 1M × 10M × 4 bytes = 40 TB
# Sparse: 100M × (4+4+4) bytes ≈ 1.2 GB
self.interactions = sp.csr_matrix(
(n_users, n_items), dtype=np.float32
)
def add_interactions(self, user_ids, item_ids, values):
"""Add user-item interactions."""
interaction_update = sp.csr_matrix(
(values, (user_ids, item_ids)),
shape=(self.n_users, self.n_items)
)
self.interactions += interaction_update
def user_based_cf(self, user_id: int, k: int = 10) -> np.ndarray:
"""
Find top-k items for user via item-based collaborative filtering.
Uses sparse matrix operations for efficiency.
"""
# Get user's interaction vector (sparse)
user_vec = self.interactions[user_id] # (1, n_items) sparse
# Compute similarity to all users using sparse dot product
# (n_users, n_items) × (n_items, 1) = (n_users, 1)
similarities = self.interactions @ user_vec.T # sparse × sparse = sparse
# Get most similar users (excluding self)
sims = similarities.toarray().flatten()
sims[user_id] = -np.inf
similar_users = np.argsort(sims)[-k:]
# Aggregate their items
neighbor_interactions = self.interactions[similar_users]
scores = np.array(neighbor_interactions.sum(axis=0)).flatten()
# Remove items user already interacted with
user_items = self.interactions[user_id].nonzero()[1]
scores[user_items] = -np.inf
return np.argsort(scores)[-k:][::-1]
Part 7 - Sparse Weight Matrices in Pruned Networks
Model pruning induces weight sparsity. A pruned network can be represented with sparse matrices for efficient inference.
import numpy as np
import scipy.sparse as sp
def magnitude_prune(weight: np.ndarray, sparsity: float) -> sp.csr_matrix:
"""
Magnitude-based pruning: set smallest weights to zero.
sparsity: fraction of weights to prune (e.g., 0.9 = 90% sparsity)
"""
threshold = np.percentile(np.abs(weight), 100 * sparsity)
mask = np.abs(weight) >= threshold
pruned = weight * mask
return sp.csr_matrix(pruned)
def structured_sparsity(weight: np.ndarray, n_heads: int) -> sp.bsr_matrix:
"""
Block-structured sparsity for GPU efficiency.
Prunes entire attention heads or weight blocks.
BSR (Block Sparse Row) format is better than CSR for structured patterns.
"""
block_size = weight.shape[0] // n_heads
blocks = weight.reshape(n_heads, block_size, -1)
# Prune heads with smallest L2 norm
head_norms = np.linalg.norm(blocks, axis=(1, 2))
prune_threshold = np.percentile(head_norms, 50) # Prune 50% of heads
for i in range(n_heads):
if head_norms[i] < prune_threshold:
blocks[i] = 0.0
pruned = blocks.reshape(weight.shape)
return sp.bsr_matrix(pruned, blocksize=(block_size, block_size))
# Memory comparison for a pruned transformer layer
d_model = 1024
weight = np.random.randn(d_model, d_model).astype(np.float32)
for sparsity_level in [0.5, 0.9, 0.99]:
sparse_w = magnitude_prune(weight, sparsity_level)
dense_bytes = weight.nbytes
sparse_bytes = sparse_w.data.nbytes + sparse_w.indices.nbytes + sparse_w.indptr.nbytes
print(f"Sparsity {sparsity_level:.0%}: dense={dense_bytes/1e6:.1f} MB, "
f"sparse={sparse_bytes/1e6:.1f} MB, "
f"ratio={dense_bytes/sparse_bytes:.1f}x")
:::warning Sparse matrices are not always faster in practice Sparse-dense matrix multiplication often has poor cache performance due to irregular memory access patterns. For GPUs, dense operations with cuBLAS use tensor cores that achieve teraflop throughput - sparse equivalents may actually be slower until sparsity exceeds ~95%.
Modern solutions: structured sparsity (block-sparse, N:M sparsity supported by NVIDIA Ampere+ GPUs), or semi-structured sparsity where 50% of every 2×2 block is zero - supported natively by Ampere's sparse tensor cores. :::
Interview Questions
Q1: Explain the CSR format. What is stored in data, indices, and indptr?
CSR (Compressed Sparse Row) stores a sparse matrix in three arrays:
data[k]: The value of the -th non-zero entry (reading left-to-right, top-to-bottom)indices[k]: The column index of the -th non-zero entryindptr[i]: The position indata/indiceswhere row begins;indptr[i+1]is where row ends
Example: Matrix with non-zeros at positions (0,2)=5, (1,0)=3, (1,2)=7, (2,1)=4:
data = [5, 3, 7, 4]
indices = [2, 0, 2, 1]
indptr = [0, 1, 3, 4]
Row 0 non-zeros: data[0:1] = [5] at columns indices[0:1] = [2]
Row 1 non-zeros: data[1:3] = [3, 7] at columns indices[1:3] = [0, 2]
Row 2 non-zeros: data[3:4] = [4] at columns indices[3:4] = [1]
Memory: - approximately twice the dense memory for just non-zeros.
Matrix-vector product efficiency: Each row computation accesses only data[indptr[i]:indptr[i+1]] and the corresponding columns of . For a dense row, this is ; for a sparse row with non-zeros, it is . Total: vs for dense.
Q2: When should you use CSR vs CSC vs COO, and how do you convert between them?
COO (Coordinate format): Use during construction. Stores (row, col, data) triplets. for appending. Supports duplicates (values are summed on conversion). Use when building a matrix incrementally from a stream of (i, j, value) triplets.
CSR (Compressed Sparse Row): Use for:
- Row operations: extracting row is , fast
- Matrix-vector products : row-oriented computation maps well to cache
- Most sparse linear system solvers (after internal conversion)
CSC (Compressed Sparse Column): Use for:
- Column operations: extracting column is , fast
- Left-multiplication : equivalent to on CSC
- Sparse direct solvers (SuperLU, UMFPACK internally use CSC/CSR)
Conversion costs: All conversions are - cheap.
import scipy.sparse as sp
A = sp.coo_matrix(...) # Start here if building incrementally
A_csr = A.tocsr() # For Ax operations - O(nnz + n)
A_csc = A.tocsc() # For A^Tx operations - O(nnz + m)
A_coo = A_csr.tocoo() # Back to COO
# Rule of thumb: always finish construction in COO or LIL,
# convert to CSR for all subsequent arithmetic.
Q3: How does sparsity in attention masks affect transformer training efficiency?
In a standard transformer with sequence length :
- Full attention matrix: - memory and compute
- For causal (autoregressive) attention: only the lower triangle is computed - 50% saving
- For padding masks: padding tokens (typically 30-50% in NLP batches) don't need attention
Dense computation cost: Even with masking, naive dense attention computes all dot products and then sets masked positions to . The softmax and value-weighted sum still process all positions.
Sparse attention approaches:
-
FlashAttention: Tile-based computation that never materializes the full matrix. Skips tiles where all entries are masked. Reduces memory from to .
-
Longformer/BigBird: Replace dense attention with sparse patterns (sliding window + global tokens). Reduces to where is window size.
-
Block-sparse attention: Explicitly mask out entire blocks. GPU can skip sparse blocks completely.
Memory impact: For , batch = 32, 16 heads:
- Dense attention matrix: bytes = 8.6 GB
- With 50% sparsity: 4.3 GB
- FlashAttention: - only ~200 MB
This is why long-context models (100K+ tokens) require sparse attention - dense attention would need terabytes of memory.
Q4: How is sparse matrix-vector multiplication more efficient than dense, and what are its limitations?
Dense matrix-vector product: where requires multiplications - operations.
Sparse matrix-vector product (CSR format):
for i in range(n):
y[i] = sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i+1]))
Total operations: where for sparse matrices.
For density (fraction of non-zeros):
- Speedup ratio:
- For : 100× speedup
- For : 10,000× speedup
Limitations:
-
Memory access patterns: Sparse formats access
x[indices[k]]where indices are irregular. This causes cache misses - expensive on CPU/GPU. Dense operations use contiguous memory with predictable access patterns and benefit from BLAS GEMM optimizations. -
GPU efficiency: Dense matrix multiplications use tensor cores (A100: 312 TFLOP/s for dense GEMM vs ~50 TFLOP/s for sparse). For sparsity below ~90%, sparse GPU kernels can be slower than dense.
-
Structured sparsity required for GPU speedup: NVIDIA's Ampere sparse tensor cores require N:M structured sparsity (e.g., 2:4 = 2 non-zeros per group of 4). Random unstructured sparsity doesn't benefit from hardware acceleration.
-
Overhead for small matrices: For matrices smaller than ~1000×1000, sparse format overhead (pointer chasing, indirection) can make sparse slower than dense.
Q5: What is the Graph Laplacian and why is it always stored as a sparse matrix?
Graph Laplacian: For a graph with adjacency matrix and degree matrix (diagonal matrix with ):
Properties:
- Symmetric:
- Positive semi-definite:
- , if , else 0
- (constant functions in the null space)
Why sparse: A graph with nodes and edges has with exactly non-zeros (diagonal + two symmetric off-diagonal entries per edge). For sparse graphs where (each node has bounded degree):
- vs for dense
- Storage: vs
For the Cora citation network: , , so - vs for dense. Sparse saves 540× memory.
ML use cases:
- Spectral clustering: solve for smallest eigenvalues → stored sparse, eigsh finds k smallest efficiently
- GCN propagation: where is normalized Laplacian
- Manifold learning (Laplacian Eigenmaps): graph Laplacian from k-NN graph
- Semi-supervised learning: label propagation using graph Laplacian regularization
Quick Reference
| Format | Best for | Row access | Col access | Arithmetic |
|---|---|---|---|---|
| COO | Construction | Slow | Slow | None |
| CSR | , row slicing | Slow | Fast | |
| CSC | , col slicing | Slow | Fast | |
| LIL | Incremental fill | Fast | Slow | None |
| DOK | Random access | Slow | Slow | None |
| BSR | Block-structured | Block | Block | Fast |
| Sparsity | Dense vs Sparse | GPU recommendation |
|---|---|---|
| < 50% | Dense faster | Always use dense |
| 50–90% | Similar | Dense usually faster on GPU |
| 90–99% | Sparse faster | Use structured sparsity for GPU |
| > 99% | Sparse much faster | Always sparse |
Module 08 complete - next: Module 09: Graph Theory →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Sparse Matrix Methods demo on the EngineersOfAI Playground - no code required.
:::
