Skip to main content

Sparse Matrix Methods - CSR/CSC Formats, Efficient Operations, and ML Sparsity

Reading time: ~22 minutes | Level: Numerical Methods → Systems Engineering

A BERT model with 512-token sequences has an attention mask of shape (512, 512). In a batch of 32, that is 32 attention masks - but in a typical batch, 30–50% of tokens are padding. Storing all 32 × 512 × 512 = 8.4 million mask values as dense float32 wastes memory, and applying dense attention to padded tokens wastes compute.

A social network graph has 1 billion users. The adjacency matrix would be 109×10910^9 \times 10^9 - 4×10184 \times 10^{18} bytes in dense float32. But the average user has 150 friends - only 0.000000015% of the matrix is non-zero. Sparse formats store only those 150 values.

Sparsity is everywhere in ML. Understanding sparse matrix formats is the difference between systems that scale to billions of nodes and those that crash with out-of-memory errors.

What You Will Learn

  • Why dense formats waste memory and compute for sparse data
  • CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column) formats
  • COO, LIL, DOK: when to use each format
  • SciPy sparse matrix operations: construction, arithmetic, solvers
  • Attention masks and causal masking in transformers
  • Sparse graph adjacency matrices for GNNs
  • Sparse embedding tables in recommendation systems
  • Model pruning: inducing and leveraging weight sparsity

Part 1 - The Density of Real ML Matrices

A matrix with nn rows, mm columns, and kk non-zero entries has density ρ=k/(nm)\rho = k/(nm).

MatrixSizeNon-zerosDensityDense MemorySparse Memory
Attention mask (512 tokens, 50% padding)512×512~65,536~25%1 MB0.25 MB
Document-term matrix (1M docs, 100K vocab)1M×100K~5×10⁸0.5%400 GB2 GB
Social graph adjacency (1B users, avg 150)1B×1B1.5×10¹¹~0%4 EB1.2 TB
Pruned neural network (90% sparsity)varies10% nnz10%100 MB10 MB
Graph Laplacian (citation network)1M×1M~10M0.001%4 TB80 MB

For density below ~5%, sparse formats win on both memory and compute.

Part 2 - Sparse Storage Formats

COO (Coordinate Format)

The simplest format: store three arrays - row, col, data.

import numpy as np

# Dense matrix:
# [[5, 0, 0, 0],
# [0, 8, 0, 3],
# [0, 0, 3, 0],
# [0, 6, 0, 0]]

row = np.array([0, 1, 1, 2, 3]) # row indices
col = np.array([0, 1, 3, 2, 1]) # column indices
data = np.array([5, 8, 3, 3, 6]) # values

# Memory: 3 × nnz × dtype_size = 3 × 5 × 4 = 60 bytes
# Dense equivalent: 4×4×4 = 64 bytes (barely smaller here, but scales)

COO advantages:

  • Easy to construct incrementally (append non-zeros)
  • Easy to convert to other formats
  • Supports duplicate entries (values are summed) - useful for finite elements

COO disadvantages:

  • Row/column slicing is slow (O(nnz)O(nnz))
  • Matrix-vector product requires sorting or conversion first

CSR (Compressed Sparse Row)

CSR is the most widely used format for matrix-vector products and row slicing.

Storage:

  • data[k]: value of the kk-th non-zero
  • indices[k]: column index of the kk-th non-zero
  • indptr[i]: position in data/indices where row ii begins
import scipy.sparse as sp
import numpy as np

# Same matrix as above in CSR:
# Row 0: (0, 5) → data=[5], indices=[0]
# Row 1: (1, 8), (3, 3) → data=[8, 3], indices=[1, 3]
# Row 2: (2, 3) → data=[3], indices=[2]
# Row 3: (1, 6) → data=[6], indices=[1]

data = np.array([5, 8, 3, 3, 6], dtype=np.float32)
indices = np.array([0, 1, 3, 2, 1], dtype=np.int32)
indptr = np.array([0, 1, 3, 4, 5], dtype=np.int32)
# indptr[i] = position where row i starts
# indptr[0] = 0: row 0 starts at position 0
# indptr[1] = 1: row 1 starts at position 1
# indptr[2] = 3: row 2 starts at position 3
# indptr[3] = 4: row 3 starts at position 4
# indptr[4] = 5: end of data

A_csr = sp.csr_matrix((data, indices, indptr), shape=(4, 4))

# Memory analysis
print(f"nnz = {A_csr.nnz}")
print(f"data: {A_csr.data.nbytes} bytes")
print(f"indices: {A_csr.indices.nbytes} bytes")
print(f"indptr: {A_csr.indptr.nbytes} bytes")
total = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
print(f"Total CSR: {total} bytes")
print(f"Dense equivalent: {4*4*4} bytes")

Row ii access: data[indptr[i]:indptr[i+1]] and indices[indptr[i]:indptr[i+1]] - O(nnzi)O(nnz_i), constant time per non-zero.

Matrix-vector product y=Axy = Ax:

For each row i:
y[i] = sum over k in [indptr[i], indptr[i+1]):
data[k] * x[indices[k]]

This is O(nnz)O(nnz) - only non-zero elements participate. Compare to dense O(n2)O(n^2).

CSC (Compressed Sparse Column)

Exactly like CSR but organized by columns instead of rows:

  • indptr[j] to indptr[j+1]: elements in column jj
  • indices[k]: row index of the kk-th non-zero

When to use CSC:

  • Efficient column slicing
  • Left-multiplication: y=ATxy = A^T x (equivalent to y=ACSCxy = A_{\text{CSC}} x)
  • Solving Ax=bAx = b with sparse factorization (UMFPACK, CHOLMOD use CSC internally)
# CSC from same matrix
A_csc = A_csr.tocsc()
print(f"CSC format: {A_csc}")

# Column 1: rows 1 and 3 have non-zeros
col1 = A_csc.getcol(1).toarray()
print(col1)

Other formats

import scipy.sparse as sp

# LIL (List of Lists): best for incremental construction
A_lil = sp.lil_matrix((1000, 1000), dtype=np.float32)
# Assign individual elements efficiently
for i in range(100):
A_lil[i, i] = 1.0
A_lil[i, i+1] = -0.5
# Convert to CSR for arithmetic
A_csr_from_lil = A_lil.tocsr()

# DOK (Dictionary of Keys): O(1) random access, good for construction
A_dok = sp.dok_matrix((1000, 1000), dtype=np.float32)
A_dok[5, 7] = 3.14 # O(1) insertion

# BSR (Block Sparse Row): for block-structured problems
# Common in FEM and some neural network weight matrices

Format selection guide

Constructing the matrix?
├── Incrementally (append entries one at a time)
│ └── Use LIL → convert to CSR when done

├── From (row, col, data) triplets
│ └── Use COO → tocsr() or tocsc()

└── Already know CSR structure
└── Build CSR directly (fastest construction)

Doing computations?
├── Matrix-vector multiply (Ax)
│ └── CSR (row-oriented - cache friendly for result)

├── Vector-matrix multiply (x^T A)
│ └── CSC (column-oriented - same as A^T x with CSR)

├── Sparse linear system solve
│ └── CSC (most sparse solvers use CSC internally)

└── Random access A[i, j]
└── CSR or CSC (O(log(nnz per row/col)))

Part 3 - Sparse Operations in SciPy

import scipy.sparse as sp
import numpy as np

# ── Construction from dense ────────────────────────────────────────────────
A_dense = np.array([[1, 0, 2], [0, 3, 0], [4, 0, 5]], dtype=np.float32)
A_sparse = sp.csr_matrix(A_dense)

# ── Basic operations ───────────────────────────────────────────────────────
# Matrix-vector product
x = np.array([1.0, 2.0, 3.0])
y = A_sparse @ x # Same as A_sparse.dot(x)
print(f"Sparse Ax: {y}")

# Matrix-matrix product (result may not be sparse!)
B_sparse = sp.csr_matrix(np.eye(3, dtype=np.float32))
C = A_sparse @ B_sparse # Returns sparse if result is sparse
print(f"Sparse A@B type: {type(C)}")

# ── Arithmetic ─────────────────────────────────────────────────────────────
A2 = A_sparse + A_sparse # Sparse addition
A3 = 2.5 * A_sparse # Scalar multiplication

# ── Slicing ───────────────────────────────────────────────────────────────
row_0 = A_sparse[0, :] # Row 0 - O(nnz in row)
col_1 = A_sparse[:, 1] # Column 1

# ── Conversion ────────────────────────────────────────────────────────────
A_csc = A_sparse.tocsc()
A_coo = A_sparse.tocoo()
A_back_dense = A_sparse.toarray() # Convert back to dense

# ── Sparse linear system ──────────────────────────────────────────────────
from scipy.sparse.linalg import spsolve

n = 1000
# Create sparse positive definite system (tridiagonal)
diag_main = 4 * np.ones(n)
diag_off = -1 * np.ones(n-1)
A_system = sp.diags([diag_off, diag_main, diag_off], [-1, 0, 1],
format='csr', dtype=np.float64)

b = np.random.randn(n)
x_solution = spsolve(A_system, b)
print(f"Residual: {np.linalg.norm(b - A_system @ x_solution):.2e}")

# ── Sparse eigenvalue computation ─────────────────────────────────────────
from scipy.sparse.linalg import eigsh

# Find 5 largest eigenvalues of large sparse SPD matrix
eigenvalues, eigenvectors = eigsh(A_system, k=5, which='LM') # Largest Magnitude
print(f"5 largest eigenvalues: {eigenvalues}")

Part 4 - Attention Masks in Transformers

The attention mask is one of the most common sparse structures in modern ML.

Dense vs sparse attention masks

import numpy as np
import scipy.sparse as sp
import torch

def create_causal_mask_dense(seq_len: int) -> np.ndarray:
"""
Dense causal (autoregressive) attention mask.
Upper triangular values are -inf (or 0 for boolean mask).
Memory: seq_len^2 × 4 bytes
"""
mask = np.full((seq_len, seq_len), -np.inf, dtype=np.float32)
mask = np.tril(mask + np.inf) # Upper tri = -inf, lower+diag = 0
return mask

def create_causal_mask_sparse(seq_len: int) -> sp.spmatrix:
"""
Sparse causal mask - only store the non-inf positions.
Memory: nnz × (4 + 4) bytes (data + column index)
"""
# Lower triangular part has nnz = seq_len*(seq_len+1)/2 entries
rows = []
cols = []
for i in range(seq_len):
for j in range(i + 1): # j <= i (lower triangular)
rows.append(i)
cols.append(j)

data = np.ones(len(rows), dtype=np.float32)
return sp.csr_matrix((data, (rows, cols)), shape=(seq_len, seq_len))

# Memory comparison
seq_len = 2048
dense_mask = create_causal_mask_dense(seq_len)
dense_bytes = dense_mask.nbytes

sparse_mask = create_causal_mask_sparse(seq_len)
sparse_bytes = sparse_mask.data.nbytes + sparse_mask.indices.nbytes + sparse_mask.indptr.nbytes

print(f"Dense causal mask ({seq_len}×{seq_len}): {dense_bytes / 1e6:.1f} MB")
print(f"Sparse causal mask: {sparse_bytes / 1e6:.1f} MB")
print(f"Memory ratio: {dense_bytes / sparse_bytes:.1f}x")

# For 2048 tokens: dense=16 MB, sparse=8 MB (about 2x for triangular mask)
# For 8192 tokens: dense=256 MB, sparse=128 MB

Sparse padding mask

def apply_sparse_padding_mask(
attention_scores: torch.Tensor, # (batch, heads, seq_len, seq_len)
padding_mask: torch.Tensor, # (batch, seq_len) - True = pad token
fill_value: float = -1e9
) -> torch.Tensor:
"""
Apply padding mask without materializing full (batch, heads, seq_len, seq_len) tensor.
"""
# Expand padding mask to match attention scores shape
# padding_mask: (batch, 1, 1, seq_len) after unsqueeze
expanded_mask = padding_mask[:, None, None, :] # broadcast over heads and query positions
attention_scores = attention_scores.masked_fill(expanded_mask, fill_value)
return attention_scores

# Flash Attention (implemented in PyTorch 2.x) handles sparsity automatically
# It processes attention in tiles and skips all-masked tiles

Part 5 - Sparse Graph Adjacency Matrices

For graph neural networks, the adjacency matrix AA and normalized Laplacian L^\hat{L} are stored in sparse formats.

import scipy.sparse as sp
import numpy as np

def build_graph_adjacency(edges: list, n_nodes: int) -> sp.csr_matrix:
"""
Build sparse adjacency matrix from edge list.
Typical for GNN preprocessing.

edges: list of (u, v) tuples (undirected)
"""
rows, cols, data = [], [], []
for u, v in edges:
rows.append(u); cols.append(v); data.append(1.0)
rows.append(v); cols.append(u); data.append(1.0) # symmetric

return sp.csr_matrix(
(data, (rows, cols)),
shape=(n_nodes, n_nodes),
dtype=np.float32
)

def normalize_adjacency_gcn(A: sp.spmatrix) -> sp.spmatrix:
"""
GCN normalization: Ã = D^{-1/2} (A + I) D^{-1/2}
Where D is the degree matrix of (A + I).
This is the propagation rule in Kipf & Welling (2017).
"""
# Add self-loops: Â = A + I
A_hat = A + sp.eye(A.shape[0], dtype=np.float32)

# Degree matrix D (as vector of diagonal entries)
degrees = np.array(A_hat.sum(axis=1)).flatten()
d_inv_sqrt = np.power(degrees, -0.5)
d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0 # Handle isolated nodes

# D^{-1/2} as sparse diagonal matrix
D_inv_sqrt = sp.diags(d_inv_sqrt)

# Symmetrically normalized: D^{-1/2} Â D^{-1/2}
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
return A_norm.tocsr()

def sparse_graph_conv(
A_norm: sp.csr_matrix,
H: np.ndarray, # (n_nodes, in_features)
W: np.ndarray # (in_features, out_features)
) -> np.ndarray:
"""
One GCN layer: H_new = σ(Ã H W)
Sparse A_norm × dense H - key operation in GNNs.
"""
# Sparse-dense matrix multiply: O(nnz × out_features)
AH = A_norm @ H # (n_nodes, in_features) - sparse × dense
return np.maximum(0, AH @ W) # ReLU activation

# Example: Cora citation network
# 2708 nodes, 5429 edges - typical benchmark
n_nodes = 2708
n_features = 1433
n_classes = 7

# Random graph for illustration (use real datasets in practice)
np.random.seed(42)
n_edges = 5429
edge_list = [(np.random.randint(0, n_nodes), np.random.randint(0, n_nodes))
for _ in range(n_edges)]

A = build_graph_adjacency(edge_list, n_nodes)
A_norm = normalize_adjacency_gcn(A)

print(f"Adjacency matrix: {A.shape}, {A.nnz} non-zeros, density={A.nnz/(n_nodes**2):.4f}")
print(f"Memory (CSR): {(A.data.nbytes + A.indices.nbytes + A.indptr.nbytes) / 1e3:.1f} KB")
print(f"Memory (dense equivalent): {n_nodes**2*4 / 1e6:.1f} MB")

# GCN forward pass
H = np.random.randn(n_nodes, n_features).astype(np.float32)
W1 = np.random.randn(n_features, 64).astype(np.float32) * 0.01
H1 = sparse_graph_conv(A_norm, H, W1)
print(f"After GCN layer 1: {H1.shape}") # (2708, 64)

Part 6 - Sparse Embeddings in Recommendation Systems

Recommendation systems have huge embedding tables with billions of items. Most interactions are sparse.

import numpy as np
import scipy.sparse as sp

class SparseEmbeddingTable:
"""
Memory-efficient sparse embedding lookup.
Used in collaborative filtering and recommendation systems.
"""
def __init__(self, n_items: int, n_users: int, embed_dim: int):
self.n_items = n_items
self.n_users = n_users
self.embed_dim = embed_dim

# User-item interaction matrix (sparse)
# For 1M users × 10M items × avg 100 interactions:
# Dense: 1M × 10M × 4 bytes = 40 TB
# Sparse: 100M × (4+4+4) bytes ≈ 1.2 GB
self.interactions = sp.csr_matrix(
(n_users, n_items), dtype=np.float32
)

def add_interactions(self, user_ids, item_ids, values):
"""Add user-item interactions."""
interaction_update = sp.csr_matrix(
(values, (user_ids, item_ids)),
shape=(self.n_users, self.n_items)
)
self.interactions += interaction_update

def user_based_cf(self, user_id: int, k: int = 10) -> np.ndarray:
"""
Find top-k items for user via item-based collaborative filtering.
Uses sparse matrix operations for efficiency.
"""
# Get user's interaction vector (sparse)
user_vec = self.interactions[user_id] # (1, n_items) sparse

# Compute similarity to all users using sparse dot product
# (n_users, n_items) × (n_items, 1) = (n_users, 1)
similarities = self.interactions @ user_vec.T # sparse × sparse = sparse

# Get most similar users (excluding self)
sims = similarities.toarray().flatten()
sims[user_id] = -np.inf
similar_users = np.argsort(sims)[-k:]

# Aggregate their items
neighbor_interactions = self.interactions[similar_users]
scores = np.array(neighbor_interactions.sum(axis=0)).flatten()

# Remove items user already interacted with
user_items = self.interactions[user_id].nonzero()[1]
scores[user_items] = -np.inf

return np.argsort(scores)[-k:][::-1]

Part 7 - Sparse Weight Matrices in Pruned Networks

Model pruning induces weight sparsity. A pruned network can be represented with sparse matrices for efficient inference.

import numpy as np
import scipy.sparse as sp

def magnitude_prune(weight: np.ndarray, sparsity: float) -> sp.csr_matrix:
"""
Magnitude-based pruning: set smallest weights to zero.

sparsity: fraction of weights to prune (e.g., 0.9 = 90% sparsity)
"""
threshold = np.percentile(np.abs(weight), 100 * sparsity)
mask = np.abs(weight) >= threshold
pruned = weight * mask
return sp.csr_matrix(pruned)

def structured_sparsity(weight: np.ndarray, n_heads: int) -> sp.bsr_matrix:
"""
Block-structured sparsity for GPU efficiency.
Prunes entire attention heads or weight blocks.
BSR (Block Sparse Row) format is better than CSR for structured patterns.
"""
block_size = weight.shape[0] // n_heads
blocks = weight.reshape(n_heads, block_size, -1)

# Prune heads with smallest L2 norm
head_norms = np.linalg.norm(blocks, axis=(1, 2))
prune_threshold = np.percentile(head_norms, 50) # Prune 50% of heads

for i in range(n_heads):
if head_norms[i] < prune_threshold:
blocks[i] = 0.0

pruned = blocks.reshape(weight.shape)
return sp.bsr_matrix(pruned, blocksize=(block_size, block_size))

# Memory comparison for a pruned transformer layer
d_model = 1024
weight = np.random.randn(d_model, d_model).astype(np.float32)

for sparsity_level in [0.5, 0.9, 0.99]:
sparse_w = magnitude_prune(weight, sparsity_level)
dense_bytes = weight.nbytes
sparse_bytes = sparse_w.data.nbytes + sparse_w.indices.nbytes + sparse_w.indptr.nbytes
print(f"Sparsity {sparsity_level:.0%}: dense={dense_bytes/1e6:.1f} MB, "
f"sparse={sparse_bytes/1e6:.1f} MB, "
f"ratio={dense_bytes/sparse_bytes:.1f}x")

:::warning Sparse matrices are not always faster in practice Sparse-dense matrix multiplication often has poor cache performance due to irregular memory access patterns. For GPUs, dense operations with cuBLAS use tensor cores that achieve teraflop throughput - sparse equivalents may actually be slower until sparsity exceeds ~95%.

Modern solutions: structured sparsity (block-sparse, N:M sparsity supported by NVIDIA Ampere+ GPUs), or semi-structured sparsity where 50% of every 2×2 block is zero - supported natively by Ampere's sparse tensor cores. :::

Interview Questions

Q1: Explain the CSR format. What is stored in data, indices, and indptr?

CSR (Compressed Sparse Row) stores a sparse matrix in three arrays:

  • data[k]: The value of the kk-th non-zero entry (reading left-to-right, top-to-bottom)
  • indices[k]: The column index of the kk-th non-zero entry
  • indptr[i]: The position in data/indices where row ii begins; indptr[i+1] is where row ii ends

Example: Matrix AA with non-zeros at positions (0,2)=5, (1,0)=3, (1,2)=7, (2,1)=4:

data = [5, 3, 7, 4]
indices = [2, 0, 2, 1]
indptr = [0, 1, 3, 4]

Row 0 non-zeros: data[0:1] = [5] at columns indices[0:1] = [2] Row 1 non-zeros: data[1:3] = [3, 7] at columns indices[1:3] = [0, 2] Row 2 non-zeros: data[3:4] = [4] at columns indices[3:4] = [1]

Memory: 2×nnz×dtype_size+(n+1)×int_size2 \times nnz \times \text{dtype\_size} + (n+1) \times \text{int\_size} - approximately twice the dense memory for just non-zeros.

Matrix-vector product efficiency: Each row ii computation accesses only data[indptr[i]:indptr[i+1]] and the corresponding columns of xx. For a dense row, this is O(n)O(n); for a sparse row with kk non-zeros, it is O(k)O(k). Total: O(nnz)O(nnz) vs O(n2)O(n^2) for dense.

Q2: When should you use CSR vs CSC vs COO, and how do you convert between them?

COO (Coordinate format): Use during construction. Stores (row, col, data) triplets. O(nnz)O(nnz) for appending. Supports duplicates (values are summed on conversion). Use when building a matrix incrementally from a stream of (i, j, value) triplets.

CSR (Compressed Sparse Row): Use for:

  • Row operations: extracting row ii is O(nnzi)O(nnz_i), fast
  • Matrix-vector products AxAx: row-oriented computation maps well to cache
  • Most sparse linear system solvers (after internal conversion)

CSC (Compressed Sparse Column): Use for:

  • Column operations: extracting column jj is O(nnzj)O(nnz_j), fast
  • Left-multiplication xTAx^T A: equivalent to ATxA^T x on CSC
  • Sparse direct solvers (SuperLU, UMFPACK internally use CSC/CSR)

Conversion costs: All conversions are O(nnz+m+n)O(nnz + m + n) - cheap.

import scipy.sparse as sp
A = sp.coo_matrix(...) # Start here if building incrementally

A_csr = A.tocsr() # For Ax operations - O(nnz + n)
A_csc = A.tocsc() # For A^Tx operations - O(nnz + m)
A_coo = A_csr.tocoo() # Back to COO

# Rule of thumb: always finish construction in COO or LIL,
# convert to CSR for all subsequent arithmetic.
Q3: How does sparsity in attention masks affect transformer training efficiency?

In a standard transformer with sequence length TT:

  • Full attention matrix: T×TT \times T - O(T2)O(T^2) memory and compute
  • For causal (autoregressive) attention: only the lower triangle is computed - 50% saving
  • For padding masks: padding tokens (typically 30-50% in NLP batches) don't need attention

Dense computation cost: Even with masking, naive dense attention computes all T2T^2 dot products and then sets masked positions to -\infty. The softmax and value-weighted sum still process all positions.

Sparse attention approaches:

  1. FlashAttention: Tile-based computation that never materializes the full T2T^2 matrix. Skips tiles where all entries are masked. Reduces memory from O(T2)O(T^2) to O(T)O(T).

  2. Longformer/BigBird: Replace dense attention with sparse patterns (sliding window + global tokens). Reduces O(T2)O(T^2) to O(Tk)O(T \cdot k) where kk is window size.

  3. Block-sparse attention: Explicitly mask out entire B×BB \times B blocks. GPU can skip sparse blocks completely.

Memory impact: For T=2048T = 2048, batch = 32, 16 heads:

  • Dense attention matrix: 32×16×20482×432 \times 16 \times 2048^2 \times 4 bytes = 8.6 GB
  • With 50% sparsity: 4.3 GB
  • FlashAttention: O(T)O(T) - only ~200 MB

This is why long-context models (100K+ tokens) require sparse attention - dense attention would need terabytes of memory.

Q4: How is sparse matrix-vector multiplication more efficient than dense, and what are its limitations?

Dense matrix-vector product: y=Axy = Ax where ARn×nA \in \mathbb{R}^{n \times n} requires n2n^2 multiplications - O(n2)O(n^2) operations.

Sparse matrix-vector product (CSR format):

for i in range(n):
y[i] = sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i+1]))

Total operations: O(nnz)O(nnz) where nnzn2nnz \ll n^2 for sparse matrices.

For density ρ\rho (fraction of non-zeros):

  • Speedup ratio: n2/nnz=1/ρn^2 / nnz = 1/\rho
  • For ρ=0.01\rho = 0.01: 100× speedup
  • For ρ=0.0001\rho = 0.0001: 10,000× speedup

Limitations:

  1. Memory access patterns: Sparse formats access x[indices[k]] where indices are irregular. This causes cache misses - expensive on CPU/GPU. Dense operations use contiguous memory with predictable access patterns and benefit from BLAS GEMM optimizations.

  2. GPU efficiency: Dense matrix multiplications use tensor cores (A100: 312 TFLOP/s for dense GEMM vs ~50 TFLOP/s for sparse). For sparsity below ~90%, sparse GPU kernels can be slower than dense.

  3. Structured sparsity required for GPU speedup: NVIDIA's Ampere sparse tensor cores require N:M structured sparsity (e.g., 2:4 = 2 non-zeros per group of 4). Random unstructured sparsity doesn't benefit from hardware acceleration.

  4. Overhead for small matrices: For matrices smaller than ~1000×1000, sparse format overhead (pointer chasing, indirection) can make sparse slower than dense.

Q5: What is the Graph Laplacian and why is it always stored as a sparse matrix?

Graph Laplacian: For a graph G=(V,E)G = (V, E) with adjacency matrix AA and degree matrix DD (diagonal matrix with Dii=jAijD_{ii} = \sum_j A_{ij}):

L=DAL = D - A

Properties:

  • Symmetric: L=LTL = L^T
  • Positive semi-definite: xTLx=(i,j)E(xixj)20x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2 \geq 0
  • Lii=degree(i)L_{ii} = \text{degree}(i), Lij=1L_{ij} = -1 if (i,j)E(i,j) \in E, else 0
  • L1=0L\mathbf{1} = 0 (constant functions in the null space)

Why sparse: A graph with nn nodes and mm edges has LL with exactly n+2mn + 2m non-zeros (diagonal + two symmetric off-diagonal entries per edge). For sparse graphs where m=O(n)m = O(n) (each node has bounded degree):

  • nnz(L)=O(n)nnz(L) = O(n) vs n2n^2 for dense
  • Storage: O(n)O(n) vs O(n2)O(n^2)

For the Cora citation network: n=2708n = 2708, m=5429m = 5429, so nnz=2708+10858=13566nnz = 2708 + 10858 = 13566 - vs 27082=7,333,2642708^2 = 7,333,264 for dense. Sparse saves 540× memory.

ML use cases:

  • Spectral clustering: solve Lv=λvLv = \lambda v for smallest eigenvalues → stored sparse, eigsh finds k smallest efficiently
  • GCN propagation: Hnew=L~HWH_{new} = \tilde{L} H W where L~\tilde{L} is normalized Laplacian
  • Manifold learning (Laplacian Eigenmaps): graph Laplacian from k-NN graph
  • Semi-supervised learning: label propagation using graph Laplacian regularization

Quick Reference

FormatBest forRow accessCol accessArithmetic
COOConstructionSlowSlowNone
CSRAxAx, row slicingO(nnzi)O(nnz_i)SlowFast
CSCATxA^Tx, col slicingSlowO(nnzj)O(nnz_j)Fast
LILIncremental fillFastSlowNone
DOKRandom accessSlowSlowNone
BSRBlock-structuredBlockBlockFast
SparsityDense vs SparseGPU recommendation
< 50%Dense fasterAlways use dense
50–90%SimilarDense usually faster on GPU
90–99%Sparse fasterUse structured sparsity for GPU
> 99%Sparse much fasterAlways sparse

Module 08 complete - next: Module 09: Graph Theory →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Sparse Matrix Methods demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.