Sparse Matrix Methods - CSR/CSC Formats, Efficient Operations, and ML Sparsity

Reading time: ~22 minutes | Level: Numerical Methods → Systems Engineering

A BERT model with 512-token sequences has an attention mask of shape (512, 512). In a batch of 32, that is 32 attention masks - but in a typical batch, 30–50% of tokens are padding. Storing all 32 × 512 × 512 = 8.4 million mask values as dense float32 wastes memory, and applying dense attention to padded tokens wastes compute.

A social network graph has 1 billion users. The adjacency matrix would be $10^9 \times 10^9$ - $4 \times 10^{18}$ bytes in dense float32. But the average user has 150 friends - only 0.000000015% of the matrix is non-zero. Sparse formats store only those 150 values.

Sparsity is everywhere in ML. Understanding sparse matrix formats is the difference between systems that scale to billions of nodes and those that crash with out-of-memory errors.

What You Will Learn

Why dense formats waste memory and compute for sparse data
CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column) formats
COO, LIL, DOK: when to use each format
SciPy sparse matrix operations: construction, arithmetic, solvers
Attention masks and causal masking in transformers
Sparse graph adjacency matrices for GNNs
Sparse embedding tables in recommendation systems
Model pruning: inducing and leveraging weight sparsity

Part 1 - The Density of Real ML Matrices

A matrix with $n$ rows, $m$ columns, and $k$ non-zero entries has density $\rho = k/(nm)$ .

Matrix	Size	Non-zeros	Density	Dense Memory	Sparse Memory
Attention mask (512 tokens, 50% padding)	512×512	~65,536	~25%	1 MB	0.25 MB
Document-term matrix (1M docs, 100K vocab)	1M×100K	~5×10⁸	0.5%	400 GB	2 GB
Social graph adjacency (1B users, avg 150)	1B×1B	1.5×10¹¹	~0%	4 EB	1.2 TB
Pruned neural network (90% sparsity)	varies	10% nnz	10%	100 MB	10 MB
Graph Laplacian (citation network)	1M×1M	~10M	0.001%	4 TB	80 MB

For density below ~5%, sparse formats win on both memory and compute.

Part 2 - Sparse Storage Formats

COO (Coordinate Format)

The simplest format: store three arrays - row, col, data.

import numpy as np

# Dense matrix:
# [[5, 0, 0, 0],
#  [0, 8, 0, 3],
#  [0, 0, 3, 0],
#  [0, 6, 0, 0]]

row  = np.array([0, 1, 1, 2, 3])      # row indices
col  = np.array([0, 1, 3, 2, 1])      # column indices
data = np.array([5, 8, 3, 3, 6])      # values

# Memory: 3 × nnz × dtype_size = 3 × 5 × 4 = 60 bytes
# Dense equivalent: 4×4×4 = 64 bytes (barely smaller here, but scales)

COO advantages:

Easy to construct incrementally (append non-zeros)
Easy to convert to other formats
Supports duplicate entries (values are summed) - useful for finite elements

COO disadvantages:

Row/column slicing is slow ( $O(nnz)$ )
Matrix-vector product requires sorting or conversion first

CSR (Compressed Sparse Row)

CSR is the most widely used format for matrix-vector products and row slicing.

Storage:

data[k]: value of the $k$ -th non-zero
indices[k]: column index of the $k$ -th non-zero
indptr[i]: position in data/indices where row $i$ begins

import scipy.sparse as sp
import numpy as np

# Same matrix as above in CSR:
# Row 0: (0, 5)            → data=[5],    indices=[0]
# Row 1: (1, 8), (3, 3)   → data=[8, 3], indices=[1, 3]
# Row 2: (2, 3)            → data=[3],    indices=[2]
# Row 3: (1, 6)            → data=[6],    indices=[1]

data    = np.array([5, 8, 3, 3, 6],    dtype=np.float32)
indices = np.array([0, 1, 3, 2, 1],    dtype=np.int32)
indptr  = np.array([0, 1, 3, 4, 5],    dtype=np.int32)
# indptr[i] = position where row i starts
# indptr[0] = 0: row 0 starts at position 0
# indptr[1] = 1: row 1 starts at position 1
# indptr[2] = 3: row 2 starts at position 3
# indptr[3] = 4: row 3 starts at position 4
# indptr[4] = 5: end of data

A_csr = sp.csr_matrix((data, indices, indptr), shape=(4, 4))

# Memory analysis
print(f"nnz = {A_csr.nnz}")
print(f"data:    {A_csr.data.nbytes} bytes")
print(f"indices: {A_csr.indices.nbytes} bytes")
print(f"indptr:  {A_csr.indptr.nbytes} bytes")
total = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
print(f"Total CSR: {total} bytes")
print(f"Dense equivalent: {4*4*4} bytes")

Row $i$ access: data[indptr[i]:indptr[i+1]] and indices[indptr[i]:indptr[i+1]] - $O(nnz_i)$ , constant time per non-zero.

Matrix-vector product $y = Ax$ :

For each row i:
    y[i] = sum over k in [indptr[i], indptr[i+1]):
               data[k] * x[indices[k]]

This is $O(nnz)$ - only non-zero elements participate. Compare to dense $O(n^2)$ .

CSC (Compressed Sparse Column)

Exactly like CSR but organized by columns instead of rows:

indptr[j] to indptr[j+1]: elements in column $j$
indices[k]: row index of the $k$ -th non-zero

When to use CSC:

Efficient column slicing
Left-multiplication: $y = A^T x$ (equivalent to $y = A_{\text{CSC}} x$ )
Solving $Ax = b$ with sparse factorization (UMFPACK, CHOLMOD use CSC internally)

# CSC from same matrix
A_csc = A_csr.tocsc()
print(f"CSC format: {A_csc}")

# Column 1: rows 1 and 3 have non-zeros
col1 = A_csc.getcol(1).toarray()
print(col1)

Other formats

import scipy.sparse as sp

# LIL (List of Lists): best for incremental construction
A_lil = sp.lil_matrix((1000, 1000), dtype=np.float32)
# Assign individual elements efficiently
for i in range(100):
    A_lil[i, i] = 1.0
    A_lil[i, i+1] = -0.5
# Convert to CSR for arithmetic
A_csr_from_lil = A_lil.tocsr()

# DOK (Dictionary of Keys): O(1) random access, good for construction
A_dok = sp.dok_matrix((1000, 1000), dtype=np.float32)
A_dok[5, 7] = 3.14   # O(1) insertion

# BSR (Block Sparse Row): for block-structured problems
# Common in FEM and some neural network weight matrices

Format selection guide

Constructing the matrix?
├── Incrementally (append entries one at a time)
│   └── Use LIL → convert to CSR when done
│
├── From (row, col, data) triplets
│   └── Use COO → tocsr() or tocsc()
│
└── Already know CSR structure
    └── Build CSR directly (fastest construction)

Doing computations?
├── Matrix-vector multiply (Ax)
│   └── CSR (row-oriented - cache friendly for result)
│
├── Vector-matrix multiply (x^T A)
│   └── CSC (column-oriented - same as A^T x with CSR)
│
├── Sparse linear system solve
│   └── CSC (most sparse solvers use CSC internally)
│
└── Random access A[i, j]
    └── CSR or CSC (O(log(nnz per row/col)))

Part 3 - Sparse Operations in SciPy

import scipy.sparse as sp
import numpy as np

# ── Construction from dense ────────────────────────────────────────────────
A_dense = np.array([[1, 0, 2], [0, 3, 0], [4, 0, 5]], dtype=np.float32)
A_sparse = sp.csr_matrix(A_dense)

# ── Basic operations ───────────────────────────────────────────────────────
# Matrix-vector product
x = np.array([1.0, 2.0, 3.0])
y = A_sparse @ x                   # Same as A_sparse.dot(x)
print(f"Sparse Ax: {y}")

# Matrix-matrix product (result may not be sparse!)
B_sparse = sp.csr_matrix(np.eye(3, dtype=np.float32))
C = A_sparse @ B_sparse            # Returns sparse if result is sparse
print(f"Sparse A@B type: {type(C)}")

# ── Arithmetic ─────────────────────────────────────────────────────────────
A2 = A_sparse + A_sparse           # Sparse addition
A3 = 2.5 * A_sparse                # Scalar multiplication

# ── Slicing ───────────────────────────────────────────────────────────────
row_0 = A_sparse[0, :]             # Row 0 - O(nnz in row)
col_1 = A_sparse[:, 1]             # Column 1

# ── Conversion ────────────────────────────────────────────────────────────
A_csc = A_sparse.tocsc()
A_coo = A_sparse.tocoo()
A_back_dense = A_sparse.toarray() # Convert back to dense

# ── Sparse linear system ──────────────────────────────────────────────────
from scipy.sparse.linalg import spsolve

n = 1000
# Create sparse positive definite system (tridiagonal)
diag_main = 4 * np.ones(n)
diag_off = -1 * np.ones(n-1)
A_system = sp.diags([diag_off, diag_main, diag_off], [-1, 0, 1],
                    format='csr', dtype=np.float64)

b = np.random.randn(n)
x_solution = spsolve(A_system, b)
print(f"Residual: {np.linalg.norm(b - A_system @ x_solution):.2e}")

# ── Sparse eigenvalue computation ─────────────────────────────────────────
from scipy.sparse.linalg import eigsh

# Find 5 largest eigenvalues of large sparse SPD matrix
eigenvalues, eigenvectors = eigsh(A_system, k=5, which='LM')  # Largest Magnitude
print(f"5 largest eigenvalues: {eigenvalues}")

Part 4 - Attention Masks in Transformers

The attention mask is one of the most common sparse structures in modern ML.

Dense vs sparse attention masks

import numpy as np
import scipy.sparse as sp
import torch

def create_causal_mask_dense(seq_len: int) -> np.ndarray:
    """
    Dense causal (autoregressive) attention mask.
    Upper triangular values are -inf (or 0 for boolean mask).
    Memory: seq_len^2 × 4 bytes
    """
    mask = np.full((seq_len, seq_len), -np.inf, dtype=np.float32)
    mask = np.tril(mask + np.inf)   # Upper tri = -inf, lower+diag = 0
    return mask

def create_causal_mask_sparse(seq_len: int) -> sp.spmatrix:
    """
    Sparse causal mask - only store the non-inf positions.
    Memory: nnz × (4 + 4) bytes (data + column index)
    """
    # Lower triangular part has nnz = seq_len*(seq_len+1)/2 entries
    rows = []
    cols = []
    for i in range(seq_len):
        for j in range(i + 1):  # j <= i (lower triangular)
            rows.append(i)
            cols.append(j)

    data = np.ones(len(rows), dtype=np.float32)
    return sp.csr_matrix((data, (rows, cols)), shape=(seq_len, seq_len))

# Memory comparison
seq_len = 2048
dense_mask = create_causal_mask_dense(seq_len)
dense_bytes = dense_mask.nbytes

sparse_mask = create_causal_mask_sparse(seq_len)
sparse_bytes = sparse_mask.data.nbytes + sparse_mask.indices.nbytes + sparse_mask.indptr.nbytes

print(f"Dense causal mask ({seq_len}×{seq_len}): {dense_bytes / 1e6:.1f} MB")
print(f"Sparse causal mask: {sparse_bytes / 1e6:.1f} MB")
print(f"Memory ratio: {dense_bytes / sparse_bytes:.1f}x")

# For 2048 tokens: dense=16 MB, sparse=8 MB (about 2x for triangular mask)
# For 8192 tokens: dense=256 MB, sparse=128 MB

Sparse padding mask

def apply_sparse_padding_mask(
    attention_scores: torch.Tensor,    # (batch, heads, seq_len, seq_len)
    padding_mask: torch.Tensor,        # (batch, seq_len) - True = pad token
    fill_value: float = -1e9
) -> torch.Tensor:
    """
    Apply padding mask without materializing full (batch, heads, seq_len, seq_len) tensor.
    """
    # Expand padding mask to match attention scores shape
    # padding_mask: (batch, 1, 1, seq_len) after unsqueeze
    expanded_mask = padding_mask[:, None, None, :]  # broadcast over heads and query positions
    attention_scores = attention_scores.masked_fill(expanded_mask, fill_value)
    return attention_scores

# Flash Attention (implemented in PyTorch 2.x) handles sparsity automatically
# It processes attention in tiles and skips all-masked tiles

Part 5 - Sparse Graph Adjacency Matrices

For graph neural networks, the adjacency matrix $A$ and normalized Laplacian $\hat{L}$ are stored in sparse formats.

import scipy.sparse as sp
import numpy as np

def build_graph_adjacency(edges: list, n_nodes: int) -> sp.csr_matrix:
    """
    Build sparse adjacency matrix from edge list.
    Typical for GNN preprocessing.

    edges: list of (u, v) tuples (undirected)
    """
    rows, cols, data = [], [], []
    for u, v in edges:
        rows.append(u); cols.append(v); data.append(1.0)
        rows.append(v); cols.append(u); data.append(1.0)  # symmetric

    return sp.csr_matrix(
        (data, (rows, cols)),
        shape=(n_nodes, n_nodes),
        dtype=np.float32
    )

def normalize_adjacency_gcn(A: sp.spmatrix) -> sp.spmatrix:
    """
    GCN normalization: Ã = D^{-1/2} (A + I) D^{-1/2}
    Where D is the degree matrix of (A + I).
    This is the propagation rule in Kipf & Welling (2017).
    """
    # Add self-loops: Â = A + I
    A_hat = A + sp.eye(A.shape[0], dtype=np.float32)

    # Degree matrix D (as vector of diagonal entries)
    degrees = np.array(A_hat.sum(axis=1)).flatten()
    d_inv_sqrt = np.power(degrees, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0   # Handle isolated nodes

    # D^{-1/2} as sparse diagonal matrix
    D_inv_sqrt = sp.diags(d_inv_sqrt)

    # Symmetrically normalized: D^{-1/2} Â D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return A_norm.tocsr()

def sparse_graph_conv(
    A_norm: sp.csr_matrix,
    H: np.ndarray,   # (n_nodes, in_features)
    W: np.ndarray    # (in_features, out_features)
) -> np.ndarray:
    """
    One GCN layer: H_new = σ(Ã H W)
    Sparse A_norm × dense H - key operation in GNNs.
    """
    # Sparse-dense matrix multiply: O(nnz × out_features)
    AH = A_norm @ H           # (n_nodes, in_features) - sparse × dense
    return np.maximum(0, AH @ W)  # ReLU activation

# Example: Cora citation network
# 2708 nodes, 5429 edges - typical benchmark
n_nodes = 2708
n_features = 1433
n_classes = 7

# Random graph for illustration (use real datasets in practice)
np.random.seed(42)
n_edges = 5429
edge_list = [(np.random.randint(0, n_nodes), np.random.randint(0, n_nodes))
             for _ in range(n_edges)]

A = build_graph_adjacency(edge_list, n_nodes)
A_norm = normalize_adjacency_gcn(A)

print(f"Adjacency matrix: {A.shape}, {A.nnz} non-zeros, density={A.nnz/(n_nodes**2):.4f}")
print(f"Memory (CSR): {(A.data.nbytes + A.indices.nbytes + A.indptr.nbytes) / 1e3:.1f} KB")
print(f"Memory (dense equivalent): {n_nodes**2*4 / 1e6:.1f} MB")

# GCN forward pass
H = np.random.randn(n_nodes, n_features).astype(np.float32)
W1 = np.random.randn(n_features, 64).astype(np.float32) * 0.01
H1 = sparse_graph_conv(A_norm, H, W1)
print(f"After GCN layer 1: {H1.shape}")  # (2708, 64)

Part 6 - Sparse Embeddings in Recommendation Systems

Recommendation systems have huge embedding tables with billions of items. Most interactions are sparse.

import numpy as np
import scipy.sparse as sp

class SparseEmbeddingTable:
    """
    Memory-efficient sparse embedding lookup.
    Used in collaborative filtering and recommendation systems.
    """
    def __init__(self, n_items: int, n_users: int, embed_dim: int):
        self.n_items = n_items
        self.n_users = n_users
        self.embed_dim = embed_dim

        # User-item interaction matrix (sparse)
        # For 1M users × 10M items × avg 100 interactions:
        # Dense: 1M × 10M × 4 bytes = 40 TB
        # Sparse: 100M × (4+4+4) bytes ≈ 1.2 GB
        self.interactions = sp.csr_matrix(
            (n_users, n_items), dtype=np.float32
        )

    def add_interactions(self, user_ids, item_ids, values):
        """Add user-item interactions."""
        interaction_update = sp.csr_matrix(
            (values, (user_ids, item_ids)),
            shape=(self.n_users, self.n_items)
        )
        self.interactions += interaction_update

    def user_based_cf(self, user_id: int, k: int = 10) -> np.ndarray:
        """
        Find top-k items for user via item-based collaborative filtering.
        Uses sparse matrix operations for efficiency.
        """
        # Get user's interaction vector (sparse)
        user_vec = self.interactions[user_id]   # (1, n_items) sparse

        # Compute similarity to all users using sparse dot product
        # (n_users, n_items) × (n_items, 1) = (n_users, 1)
        similarities = self.interactions @ user_vec.T   # sparse × sparse = sparse

        # Get most similar users (excluding self)
        sims = similarities.toarray().flatten()
        sims[user_id] = -np.inf
        similar_users = np.argsort(sims)[-k:]

        # Aggregate their items
        neighbor_interactions = self.interactions[similar_users]
        scores = np.array(neighbor_interactions.sum(axis=0)).flatten()

        # Remove items user already interacted with
        user_items = self.interactions[user_id].nonzero()[1]
        scores[user_items] = -np.inf

        return np.argsort(scores)[-k:][::-1]

Part 7 - Sparse Weight Matrices in Pruned Networks

Model pruning induces weight sparsity. A pruned network can be represented with sparse matrices for efficient inference.

import numpy as np
import scipy.sparse as sp

def magnitude_prune(weight: np.ndarray, sparsity: float) -> sp.csr_matrix:
    """
    Magnitude-based pruning: set smallest weights to zero.

    sparsity: fraction of weights to prune (e.g., 0.9 = 90% sparsity)
    """
    threshold = np.percentile(np.abs(weight), 100 * sparsity)
    mask = np.abs(weight) >= threshold
    pruned = weight * mask
    return sp.csr_matrix(pruned)

def structured_sparsity(weight: np.ndarray, n_heads: int) -> sp.bsr_matrix:
    """
    Block-structured sparsity for GPU efficiency.
    Prunes entire attention heads or weight blocks.
    BSR (Block Sparse Row) format is better than CSR for structured patterns.
    """
    block_size = weight.shape[0] // n_heads
    blocks = weight.reshape(n_heads, block_size, -1)

    # Prune heads with smallest L2 norm
    head_norms = np.linalg.norm(blocks, axis=(1, 2))
    prune_threshold = np.percentile(head_norms, 50)  # Prune 50% of heads

    for i in range(n_heads):
        if head_norms[i] < prune_threshold:
            blocks[i] = 0.0

    pruned = blocks.reshape(weight.shape)
    return sp.bsr_matrix(pruned, blocksize=(block_size, block_size))

# Memory comparison for a pruned transformer layer
d_model = 1024
weight = np.random.randn(d_model, d_model).astype(np.float32)

for sparsity_level in [0.5, 0.9, 0.99]:
    sparse_w = magnitude_prune(weight, sparsity_level)
    dense_bytes = weight.nbytes
    sparse_bytes = sparse_w.data.nbytes + sparse_w.indices.nbytes + sparse_w.indptr.nbytes
    print(f"Sparsity {sparsity_level:.0%}: dense={dense_bytes/1e6:.1f} MB, "
          f"sparse={sparse_bytes/1e6:.1f} MB, "
          f"ratio={dense_bytes/sparse_bytes:.1f}x")

:::warning Sparse matrices are not always faster in practice Sparse-dense matrix multiplication often has poor cache performance due to irregular memory access patterns. For GPUs, dense operations with cuBLAS use tensor cores that achieve teraflop throughput - sparse equivalents may actually be slower until sparsity exceeds ~95%.

Modern solutions: structured sparsity (block-sparse, N:M sparsity supported by NVIDIA Ampere+ GPUs), or semi-structured sparsity where 50% of every 2×2 block is zero - supported natively by Ampere's sparse tensor cores. :::

Interview Questions

Q1: Explain the CSR format. What is stored in data, indices, and indptr?

CSR (Compressed Sparse Row) stores a sparse matrix in three arrays:

data[k]: The value of the $k$ -th non-zero entry (reading left-to-right, top-to-bottom)
indices[k]: The column index of the $k$ -th non-zero entry
indptr[i]: The position in data/indices where row $i$ begins; indptr[i+1] is where row $i$ ends

Example: Matrix $A$ with non-zeros at positions (0,2)=5, (1,0)=3, (1,2)=7, (2,1)=4:

data    = [5,   3,  7,   4]
indices = [2,   0,  2,   1]
indptr  = [0,   1,  3,   4]

Row 0 non-zeros: data[0:1] = [5] at columns indices[0:1] = [2] Row 1 non-zeros: data[1:3] = [3, 7] at columns indices[1:3] = [0, 2] Row 2 non-zeros: data[3:4] = [4] at columns indices[3:4] = [1]

Memory: $2 \times nnz \times \text{dtype\_size} + (n+1) \times \text{int\_size}$ - approximately twice the dense memory for just non-zeros.

Matrix-vector product efficiency: Each row $i$ computation accesses only data[indptr[i]:indptr[i+1]] and the corresponding columns of $x$ . For a dense row, this is $O(n)$ ; for a sparse row with $k$ non-zeros, it is $O(k)$ . Total: $O(nnz)$ vs $O(n^2)$ for dense.

Q2: When should you use CSR vs CSC vs COO, and how do you convert between them?

COO (Coordinate format): Use during construction. Stores (row, col, data) triplets. $O(nnz)$ for appending. Supports duplicates (values are summed on conversion). Use when building a matrix incrementally from a stream of (i, j, value) triplets.

CSR (Compressed Sparse Row): Use for:

Row operations: extracting row $i$ is $O(nnz_i)$ , fast
Matrix-vector products $Ax$ : row-oriented computation maps well to cache
Most sparse linear system solvers (after internal conversion)

CSC (Compressed Sparse Column): Use for:

Column operations: extracting column $j$ is $O(nnz_j)$ , fast
Left-multiplication $x^T A$ : equivalent to $A^T x$ on CSC
Sparse direct solvers (SuperLU, UMFPACK internally use CSC/CSR)

Conversion costs: All conversions are $O(nnz + m + n)$ - cheap.

import scipy.sparse as sp
A = sp.coo_matrix(...)      # Start here if building incrementally

A_csr = A.tocsr()            # For Ax operations - O(nnz + n)
A_csc = A.tocsc()            # For A^Tx operations - O(nnz + m)
A_coo = A_csr.tocoo()        # Back to COO

# Rule of thumb: always finish construction in COO or LIL,
# convert to CSR for all subsequent arithmetic.

Q3: How does sparsity in attention masks affect transformer training efficiency?

In a standard transformer with sequence length $T$ :

Full attention matrix: $T \times T$ - $O(T^2)$ memory and compute
For causal (autoregressive) attention: only the lower triangle is computed - 50% saving
For padding masks: padding tokens (typically 30-50% in NLP batches) don't need attention

Dense computation cost: Even with masking, naive dense attention computes all $T^2$ dot products and then sets masked positions to $-\infty$ . The softmax and value-weighted sum still process all positions.

Sparse attention approaches:

FlashAttention: Tile-based computation that never materializes the full $T^2$ matrix. Skips tiles where all entries are masked. Reduces memory from $O(T^2)$ to $O(T)$ .
Longformer/BigBird: Replace dense attention with sparse patterns (sliding window + global tokens). Reduces $O(T^2)$ to $O(T \cdot k)$ where $k$ is window size.
Block-sparse attention: Explicitly mask out entire $B \times B$ blocks. GPU can skip sparse blocks completely.

Memory impact: For $T = 2048$ , batch = 32, 16 heads:

Dense attention matrix: $32 \times 16 \times 2048^2 \times 4$ bytes = 8.6 GB
With 50% sparsity: 4.3 GB
FlashAttention: $O(T)$ - only ~200 MB

This is why long-context models (100K+ tokens) require sparse attention - dense attention would need terabytes of memory.

Q4: How is sparse matrix-vector multiplication more efficient than dense, and what are its limitations?

Dense matrix-vector product: $y = Ax$ where $A \in \mathbb{R}^{n \times n}$ requires $n^2$ multiplications - $O(n^2)$ operations.

Sparse matrix-vector product (CSR format):

for i in range(n):
    y[i] = sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i+1]))

Total operations: $O(nnz)$ where $nnz \ll n^2$ for sparse matrices.

For density $\rho$ (fraction of non-zeros):

Speedup ratio: $n^2 / nnz = 1/\rho$
For $\rho = 0.01$ : 100× speedup
For $\rho = 0.0001$ : 10,000× speedup

Limitations:

Memory access patterns: Sparse formats access x[indices[k]] where indices are irregular. This causes cache misses - expensive on CPU/GPU. Dense operations use contiguous memory with predictable access patterns and benefit from BLAS GEMM optimizations.
GPU efficiency: Dense matrix multiplications use tensor cores (A100: 312 TFLOP/s for dense GEMM vs ~50 TFLOP/s for sparse). For sparsity below ~90%, sparse GPU kernels can be slower than dense.
Structured sparsity required for GPU speedup: NVIDIA's Ampere sparse tensor cores require N:M structured sparsity (e.g., 2:4 = 2 non-zeros per group of 4). Random unstructured sparsity doesn't benefit from hardware acceleration.
Overhead for small matrices: For matrices smaller than ~1000×1000, sparse format overhead (pointer chasing, indirection) can make sparse slower than dense.

Q5: What is the Graph Laplacian and why is it always stored as a sparse matrix?

Graph Laplacian: For a graph $G = (V, E)$ with adjacency matrix $A$ and degree matrix $D$ (diagonal matrix with $D_{ii} = \sum_j A_{ij}$ ):

$L = D - A$

Properties:

Symmetric: $L = L^T$
Positive semi-definite: $x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2 \geq 0$
$L_{ii} = \text{degree}(i)$ , $L_{ij} = -1$ if $(i,j) \in E$ , else 0
$L\mathbf{1} = 0$ (constant functions in the null space)

Why sparse: A graph with $n$ nodes and $m$ edges has $L$ with exactly $n + 2m$ non-zeros (diagonal + two symmetric off-diagonal entries per edge). For sparse graphs where $m = O(n)$ (each node has bounded degree):

$nnz(L) = O(n)$ vs $n^2$ for dense
Storage: $O(n)$ vs $O(n^2)$

For the Cora citation network: $n = 2708$ , $m = 5429$ , so $nnz = 2708 + 10858 = 13566$ - vs $2708^2 = 7,333,264$ for dense. Sparse saves 540× memory.

ML use cases:

Spectral clustering: solve $Lv = \lambda v$ for smallest eigenvalues → stored sparse, eigsh finds k smallest efficiently
GCN propagation: $H_{new} = \tilde{L} H W$ where $\tilde{L}$ is normalized Laplacian
Manifold learning (Laplacian Eigenmaps): graph Laplacian from k-NN graph
Semi-supervised learning: label propagation using graph Laplacian regularization

Quick Reference

Format	Best for	Row access	Col access	Arithmetic
COO	Construction	Slow	Slow	None
CSR	$Ax$ , row slicing	$O(nnz_i)$	Slow	Fast
CSC	$A^Tx$ , col slicing	Slow	$O(nnz_j)$	Fast
LIL	Incremental fill	Fast	Slow	None
DOK	Random access	Slow	Slow	None
BSR	Block-structured	Block	Block	Fast

Sparsity	Dense vs Sparse	GPU recommendation
< 50%	Dense faster	Always use dense
50–90%	Similar	Dense usually faster on GPU
90–99%	Sparse faster	Use structured sparsity for GPU
> 99%	Sparse much faster	Always sparse

Module 08 complete - next: Module 09: Graph Theory →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Sparse Matrix Methods demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - The Density of Real ML Matrices​

Part 2 - Sparse Storage Formats​

COO (Coordinate Format)​

CSR (Compressed Sparse Row)​

CSC (Compressed Sparse Column)​

Other formats​

Format selection guide​

Part 3 - Sparse Operations in SciPy​

Part 4 - Attention Masks in Transformers​

Dense vs sparse attention masks​

Sparse padding mask​

Part 5 - Sparse Graph Adjacency Matrices​

Part 6 - Sparse Embeddings in Recommendation Systems​

Part 7 - Sparse Weight Matrices in Pruned Networks​

Interview Questions​

Quick Reference​