Skip to main content

Hierarchical Clustering

import ReadingTime from '@site/src/components/ReadingTime';

:::note Interview Relevance - High Hierarchical clustering appears in interviews for roles that touch NLP (document taxonomy), bioinformatics (gene expression clustering), recommendation systems, and any domain where the number of clusters is unknown. The dendrogram interpretation, linkage trade-offs, and scalability strategies are high-frequency topics. :::

The Real Interview Moment

You have just joined a health-tech startup that is building a drug discovery platform. The genomics team has run RNA sequencing on 800 tumor samples from 14 different cancer types. Each sample has expression readings for 20,000 genes. The lead scientist asks you: "Can you cluster these samples? We want to see if there are subtypes we haven't discovered yet - subtypes that cut across traditional cancer type labels."

You open your laptop. K-means? The scientist immediately asks, "How many clusters?" You don't know - that's the whole point. DBSCAN? The data is high-dimensional and you'd need to tune the epsilon parameter for 20,000 dimensions. Then it clicks: hierarchical clustering. Run it once, produce a dendrogram, and let the scientist choose the granularity. Two big clusters at the top level might separate slow-growing from fast-growing tumors. Cut lower and you get 14 clusters matching the known cancer types. Cut lower still and you find a previously unknown subtype - 12 samples that cluster tightly together and don't fit any of the 14 known labels. That's a paper.

This lesson covers the algorithm that made that discovery possible - how it works, why the linkage criterion is everything, and how to apply it in production without it grinding to a halt on large datasets.

Why This Exists - The Problem K-Means Cannot Solve

K-means requires you to specify KK before you start. This is a fundamental limitation when exploring data: you don't know how many clusters exist. You have to run K-means for K=2,3,,20K = 2, 3, \ldots, 20, look at an elbow plot, and guess. And even then, K-means only gives you one flat partition - no information about which clusters are similar to each other.

Hierarchical clustering solves both problems:

  1. No need to specify K - run once, get the full spectrum from 1 cluster to nn clusters
  2. Reveals structure at multiple scales - a dendrogram shows not just "how many clusters" but "which clusters are similar to each other"
  3. Deterministic - unlike K-means, no random initialization means reproducible results every time

The cost is computational: K-means is O(nKd)O(n \cdot K \cdot d) per iteration. Hierarchical clustering is O(n3)O(n^3) naive or O(n2logn)O(n^2 \log n) with optimizations. For large nn (millions), you need approximations.

Two Flavors: Agglomerative vs Divisive

Agglomerative (Bottom-Up)

Start with every data point as its own cluster (nn clusters). Repeatedly merge the two most similar clusters until everything belongs to one cluster. This produces a tree - a dendrogram - recording the full merge history.

Computational complexity: O(n3)O(n^3) naive (check all pairs at each step), O(n2logn)O(n^2 \log n) with a priority queue and the Lance-Williams update formula.

When to use: 99% of the time. The standard, well-optimized, widely supported approach.

Divisive (Top-Down)

Start with all nn points in one cluster. Repeatedly split the most heterogeneous cluster until each point stands alone.

DIANA (DIvisive ANAlysis) is the classic algorithm:

  1. Start with all points in one cluster
  2. Find the point with the highest average dissimilarity to all others - this becomes the first "splinter" cluster
  3. Move each remaining point to the splinter cluster if it is closer to the splinter than to the main cluster
  4. Repeat until the split produces two final clusters
  5. Recurse on the largest cluster

Computational complexity: O(2n)O(2^n) for optimal splits (exponential - not practical). DIANA uses greedy approximations, reducing to O(n2)O(n^2) per level, but with O(n)O(n) levels in the worst case.

When to use: Rarely. Only when you specifically need top-down structure and can tolerate the approximations in DIANA. In most production code, agglomerative with Ward linkage is the default.

Linkage Criteria - The Critical Design Choice

In agglomerative clustering, at each step you merge the two clusters with the smallest inter-cluster distance. But "distance between two clusters" is ambiguous - a cluster is a set of points, not a single point. Linkage criteria define how to measure this.

Single Linkage

d(Ci,Cj)=minxCi,yCjd(x,y)d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)

The distance between two clusters is the distance between their closest pair of points.

  • Advantage: Can discover non-spherical, elongated, filamentary clusters. It finds chains and manifolds.
  • Disadvantage: Prone to the chaining effect - a single point bridging two large groups causes them to merge prematurely. Very sensitive to noise.
  • Use case: Non-convex cluster shapes, phylogenetic trees, detecting strings/chains in spatial data.

Complete Linkage

d(Ci,Cj)=maxxCi,yCjd(x,y)d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)

The distance between two clusters is the distance between their most distant pair of points.

  • Advantage: Resists chaining. Produces compact, roughly equal-diameter clusters.
  • Disadvantage: A single outlier in a cluster dramatically increases its distance to all other clusters - very sensitive to outliers.
  • Use case: Document clustering, situations where clusters must be tight and well-separated.

Average Linkage (UPGMA)

d(Ci,Cj)=1CiCjxCiyCjd(x,y)d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)

The distance is the average distance over all cross-cluster point pairs.

  • Advantage: Compromise between single and complete. More robust to outliers than complete linkage.
  • Disadvantage: Less interpretable geometric meaning. Computationally heavier to update naively (O(CiCj)O(|C_i| \cdot |C_j|) per distance computation).
  • Use case: Bioinformatics (UPGMA is the standard for building phylogenetic trees from distance matrices).

Ward Linkage - The Standard Choice

Ward's method does not use a distance between clusters at all. Instead, it minimizes the increase in within-cluster sum of squares (WCSS) at each merge.

The Ward distance between clusters CiC_i and CjC_j is:

Δ(Ci,Cj)=CiCjCi+Cjμiμj2\Delta(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2

where μi\mu_i and μj\mu_j are the cluster centroids. This is the increase in WCSS that would result from merging CiC_i and CjC_j.

At each step, Ward merges the pair that minimizes Δ(Ci,Cj)\Delta(C_i, C_j).

Why Ward is the default:

  • Same objective as K-means (minimize WCSS) - Ward hierarchical clustering and K-means often give similar results
  • Produces compact, roughly equal-sized, well-separated clusters
  • Most interpretable: you can directly see the WCSS increase as the dendrogram y-axis value
  • No chaining effect
  • Well-supported in scipy and sklearn

Limitation: Only valid for Euclidean distance. You cannot use Ward with cosine distance or other non-Euclidean metrics. For text data or other domains where cosine similarity is natural, use average or complete linkage.

Linkage Decision Tree:

Data has non-spherical clusters?

├─ Yes → Single linkage (but beware noise)

└─ No → Is your data high-dimensional text/sparse?

├─ Yes → Average or Complete linkage (non-Euclidean OK)

└─ No → Ward linkage (default choice)

The Lance-Williams Update Formula

All four linkage criteria can be expressed as instances of the same update formula, discovered by Lance and Williams (1967). After merging clusters CiC_i and CjC_j into CijC_{ij}, the distance from CijC_{ij} to any other cluster CkC_k is:

d(Cij,Ck)=αid(Ci,Ck)+αjd(Cj,Ck)+βd(Ci,Cj)+γd(Ci,Ck)d(Cj,Ck)d(C_{ij}, C_k) = \alpha_i \cdot d(C_i, C_k) + \alpha_j \cdot d(C_j, C_k) + \beta \cdot d(C_i, C_j) + \gamma |d(C_i, C_k) - d(C_j, C_k)|

The parameters (αi,αj,β,γ)(\alpha_i, \alpha_j, \beta, \gamma) differ by linkage:

Linkageαi\alpha_iαj\alpha_jβ\betaγ\gamma
Single1/21/21/21/2001/2-1/2
Complete1/21/21/21/200+1/2+1/2
AverageCiCij\frac{\|C_i\|}{\|C_{ij}\|}CjCij\frac{\|C_j\|}{\|C_{ij}\|}0000
WardCi+CkCij+Ck\frac{\|C_i\|+\|C_k\|}{\|C_{ij}\|+\|C_k\|}Cj+CkCij+Ck\frac{\|C_j\|+\|C_k\|}{\|C_{ij}\|+\|C_k\|}CkCij+Ck\frac{-\|C_k\|}{\|C_{ij}\|+\|C_k\|}00

This formula allows efficient O(n2)O(n^2) update: instead of recomputing all pairwise distances from scratch after each merge, you can update the distance matrix in-place using only the existing distances. This is why scipy's implementation is efficient.

The Dendrogram: Reading It Correctly

How to read the dendrogram:

  • Y-axis (height): the linkage distance at which two clusters merged. Higher = more dissimilar clusters.
  • Large vertical gaps: regions where no merges happen over a large range of heights. A large gap before height hh means that cutting at hh separates very natural clusters (they were close together internally but far from each other).
  • Optimal cut: look for the largest vertical gap in the dendrogram. Cut just below it to get the most natural flat clustering.
  • Number of clusters: after cutting at height hh, count the number of distinct branches below the cut line.

Reading dendrogram gaps programmatically:

import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=42)
Z = linkage(X, method='ward')

# Z[:, 2] contains the merge heights
merge_heights = Z[:, 2]

# Gaps between consecutive merges
gaps = np.diff(merge_heights)

# Largest gap → the "natural" cut point
max_gap_idx = np.argmax(gaps)
natural_cut = merge_heights[max_gap_idx]
n_clusters_natural = len(merge_heights) - max_gap_idx # clusters below this cut

print(f"Merge heights (last 10): {merge_heights[-10:].round(3)}")
print(f"Largest gap: {gaps[max_gap_idx]:.3f}")
print(f"Natural cut height: {natural_cut:.3f}")
print(f"Suggested K: {n_clusters_natural}")

Full NumPy Implementation: Distance Matrix and Merge Steps

Understanding the algorithm at the matrix level is essential for interviews. Here is agglomerative clustering from scratch using a distance matrix:

import numpy as np
from scipy.spatial.distance import cdist

def agglomerative_single_linkage(X: np.ndarray) -> list:
"""
Single-linkage agglomerative clustering from scratch.
Returns merge history: list of (cluster_i, cluster_j, distance, new_size)

For demonstration - O(n^3) naive implementation.
scipy.cluster.hierarchy.linkage() is the production version.
"""
n = X.shape[0]

# Pairwise distance matrix
dist = cdist(X, X, metric='euclidean')
np.fill_diagonal(dist, np.inf) # don't merge a point with itself

# Each point starts as its own cluster
# clusters[i] = list of point indices in cluster i
clusters = {i: [i] for i in range(n)}
active = set(range(n)) # currently-active cluster IDs
next_id = n # next cluster ID to assign after a merge

merge_history = []

# Distance matrix indexed by cluster_id
# We'll use a dict-of-dicts for flexibility
D = {}
for i in range(n):
D[i] = {}
for j in range(n):
if i != j:
D[i][j] = dist[i, j]

for _ in range(n - 1):
# Find closest pair among active clusters
min_dist = np.inf
merge_a, merge_b = -1, -1

active_list = list(active)
for i in range(len(active_list)):
for j in range(i + 1, len(active_list)):
ci, cj = active_list[i], active_list[j]
if D[ci][cj] < min_dist:
min_dist = D[ci][cj]
merge_a, merge_b = ci, cj

# Record the merge
new_size = len(clusters[merge_a]) + len(clusters[merge_b])
merge_history.append((merge_a, merge_b, min_dist, new_size))

# Create new cluster
clusters[next_id] = clusters[merge_a] + clusters[merge_b]
D[next_id] = {}

# Update distances to the new cluster (single linkage = min)
for k in active:
if k != merge_a and k != merge_b:
D[next_id][k] = min(D[merge_a][k], D[merge_b][k])
D[k][next_id] = D[next_id][k]

# Remove merged clusters, add new one
active.discard(merge_a)
active.discard(merge_b)
active.add(next_id)
next_id += 1

return merge_history


# Example: 5 points
np.random.seed(42)
X_small = np.array([[0, 0], [1, 0.5], [0.5, 1], [5, 5], [5.5, 4.5]])

history = agglomerative_single_linkage(X_small)
print("Merge history (cluster_i, cluster_j, distance, new_size):")
for step in history:
print(f" Merge clusters {step[0]} and {step[1]} at distance {step[2]:.3f} → size {step[3]}")

Scipy Implementation: Dendrograms and Cutting

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, cophenet
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# ── Generate data ──────────────────────────────────────────────────────────────
X, y_true = make_blobs(n_samples=200, centers=5, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# ── Compute linkage matrix ─────────────────────────────────────────────────────
# Z shape: (n-1, 4) - each row: [cluster_i, cluster_j, distance, size_new_cluster]
Z = linkage(X, method='ward', metric='euclidean')

print(f"Linkage matrix shape: {Z.shape}")
print(f"First few merges:\n{Z[:5].round(4)}")
print(f"Last few merges (highest-level):\n{Z[-5:].round(4)}")

# ── Cophenetic correlation ─────────────────────────────────────────────────────
# Measures how faithfully the dendrogram preserves pairwise distances
# Range: 0 to 1 - above 0.75 is considered a good fit
c_corr, cophenetic_dists = cophenet(Z, pdist(X))
print(f"\nCophenetic correlation coefficient: {c_corr:.4f}")

# ── Plot dendrogram ────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Full dendrogram
dendrogram(
Z,
ax=axes[0],
color_threshold=6.0, # colors subtrees below this height differently
leaf_rotation=90,
leaf_font_size=5,
above_threshold_color='gray',
)
axes[0].set_title("Ward Linkage Dendrogram (200 points)")
axes[0].set_xlabel("Sample Index")
axes[0].set_ylabel("Ward Distance (WCSS increase)")
# Draw a horizontal cut line
axes[0].axhline(y=6.0, c='red', linestyle='--', linewidth=1.5, label='Cut at 6.0')
axes[0].legend()

# Truncated dendrogram (readable for large n)
dendrogram(
Z,
ax=axes[1],
truncate_mode='lastp', # show only the last p merges
p=15, # last 15 merges
leaf_rotation=45,
show_contracted=True, # show sizes of contracted subtrees in parentheses
show_leaf_counts=True,
)
axes[1].set_title("Truncated Dendrogram (last 15 merges)")
axes[1].set_xlabel("Cluster (number of points in parentheses)")
axes[1].set_ylabel("Ward Distance")

plt.tight_layout()
plt.show()

Cutting the Dendrogram: Two Methods

from scipy.cluster.hierarchy import fcluster
import numpy as np

# ── Method 1: Cut at a distance threshold ─────────────────────────────────────
# All merges above height t form distinct clusters
labels_by_distance = fcluster(Z, t=6.0, criterion='distance')
print(f"Clusters from distance cut at 6.0: {len(np.unique(labels_by_distance))}")

# ── Method 2: Request exactly K clusters ──────────────────────────────────────
for K in [3, 4, 5, 6]:
labels_k = fcluster(Z, t=K, criterion='maxclust')
print(f"K={K}: cluster sizes = {np.bincount(labels_k[labels_k > 0])}")

# ── Method 3: Automatic cut via largest gap ────────────────────────────────────
merge_heights = Z[:, 2]
gaps = np.diff(merge_heights)
top3_gaps = np.argsort(gaps)[-3:][::-1]

print("\nTop 3 natural cut points:")
for idx in top3_gaps:
cut_height = merge_heights[idx]
n_clusters = len(merge_heights) - idx
print(f" Height={cut_height:.3f}, K={n_clusters}, gap_size={gaps[idx]:.3f}")

Cophenetic Correlation - Measuring Dendrogram Quality

The cophenetic correlation coefficient measures how faithfully the dendrogram preserves the original pairwise distances. For each pair of points i,ji, j, the cophenetic distance is the height at which they first merge in the dendrogram.

c=i<j(dijdˉ)(cijcˉ)i<j(dijdˉ)2i<j(cijcˉ)2c = \frac{\sum_{i<j}(d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i<j}(d_{ij}-\bar{d})^2 \cdot \sum_{i<j}(c_{ij}-\bar{c})^2}}

where dijd_{ij} is the original Euclidean distance and cijc_{ij} is the cophenetic (dendrogram) distance.

Interpretation:

  • c>0.75c > 0.75: good dendrogram - the hierarchy faithfully represents the data structure
  • c0.5c \approx 0.50.750.75: moderate - the linkage may not be the best choice for this data
  • c<0.5c < 0.5: poor - the dendrogram is misleading; try a different linkage criterion
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
import numpy as np

# Compare cophenetic correlation across linkage methods
X_data = ... # your data, shape (n, d)
dist_condensed = pdist(X_data, metric='euclidean')

for method in ['ward', 'complete', 'average', 'single']:
Z_m = linkage(X_data, method=method)
c, _ = cophenet(Z_m, dist_condensed)
print(f"{method:10s} cophenetic correlation = {c:.4f}")

# Interpretation: the linkage with the highest cophenetic correlation
# is the one whose dendrogram most accurately represents the data.

Chaining Effect in Single Linkage

The chaining effect is the most important pathology in hierarchical clustering. With single linkage, a thin bridge of points between two large clusters - even a single intermediate point - can cause them to merge at a much lower height than their "true" separation would suggest.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Create two well-separated clusters with a single bridging point
rng = np.random.default_rng(42)
cluster1 = rng.normal(loc=[0, 0], scale=0.5, size=(40, 2))
cluster2 = rng.normal(loc=[5, 5], scale=0.5, size=(40, 2))
bridge = np.array([[2.5, 2.5]]) # single bridging point
X_chaining = np.vstack([cluster1, cluster2, bridge])

# Cluster with single linkage - chain effect!
Z_single = linkage(X_chaining, method='single')
Z_ward = linkage(X_chaining, method='ward')

fig, axes = plt.subplots(1, 3, figsize=(21, 6))

# Dendrogram - single linkage
dendrogram(Z_single, ax=axes[0], color_threshold=0.8, leaf_font_size=6)
axes[0].set_title("Single Linkage Dendrogram\n(chaining - bridge forces early merge)")
axes[0].set_ylabel("Distance")

# Dendrogram - Ward linkage
dendrogram(Z_ward, ax=axes[1], color_threshold=4.0, leaf_font_size=6)
axes[1].set_title("Ward Linkage Dendrogram\n(clean separation)")
axes[1].set_ylabel("Ward Distance")

# Scatter: single linkage coloring
labels_s = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X_chaining)
labels_w = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X_chaining)
axes[2].scatter(X_chaining[labels_w==0, 0], X_chaining[labels_w==0, 1],
c='steelblue', label='Ward cluster 0')
axes[2].scatter(X_chaining[labels_w==1, 0], X_chaining[labels_w==1, 1],
c='salmon', label='Ward cluster 1')
axes[2].scatter(*bridge.T, c='red', s=100, zorder=5, marker='*', label='Bridge point')
axes[2].set_title("Data: Bridge Point Causes\nSingle-Linkage Chaining")
axes[2].legend()

plt.tight_layout()
plt.show()

Linkage Comparison on Real Shapes

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons, make_circles
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

datasets = {
'Blobs (compact)': make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)[0],
'Moons (non-convex)': make_moons(n_samples=300, noise=0.1, random_state=42)[0],
'Circles': make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)[0],
}
linkages = ['ward', 'complete', 'average', 'single']

fig, axes = plt.subplots(len(datasets), len(linkages), figsize=(20, 12))

for row, (name, X_d) in enumerate(datasets.items()):
for col, method in enumerate(linkages):
ax = axes[row, col]

metric = 'euclidean'
agg = AgglomerativeClustering(n_clusters=2, metric=metric, linkage=method)
labels = agg.fit_predict(X_d)
sil = silhouette_score(X_d, labels)

ax.scatter(X_d[:, 0], X_d[:, 1], c=labels, cmap='bwr', s=10, alpha=0.7)
ax.set_title(f"{method}\nsil={sil:.3f}", fontsize=9)
if col == 0:
ax.set_ylabel(name, fontsize=10)
ax.axis('off')

plt.suptitle("Linkage Method × Dataset Shape", fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

Key takeaways from this comparison:

  • Ward is best for compact, globular clusters (blobs)
  • Single linkage is best for elongated chains and manifolds (moons, circles) - but degrades badly with noise
  • Complete linkage is robust but biased toward equal-diameter clusters

Time and Space Complexity

AlgorithmTimeSpaceNotes
Naive agglomerativeO(n3)O(n^3)O(n2)O(n^2)Recompute all pairwise distances each merge
With priority queue (scipy)O(n2logn)O(n^2 \log n)O(n2)O(n^2)Lance-Williams formula + heap
SLINK (single linkage)O(n2)O(n^2)O(n)O(n)Sibson's optimal single-linkage algorithm
CLINK (complete linkage)O(n2)O(n^2)O(n)O(n)Defays' algorithm - same efficiency
Divisive (DIANA)O(n2)O(n^2) per levelO(n2)O(n^2)Rarely used in practice

Memory wall: The distance matrix for n=10,000n = 10{,}000 points requires 10,0002×810{,}000^2 \times 8 bytes =800= 800 MB of RAM. For n=100,000n = 100{,}000, it is 80 GB - impossible on a single machine.

Scaling Strategies for Large Datasets

import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch, MiniBatchKMeans
from sklearn.neighbors import kneighbors_graph
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

def hierarchical_large_scale(X: np.ndarray, n_clusters: int,
max_points: int = 5000,
method: str = 'two_stage') -> np.ndarray:
"""
Scale hierarchical clustering to large datasets.

method='sample': cluster a sample, assign rest by nearest centroid
method='two_stage': K-means → centroids → hierarchical on centroids
method='connectivity': sparse connectivity to reduce computation
"""
n = X.shape[0]

if method == 'sample' and n > max_points:
# --- Strategy 1: Sample + Assign --------------------------------
# Sample max_points representative points
idx = np.random.choice(n, max_points, replace=False)
X_sample = X[idx]

# Run hierarchical on sample
Z = linkage(X_sample, method='ward')
sample_labels = fcluster(Z, t=n_clusters, criterion='maxclust') - 1

# Compute centroids of sample clusters
centroids = np.array([
X_sample[sample_labels == k].mean(axis=0)
for k in range(n_clusters)
])

# Assign all points to nearest centroid
from sklearn.metrics import pairwise_distances_argmin
all_labels = pairwise_distances_argmin(X, centroids)

return all_labels

elif method == 'two_stage' and n > max_points:
# --- Strategy 2: K-means → Centroids → Hierarchical ---------------
# Step 1: Over-cluster with K-means (n_init=1 for speed)
n_pre_clusters = min(max_points, n // 5)
kmeans = MiniBatchKMeans(n_clusters=n_pre_clusters, n_init=3, random_state=42)
pre_labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

# Step 2: Hierarchical on centroids (small matrix: max_points × max_points)
Z = linkage(centroids, method='ward')
centroid_labels = fcluster(Z, t=n_clusters, criterion='maxclust') - 1

# Step 3: Map original points through pre-cluster → centroid labels
all_labels = centroid_labels[pre_labels]

return all_labels

elif method == 'connectivity':
# --- Strategy 3: Sparse connectivity constraint -------------------
# Build k-NN graph - only nearby points can merge
connectivity = kneighbors_graph(X, n_neighbors=15, include_self=False)

agg = AgglomerativeClustering(
n_clusters=n_clusters,
connectivity=connectivity,
linkage='ward'
)
return agg.fit_predict(X)

else:
# Small enough: run directly
agg = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
return agg.fit_predict(X)


# Birch for very large datasets (millions of points)
def birch_clustering(X: np.ndarray, n_clusters: int) -> np.ndarray:
"""
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
Scales to millions of points via CF-tree pre-compression.

threshold: radius of each leaf node (smaller = more nodes = higher memory)
branching_factor: max children per internal node
"""
birch = Birch(
threshold=0.5,
branching_factor=50,
n_clusters=n_clusters # applies hierarchical clustering to leaf centroids
)
return birch.fit_predict(X)

Bioinformatics Application: Gene Expression Heatmap

Hierarchical clustering on gene expression data is one of the most cited applications in the literature. The standard visualization is a heatmap with dendrograms on both axes - clustering both samples (columns) and genes (rows).

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

def gene_expression_heatmap(expression_matrix: np.ndarray,
sample_labels: list = None,
gene_labels: list = None,
linkage_method: str = 'average',
figsize: tuple = (14, 10)):
"""
Hierarchical clustering of gene expression data with double dendrogram.

expression_matrix: (n_genes, n_samples) array of log2 fold changes
linkage_method: 'average' (UPGMA) is standard in bioinformatics
"""
n_genes, n_samples = expression_matrix.shape

if sample_labels is None:
sample_labels = [f"Sample_{i}" for i in range(n_samples)]
if gene_labels is None:
gene_labels = [f"Gene_{i}" for i in range(n_genes)]

# ── Cluster samples (columns) ──────────────────────────────────────────
# Use 1 - correlation as distance (bioinformatics standard)
# Correlation distance = 1 - pearson_r - captures co-expression patterns
sample_corr = np.corrcoef(expression_matrix.T) # (n_samples, n_samples)
sample_dist = 1 - sample_corr
np.fill_diagonal(sample_dist, 0)
sample_dist_condensed = squareform(sample_dist, checks=False)
Z_samples = linkage(sample_dist_condensed, method=linkage_method)

# ── Cluster genes (rows) ───────────────────────────────────────────────
gene_corr = np.corrcoef(expression_matrix) # (n_genes, n_genes)
gene_dist = 1 - gene_corr
np.fill_diagonal(gene_dist, 0)
gene_dist_condensed = squareform(gene_dist, checks=False)
Z_genes = linkage(gene_dist_condensed, method=linkage_method)

# ── Get leaf orderings ─────────────────────────────────────────────────
from scipy.cluster.hierarchy import leaves_list
sample_order = leaves_list(Z_samples)
gene_order = leaves_list(Z_genes)

# ── Reorder matrix ─────────────────────────────────────────────────────
X_reordered = expression_matrix[gene_order][:, sample_order]

# ── Plot ───────────────────────────────────────────────────────────────
fig = plt.figure(figsize=figsize)
gs = fig.add_gridspec(2, 2, width_ratios=[1, 5], height_ratios=[1, 5], hspace=0.02, wspace=0.02)

ax_col_dendro = fig.add_subplot(gs[0, 1]) # column (sample) dendrogram
ax_row_dendro = fig.add_subplot(gs[1, 0]) # row (gene) dendrogram
ax_heatmap = fig.add_subplot(gs[1, 1]) # main heatmap
fig.add_subplot(gs[0, 0]).set_visible(False) # empty corner

# Column dendrogram
dendrogram(Z_samples, ax=ax_col_dendro, orientation='top',
color_threshold=0.5, no_labels=True)
ax_col_dendro.set_title("Sample Clustering", fontsize=11)
ax_col_dendro.axis('off')

# Row dendrogram
dendrogram(Z_genes, ax=ax_row_dendro, orientation='left',
color_threshold=0.5, no_labels=True)
ax_row_dendro.axis('off')

# Heatmap
vmax = np.percentile(np.abs(X_reordered), 95)
im = ax_heatmap.imshow(X_reordered, aspect='auto', cmap='RdBu_r',
vmin=-vmax, vmax=vmax, interpolation='nearest')

# Labels
ax_heatmap.set_xticks(range(n_samples))
ax_heatmap.set_xticklabels([sample_labels[i] for i in sample_order],
rotation=90, fontsize=7)
ax_heatmap.set_yticks(range(n_genes))
ax_heatmap.set_yticklabels([gene_labels[i] for i in gene_order], fontsize=7)

plt.colorbar(im, ax=ax_heatmap, fraction=0.046, pad=0.04, label='log2 FC')

return fig, {'sample_order': sample_order, 'gene_order': gene_order,
'Z_samples': Z_samples, 'Z_genes': Z_genes}


# Simulate gene expression data (20 genes, 30 samples)
np.random.seed(42)
n_g, n_s = 20, 30
expression = np.random.randn(n_g, n_s)
# Create a block structure: genes 0-9 co-expressed in samples 0-14
expression[:10, :15] += 2.0
expression[10:, 15:] -= 1.5

sample_names = [f"Tumor_{i}" for i in range(15)] + [f"Normal_{i}" for i in range(15)]
gene_names = [f"GENE_{chr(65+i)}" for i in range(20)]

fig, cluster_info = gene_expression_heatmap(
expression, sample_labels=sample_names, gene_labels=gene_names,
linkage_method='average'
)
plt.suptitle("Gene Expression Heatmap with Hierarchical Clustering", fontsize=13, y=1.01)
plt.show()

K-Means vs Hierarchical Clustering: Decision Guide

PropertyK-MeansAgglomerative
Requires specifying KYes - must specify upfrontNo - choose K by cutting dendrogram
OutputSingle flat partitionFull hierarchy (dendrogram)
Algorithm typeIterative optimizationGreedy merge
ReproducibilityRandom init → nondeterministicDeterministic (given linkage)
Cluster shapesSpherical onlyDepends on linkage - single can find chains
Scales toMillions of points~50K points without approximations
Cluster hierarchyNoYes - reveals multi-scale structure
SpeedO(nKditers)O(n \cdot K \cdot d \cdot \text{iters})O(n2logn)O(n^2 \log n)
Best forLarge datasets, known KExploration, unknown K, need hierarchy

:::danger Do Not Scale Features Separately Per Cluster Always scale features globally (using StandardScaler on the entire training set) before running hierarchical clustering. If you accidentally scale features separately within each cluster, you are fitting the scaler on information from the test set's cluster assignments - a form of data leakage. Fit the scaler on the training data only, then transform all data. :::

:::warning Ward Linkage Requires Euclidean Distance Ward linkage is only mathematically valid with Euclidean distance. Do not pass metric='cosine' with linkage='ward' - sklearn will raise an error. For non-Euclidean metrics (cosine, correlation, Jaccard), use complete or average linkage. For text data, computing a cosine distance matrix first and then running average linkage is the standard approach. :::

YouTube Resources

ChannelVideo TitleWhy Watch
StatQuest with Josh StarmerHierarchical ClusteringClearest dendrogram explanation, visual intuition for Ward vs others
ritvikmathHierarchical ClusteringMath behind linkage criteria, Lance-Williams formula
Serrano.AcademyUnsupervised LearningFull unsupervised learning survey including HC
MIT OpenCourseWareClustering MethodsTheoretical foundations, complexity analysis

Interview Q&A

Q1: What is the key difference between agglomerative clustering and K-means? When would you choose one over the other?

K-means requires K specified upfront, uses random initialization (nondeterministic), produces a single flat partition, and assumes spherical clusters. Agglomerative clustering needs no K (you choose granularity by cutting the dendrogram), is deterministic, produces a full hierarchy, and - with the right linkage - can discover non-spherical clusters. Choose agglomerative when: (1) you don't know K and want to explore, (2) you need a hierarchy (e.g., product taxonomy, phylogenetic tree), (3) your dataset is small enough (n50,000n \lesssim 50{,}000). Choose K-means when: (1) you have millions of points, (2) you know K approximately, (3) speed matters. In practice, a hybrid approach works well: K-means on 500 centroids, then Ward hierarchical on the 500 centroids.

Q2: When would you choose single linkage vs Ward linkage? What are the risks of each?

Single linkage is the distance between the closest pair of points across two clusters. It can discover elongated, filamentary, and non-convex clusters - it "chains" through a manifold. The risk is the chaining effect: a single bridging point between two large groups causes them to merge at a low height, even though they are actually well-separated globally. Ward linkage minimizes the increase in within-cluster sum of squares at each merge - it produces compact, roughly equal-sized, well-separated clusters. The risk is it fails on non-spherical clusters; crescent or ring-shaped clusters will be cut incorrectly. Rule of thumb: Ward for compact clusters (the default in 95% of cases), single for chains/manifolds.

Q3: How do you determine the optimal number of clusters from a dendrogram?

Look for the largest vertical gap (largest jump in merge height) on the dendrogram y-axis. A large gap means the two clusters being merged are very dissimilar - they should stay separate. Cutting just below this gap gives the most "natural" flat clustering. Programmatically: gaps = np.diff(Z[:, 2]), then K = len(Z) - np.argmax(gaps). You can also look at the top-3 gaps to see if there are multiple plausible cut points, corresponding to different levels of granularity. Supplement with the silhouette score: compute it for K from 2 to 10 by cutting the dendrogram at each level, and pick the K with the highest silhouette.

Q4: Hierarchical clustering is O(n2)O(n^2) in space. How would you apply it to 10 million points?

Running standard agglomerative clustering on 10M points is infeasible - the distance matrix alone requires (107)2×8(10^7)^2 \times 8 bytes = 800TB. The standard approaches: (1) Two-stage: run MiniBatch K-means to get 1,000–5,000 representative centroids, then run Ward hierarchical on the centroids. The 1,000-centroid matrix is trivially small. Map original points back through K-means cluster assignments. (2) BIRCH: pre-compresses data into a compact CF-tree and applies hierarchical clustering to the tree's leaf nodes. Scales to millions. (3) Sparse connectivity: build a k-NN graph (sparse, not full), and restrict merges to adjacent clusters. sklearn's AgglomerativeClustering(connectivity=...) implements this - reduces effective O(n2)O(n^2) to O(nk)O(n \cdot k).

Q5: What is the cophenetic correlation coefficient and what does it tell you?

The cophenetic correlation measures how faithfully a dendrogram preserves the original pairwise distances. For each pair of points, the cophenetic distance is the height at which they first merge in the dendrogram. The coefficient is the Pearson correlation between all original distances and all cophenetic distances. Range: 0 to 1. A value above 0.75 indicates the dendrogram is a good representation of the data's distance structure - cutting it will produce meaningful clusters. Below 0.5, the dendrogram is misleading and you should try a different linkage criterion. Use it to compare linkage methods: run cophenet() from scipy for ward, complete, average, and single, then choose the one with the highest coefficient.

Q6: Your team is building a gene expression clustering pipeline for 800 tumor samples. Which linkage criterion would you choose and why?

Average linkage (UPGMA) - the standard in bioinformatics. Ward requires Euclidean distance, but gene expression data is typically analyzed using correlation distance (1rij1 - r_{ij}) because correlation captures co-expression patterns regardless of scale. Single linkage is too susceptible to noise (individual outlier cells produce chaining). Complete linkage is sensitive to outlier genes. Average linkage with correlation distance is the classic choice in publications like the original Eisen et al. (1998) paper that introduced cluster/heatmap analysis in genomics. In scipy: first compute 1 - np.corrcoef(expression_matrix) to get a correlation distance matrix, convert to condensed form with squareform, then run linkage(dist_condensed, method='average').

Inconsistency Criterion: Automatic Cut Height

Beyond looking for the largest gap manually, scipy provides the inconsistency criterion for automatically selecting a cut height. For each merge in the dendrogram, the inconsistency coefficient measures how different the merge height is from the average height of merges below it in the same subtree:

inconsistency(m)=h(m)hˉ(m)σ(m)\text{inconsistency}(m) = \frac{h(m) - \bar{h}(m)}{\sigma(m)}

where h(m)h(m) is the height of merge mm, and hˉ(m)\bar{h}(m) and σ(m)\sigma(m) are the mean and standard deviation of heights in the subtree rooted at mm (looking dd levels deep).

A high inconsistency coefficient at merge mm means the merge at that step is much larger than the merges just below it - a strong signal that the two clusters being merged are genuinely distinct.

from scipy.cluster.hierarchy import linkage, inconsistent, fcluster
from scipy.spatial.distance import pdist
import numpy as np
import matplotlib.pyplot as plt

def auto_cut_dendrogram(X: np.ndarray, method: str = 'ward',
depth: int = 2, threshold: float = 1.5) -> np.ndarray:
"""
Automatically cut a dendrogram using the inconsistency criterion.

depth: number of levels to look below each merge when computing inconsistency
threshold: inconsistency score above which to cut (1.5 is a common default)

Returns: cluster labels for each sample.
"""
Z = linkage(X, method=method)

# Compute inconsistency matrix: (n-1, 4) - [mean, std, count, inconsistency]
incons = inconsistent(Z, d=depth)
inconsistency_scores = incons[:, 3]

print(f"Inconsistency score range: "
f"[{inconsistency_scores.min():.3f}, {inconsistency_scores.max():.3f}]")

# fcluster with 'inconsistent' criterion:
# cuts the tree where the inconsistency score exceeds the threshold
labels = fcluster(Z, t=threshold, criterion='inconsistent', depth=depth)

n_clusters = len(np.unique(labels))
print(f"Auto-selected K = {n_clusters} (threshold={threshold})")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Inconsistency scores along the merge sequence
axes[0].plot(range(len(inconsistency_scores)), inconsistency_scores,
'o-', ms=3, color='#3b82f6')
axes[0].axhline(y=threshold, color='#dc2626', linestyle='--',
label=f'Threshold={threshold}')
axes[0].set_xlabel("Merge step")
axes[0].set_ylabel("Inconsistency coefficient")
axes[0].set_title("Inconsistency Criterion")
axes[0].legend()

# Scatter with auto-selected clusters
if X.shape[1] == 2:
axes[1].scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=20, alpha=0.7)
axes[1].set_title(f"Auto-cut: K={n_clusters}")
axes[1].axis('off')

plt.tight_layout()
plt.show()

return labels


# Usage
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.9, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

labels_auto = auto_cut_dendrogram(X_demo, method='ward', depth=2, threshold=1.5)

Evaluating Cluster Quality Without Labels

When you have no ground truth labels, use internal cluster quality metrics:

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np
import matplotlib.pyplot as plt

def evaluate_hierarchical_clustering(X: np.ndarray,
method: str = 'ward',
k_range: range = range(2, 11)) -> dict:
"""
Evaluate hierarchical clustering at multiple K values using internal metrics.

Silhouette: higher is better (range: -1 to 1)
Calinski-Harabasz: higher is better (no upper bound)
Davies-Bouldin: lower is better (lower = more compact, well-separated clusters)
"""
Z = linkage(X, method=method)

results = {'K': [], 'silhouette': [], 'calinski': [], 'davies_bouldin': []}

for K in k_range:
labels = fcluster(Z, t=K, criterion='maxclust')

sil = silhouette_score(X, labels)
cal = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)

results['K'].append(K)
results['silhouette'].append(sil)
results['calinski'].append(cal)
results['davies_bouldin'].append(db)

print(f"K={K:2d} Silhouette={sil:.4f} "
f"Calinski-Harabasz={cal:.1f} Davies-Bouldin={db:.4f}")

# Plot
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].plot(results['K'], results['silhouette'], 'o-', color='#3b82f6', ms=6)
axes[0].set_xlabel("K")
axes[0].set_ylabel("Silhouette Score")
axes[0].set_title("Silhouette (higher=better)")

axes[1].plot(results['K'], results['calinski'], 'o-', color='#16a34a', ms=6)
axes[1].set_xlabel("K")
axes[1].set_ylabel("Calinski-Harabasz Score")
axes[1].set_title("Calinski-Harabasz (higher=better)")

axes[2].plot(results['K'], results['davies_bouldin'], 'o-', color='#dc2626', ms=6)
axes[2].set_xlabel("K")
axes[2].set_ylabel("Davies-Bouldin Score")
axes[2].set_title("Davies-Bouldin (lower=better)")

for ax in axes:
ax.grid(True, alpha=0.3)

plt.suptitle(f"Cluster Quality Metrics - {method} linkage", fontsize=13)
plt.tight_layout()
plt.show()

# Select best K by silhouette
best_K = results['K'][int(np.argmax(results['silhouette']))]
print(f"\nBest K by silhouette: {best_K}")

return results


# Run evaluation
results = evaluate_hierarchical_clustering(X_demo, method='ward', k_range=range(2, 11))

When metrics disagree: Use the silhouette score as the primary metric - it directly measures how well each point fits its assigned cluster vs the nearest neighboring cluster. Calinski-Harabasz tends to favor large K values when clusters vary in size. Davies-Bouldin is sensitive to outliers. When all three metrics agree on the same K, it is a strong signal.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Hierarchical Clustering Dendrogram demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.