DBSCAN and Density-Based Clustering
The Real Interview Moment
Your fraud detection team receives GPS coordinates for every card transaction - latitude, longitude, transaction amount, and time of day. You need to flag unusual transaction locations automatically, without knowing how many "home clusters" each user has. The clusters (cities where a user regularly transacts) have arbitrary shapes: urban sprawl, commute corridors, vacation patterns. They vary vastly in density - Manhattan has thousands of transactions in a square mile, rural Iowa has one per hundred square miles. And you expect noise: airport transactions, one-off hotel payments, road trips.
You bring K-means to the first team meeting. An engineer asks: "What K do we use per user?" You freeze. A user who lives in New York but travels to Chicago monthly needs K=2. A user who lives in one city their whole life needs K=1. A traveler needs K=8. K-means requires you to specify K in advance - but the number of legitimate clusters varies per user and is unknown. Worse, K-means will force every outlier transaction into a cluster, even the fraudulent midnight charge in Moscow.
DBSCAN handles all of this naturally. It identifies high-density regions as clusters without requiring K to be specified. It labels low-density isolated points as noise - directly interpretable as anomalies. It makes no assumptions about cluster shapes or sizes. And it extends naturally to HDBSCAN, which handles the variable-density problem between Manhattan and rural Iowa.
This lesson covers the full density-based clustering family - DBSCAN, OPTICS, HDBSCAN, and Mean Shift - with complete implementations, parameter selection strategies, and production anomaly detection patterns.
Why K-Means Fails Here: The Shape Problem
Before DBSCAN (1996), all major clustering algorithms assumed clusters had regular, convex geometry. K-means minimizes within-cluster sum of squares, which produces Voronoi-cell boundaries - always convex, always linear. The assignment rule is "nearest centroid," which means every cluster is a hyperball around its center.
This fails in three specific ways that matter in practice:
Non-convex shapes: Geographic clusters follow roads, coastlines, and city boundaries - crescent shapes, elongated corridors, concentric rings. K-means cuts through them with straight lines because it cannot represent non-convex cluster boundaries.
No noise handling: K-means assigns every point to a cluster. A transaction in the middle of the Pacific Ocean still gets assigned to "cluster 3" because that centroid is nearest. There is no concept of an outlier.
Uniform density assumption: K-means assumes clusters have roughly equal density because it uses a global ε-equivalent (the centroid distance) to decide membership. It cannot simultaneously capture both a dense urban cluster and a sparse rural cluster.
DBSCAN solves all three by redefining what a "cluster" means: not a region around a centroid, but a maximally connected dense region. Two points belong to the same cluster if there is a chain of dense neighborhoods connecting them - regardless of path shape.
Historical Context
1996: Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu publish "A density-based algorithm for discovering clusters in large spatial databases with noise" at KDD. The paper introduces DBSCAN and the three-way classification of points as core, border, and noise. It won the KDD 2014 Test of Time Award - rare recognition that an algorithm aged extremely well.
1999: The same group publishes OPTICS (Ordering Points To Identify the Clustering Structure), which addresses DBSCAN's inability to handle clusters of varying density by building an ordered reachability plot.
2017: Leland McInnes and John Healy publish HDBSCAN (Hierarchical DBSCAN), which extends OPTICS with a principled cluster extraction strategy based on stability. This is now the recommended default for density-based clustering in most applications.
The key insight in DBSCAN: instead of defining clusters by their central points (K-means, GMM), define them by density connectivity - a point belongs to a cluster if it is reachable through a chain of dense regions. This single shift in definition unlocks arbitrary cluster shapes, natural noise handling, and no requirement to specify K.
DBSCAN: Core Concepts
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions of points separated by sparse regions. It requires two parameters:
- ε (epsilon): the neighborhood radius. A point's ε-neighborhood is the set of all points within distance ε from .
- MinPts (min_samples): the minimum number of points required for a region to be considered "dense."
Three Point Types
Every point is classified as exactly one of three types:
Core point: A point is a core point if it has at least MinPts points within radius ε, including itself:
Core points are the "interior" of clusters - they are surrounded by sufficient density to form the skeleton of a cluster.
Border point: A point is a border point if it is within distance ε of a core point (directly reachable from a core), but does not itself have MinPts neighbors. Border points are on the edge of clusters - they belong to a cluster but cannot propagate it further.
Noise point: A point that is neither a core point nor within distance ε of any core point. These are true outliers - DBSCAN labels them −1. They are isolated, low-density points that do not belong to any cluster.
Visualization: ε = 1.5, MinPts = 3
Points: Classification:
* * * C C C (core points - 3+ neighbors each)
* . * . * C B C B C (. = border, C = core)
* * * C C C
* N (isolated → noise)
* * * C C C (second cluster, density-connected)
* * * C C C
Density-Reachability and Connectivity
DBSCAN builds clusters through the concept of density-connectivity:
Directly density-reachable: is directly density-reachable from core point if . Note: this is not symmetric - must be a core point, but need not be.
Density-reachable: is density-reachable from if there exists a chain where , , and each is directly density-reachable from core point .
Density-connected: and are density-connected if there exists a point such that both and are density-reachable from .
A cluster is a maximal set of density-connected points. This definition allows clusters of any shape - any connected dense region forms one cluster, regardless of its geometry. Two elongated arcs that share a bridge of dense points will be merged into one cluster.
The DBSCAN Algorithm
DBSCAN(X, ε, MinPts):
cluster_id = 0
labels = [-1] * len(X) # -1 = unvisited / noise
for each point p in X:
if labels[p] != -1: continue # already assigned
neighbors = RangeQuery(X, p, ε) # all points within ε of p
if len(neighbors) < MinPts:
labels[p] = -1 # mark as noise (may be relabeled later)
continue
# p is a core point - start new cluster
cluster_id += 1
labels[p] = cluster_id
seed_set = neighbors - {p} # points to expand from
while seed_set is not empty:
q = seed_set.pop()
if labels[q] == -1:
labels[q] = cluster_id # relabel noise as border point
if labels[q] != -1: continue # already in a cluster
labels[q] = cluster_id
q_neighbors = RangeQuery(X, q, ε)
if len(q_neighbors) >= MinPts: # q is also a core point
seed_set = seed_set ∪ q_neighbors # expand from q
return labels
The key property: when the algorithm reaches a core point, it floods all density-reachable points into the same cluster via the seed_set. Border points are absorbed when a core point's neighborhood includes them, but they do not themselves expand the cluster.
Time Complexity Analysis
The bottleneck is the RangeQuery - finding all points within distance ε of a given point.
Naive approach (O(n²)): For each of points, scan all points and check distance. Total: .
With spatial index (O(n log n)): Build a ball tree or k-d tree on the data. Each range query runs in amortized. Since we perform at most range queries, total cost is .
The spatial index is critical for performance. Sklearn's DBSCAN automatically selects ball_tree for low-dimensional data and brute force for high dimensions or small datasets.
High-dimensional degradation: Spatial indices lose effectiveness when - in high dimensions, all points are approximately equidistant (curse of dimensionality) and the index provides little speedup. For high-dimensional data, apply dimensionality reduction (PCA, UMAP) before DBSCAN.
NumPy DBSCAN Implementation from Scratch
import numpy as np
from collections import deque
def range_query(X: np.ndarray, point_idx: int, eps: float) -> list:
"""
Find all points within Euclidean distance eps of X[point_idx].
O(n) naive scan - for production use a BallTree or KD-Tree.
"""
point = X[point_idx]
# Vectorized distance computation
diffs = X - point # (n, d)
sq_dists = np.sum(diffs ** 2, axis=1) # (n,)
return list(np.where(sq_dists <= eps ** 2)[0])
def dbscan_scratch(X: np.ndarray, eps: float,
min_samples: int) -> np.ndarray:
"""
DBSCAN from scratch using the flooding algorithm.
Parameters
----------
X : (n, d) data array
eps : neighborhood radius
min_samples : minimum points for a core region (including self)
Returns
-------
labels : (n,) array, -1 = noise, 0..K-1 = cluster ids
"""
n = len(X)
labels = np.full(n, -2, dtype=int) # -2 = unvisited
cluster_id = -1
for i in range(n):
if labels[i] != -2:
continue # already processed
neighbors = range_query(X, i, eps)
if len(neighbors) < min_samples:
labels[i] = -1 # noise point (may be relabeled)
continue
# Point i is a core point - start new cluster
cluster_id += 1
labels[i] = cluster_id
# BFS expansion
seed_set = deque(neighbors)
while seed_set:
j = seed_set.popleft()
if labels[j] == -1:
# Was noise - relabel as border of this cluster
labels[j] = cluster_id
if labels[j] != -2:
continue # already in a cluster
labels[j] = cluster_id
j_neighbors = range_query(X, j, eps)
if len(j_neighbors) >= min_samples:
# j is also a core point - expand from it
seed_set.extend(j_neighbors)
# Remaining -2 (unvisited) should not exist - all points processed
assert (labels == -2).sum() == 0, "Some points were not visited!"
return labels
# Test on make_moons
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
X_moons, y_true = make_moons(n_samples=300, noise=0.08, random_state=42)
X_scaled = StandardScaler().fit_transform(X_moons)
labels_scratch = dbscan_scratch(X_scaled, eps=0.3, min_samples=5)
n_clusters = labels_scratch.max() + 1
n_noise = (labels_scratch == -1).sum()
print(f"From scratch: {n_clusters} clusters, {n_noise} noise points")
# Verify against sklearn
from sklearn.cluster import DBSCAN
labels_sklearn = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
print(f"Sklearn: {labels_sklearn.max()+1} clusters, "
f"{(labels_sklearn==-1).sum()} noise points")
Sklearn DBSCAN with Parameter Tuning
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_blobs, make_circles
from sklearn.preprocessing import StandardScaler
# --- Example 1: Non-spherical clusters (K-means fails, DBSCAN wins) ---
X_moons, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5, metric='euclidean', algorithm='ball_tree')
labels = dbscan.fit_predict(X_moons)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise} ({100*n_noise/len(labels):.1f}%)")
print(f"Core sample count: {len(dbscan.core_sample_indices_)}")
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Moons
colors = ['red' if l == -1 else plt.cm.tab10(l/10) for l in labels]
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=colors, s=30, alpha=0.7)
axes[0].set_title(f"DBSCAN on Moons - {n_clusters} clusters, {n_noise} noise")
# Concentric circles
X_circles, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
labels_circles = DBSCAN(eps=0.15, min_samples=5).fit_predict(X_circles)
n_clusters_circles = len(set(labels_circles)) - (1 if -1 in labels_circles else 0)
colors_circles = ['red' if l == -1 else plt.cm.tab10(l/10) for l in labels_circles]
axes[1].scatter(X_circles[:, 0], X_circles[:, 1], c=colors_circles, s=30, alpha=0.7)
axes[1].set_title(f"DBSCAN on Circles - {n_clusters_circles} clusters")
# Blobs with outliers
X_blobs, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=42)
outliers = np.random.default_rng(42).uniform(-8, 8, size=(30, 2))
X_with_outliers = np.vstack([X_blobs, outliers])
labels_blobs = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_with_outliers)
n_noise_blobs = (labels_blobs == -1).sum()
colors_blobs = ['red' if l == -1 else plt.cm.tab10(l/10) for l in labels_blobs]
axes[2].scatter(X_with_outliers[:, 0], X_with_outliers[:, 1],
c=colors_blobs, s=30, alpha=0.7)
axes[2].set_title(f"DBSCAN on Blobs+Outliers - {n_noise_blobs} noise (red)")
plt.suptitle("DBSCAN on Various Cluster Shapes", fontsize=14)
plt.tight_layout()
plt.show()
Choosing ε and MinPts: The k-Distance Graph
This is the central practical challenge with DBSCAN. Poor parameter choices lead to degenerate results: too-small ε makes everything noise; too-large ε merges all clusters into one.
Choosing MinPts
Rule of thumb: MinPts ≥ dimensionality + 1. For 2D data: MinPts = 3. For 10D data: MinPts ≥ 11. For noisy data, use higher values (10–20) to require denser regions.
The intuition: MinPts controls the minimum "resolution" of a cluster. Higher MinPts means you only identify very dense regions as cluster cores. This filters out low-density clusters but also makes the algorithm more robust to noise.
A secondary rule: MinPts = 2 always works (any pair of mutually ε-close points forms a cluster), but produces poor results. The practical recommendation is MinPts = 2 × dimensionality.
Choosing ε: The k-Distance Plot
The best approach for ε selection is the k-distance plot (also called the k-NN elbow plot):
- For each point, compute the distance to its k-th nearest neighbor (where = MinPts − 1)
- Sort these distances in descending order
- Plot the sorted distances
- Look for the "elbow" - the point where the sorted distances make a sharp bend
Points below the elbow have their k-th neighbor within a dense cluster. Points above the elbow are in sparse regions (noise). The ε value at the elbow is the natural density threshold.
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
def plot_knn_distances(X: np.ndarray, min_samples: int = 5,
n_candidates: int = 5):
"""
Plot the k-NN distance graph for ε selection.
The elbow in this plot is a good starting value for ε.
Parameters
----------
X : (n, d) data array (should be standardized first)
min_samples : the MinPts parameter you plan to use
n_candidates : number of candidate ε values to suggest
Returns
-------
k_distances : sorted k-NN distances
suggested_eps : estimated elbow value
"""
k = min_samples - 1 # k-th nearest neighbor distance
nbrs = NearestNeighbors(n_neighbors=k + 1, algorithm='ball_tree')
nbrs.fit(X)
distances, _ = nbrs.kneighbors(X)
# Distance to k-th nearest neighbor (col k = k-th neighbor, col 0 = self)
k_distances = np.sort(distances[:, k])[::-1] # descending
# Estimate elbow via maximum curvature
# Second derivative of sorted k-distances
second_deriv = np.diff(np.diff(k_distances))
elbow_idx = np.argmax(np.abs(second_deriv)) + 1
suggested_eps = k_distances[elbow_idx]
plt.figure(figsize=(10, 5))
plt.plot(k_distances, lw=2, color='steelblue')
plt.axvline(x=elbow_idx, color='orange', linestyle='--', alpha=0.7)
plt.axhline(y=suggested_eps, color='red', linestyle='--',
label=f'Suggested ε ≈ {suggested_eps:.3f}')
plt.xlabel("Points (sorted by k-NN distance, descending)")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.title(f"k-Distance Plot (k={k}, MinPts={min_samples})\n"
f"Elbow → suggested ε = {suggested_eps:.3f}")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Suggest a range of candidates around the elbow
candidates = np.percentile(k_distances, np.linspace(5, 25, n_candidates))
print(f"Candidate ε values: {candidates.round(3)}")
print(f"Elbow estimate: {suggested_eps:.4f}")
return k_distances, suggested_eps
# Always standardize before running DBSCAN
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_moons)
k_dists, eps_candidate = plot_knn_distances(X_scaled, min_samples=5)
Silhouette-Based Grid Search for ε
When the k-distance plot gives an ambiguous elbow, grid-search over candidate ε values:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import numpy as np
def tune_dbscan(X: np.ndarray, min_samples: int = 5,
eps_range: np.ndarray = None):
"""
Grid search over ε values, scored by silhouette on non-noise points.
"""
if eps_range is None:
eps_range = np.linspace(0.1, 1.5, 30)
results = []
for eps in eps_range:
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_frac = (labels == -1).mean()
# Only score if we have at least 2 clusters and not too much noise
if n_clusters < 2 or noise_frac > 0.5:
results.append({'eps': eps, 'n_clusters': n_clusters,
'noise_frac': noise_frac, 'silhouette': np.nan})
continue
# Compute silhouette only on non-noise points
mask = labels != -1
if mask.sum() < 2 * n_clusters: # too few points per cluster
results.append({'eps': eps, 'n_clusters': n_clusters,
'noise_frac': noise_frac, 'silhouette': np.nan})
continue
sil = silhouette_score(X[mask], labels[mask])
results.append({'eps': eps, 'n_clusters': n_clusters,
'noise_frac': noise_frac, 'silhouette': sil})
import pandas as pd
results_df = pd.DataFrame(results)
valid = results_df.dropna(subset=['silhouette'])
if len(valid) == 0:
print("No valid ε found - try a different range or min_samples")
return results_df, None
best = valid.loc[valid['silhouette'].idxmax()]
print(f"Best ε = {best['eps']:.3f}: "
f"{int(best['n_clusters'])} clusters, "
f"{best['noise_frac']:.1%} noise, "
f"silhouette = {best['silhouette']:.4f}")
return results_df, best['eps']
results, best_eps = tune_dbscan(X_scaled, min_samples=5,
eps_range=np.linspace(0.05, 0.8, 40))
DBSCAN as an Anomaly Detector
Points labeled −1 (noise) are anomalies by definition - they do not belong to any dense cluster. This makes DBSCAN a natural unsupervised anomaly detection tool, requiring no labeled anomaly data.
The workflow: fit DBSCAN on normal operational data. Any new point that falls outside all existing dense clusters (distance to nearest core point exceeds ε) is flagged as an anomaly.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
# Simulate transaction data: lat, lon, amount, hour
np.random.seed(42)
n_legit = 2000
n_fraud = 40
# Legitimate transactions cluster around home city (NYC area)
legitimate = np.column_stack([
np.random.normal(40.73, 0.06, n_legit), # lat: NYC
np.random.normal(-73.99, 0.06, n_legit), # lon: NYC
np.random.exponential(60, n_legit), # amount: typical purchases
np.random.choice(range(8, 22), n_legit), # hour: business hours
])
# Add a second cluster (user travels to Chicago monthly)
chicago_visits = np.column_stack([
np.random.normal(41.88, 0.04, 100),
np.random.normal(-87.63, 0.04, 100),
np.random.exponential(80, 100),
np.random.choice(range(8, 22), 100),
])
legitimate = np.vstack([legitimate, chicago_visits])
n_legit = len(legitimate)
# Fraudulent: random US locations, unusual amounts, late night
fraud = np.column_stack([
np.random.uniform(30, 48, n_fraud),
np.random.uniform(-120, -70, n_fraud),
np.random.exponential(800, n_fraud), # large unusual amounts
np.random.choice(range(0, 5), n_fraud), # late night: 12am-5am
])
X = np.vstack([legitimate, fraud])
y_true = np.array([0]*n_legit + [1]*n_fraud) # 0=legit, 1=fraud
# Scale: all features to comparable range
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Run DBSCAN
db = DBSCAN(eps=0.5, min_samples=10, algorithm='ball_tree')
labels = db.fit_predict(X_scaled)
# Noise points (-1) are anomaly candidates
predicted_anomalies = (labels == -1).astype(int)
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
print(f"Precision: {precision_score(y_true, predicted_anomalies):.3f}")
print(f"Recall: {recall_score(y_true, predicted_anomalies):.3f}")
print(f"F1: {f1_score(y_true, predicted_anomalies):.3f}")
print(f"\nConfusion matrix:")
print(confusion_matrix(y_true, predicted_anomalies))
Production Anomaly Detection Pipeline
The key insight for production: fit DBSCAN offline to learn the "normal" cluster structure, then classify new points at inference time by their distance to existing core points - without re-running the full DBSCAN.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
import numpy as np
class DBSCANAnomalyDetector:
"""
Unsupervised anomaly detector using DBSCAN.
Fit on normal baseline data; score new observations by proximity to core points.
"""
def __init__(self, eps: float = 0.5, min_samples: int = 10):
self.eps = eps
self.min_samples = min_samples
self.scaler = StandardScaler()
self._dbscan = DBSCAN(eps=eps, min_samples=min_samples,
algorithm='ball_tree')
self._core_nn = None
self.is_fitted = False
def fit(self, X_normal: np.ndarray) -> 'DBSCANAnomalyDetector':
"""
Learn the density structure of normal data.
Parameters
----------
X_normal : (n, d) array of normal (non-anomalous) observations
"""
X_scaled = self.scaler.fit_transform(X_normal)
self._dbscan.fit(X_scaled)
# Extract core sample positions for fast inference
core_indices = self._dbscan.core_sample_indices_
if len(core_indices) == 0:
raise ValueError(
"No core points found - try increasing eps or decreasing min_samples"
)
self._X_core = X_scaled[core_indices] # (n_core, d)
self._core_nn = NearestNeighbors(
n_neighbors=1, algorithm='ball_tree', metric='euclidean'
)
self._core_nn.fit(self._X_core)
self.is_fitted = True
print(f"Fitted: {len(X_normal)} training points, "
f"{len(core_indices)} core points, "
f"{int(self._dbscan.labels_.max()) + 1} clusters")
return self
def anomaly_score(self, X: np.ndarray) -> np.ndarray:
"""
Return anomaly scores: distance to nearest core point.
Higher score = more anomalous.
Score > eps → anomaly; Score <= eps → normal.
"""
assert self.is_fitted, "Call fit() first"
X_scaled = self.scaler.transform(X)
distances, _ = self._core_nn.kneighbors(X_scaled)
return distances[:, 0] # (n,)
def predict(self, X: np.ndarray,
threshold: float = None) -> np.ndarray:
"""
Return predictions: -1 = anomaly, 0 = normal.
Default threshold = eps (same as DBSCAN definition).
"""
threshold = threshold or self.eps
scores = self.anomaly_score(X)
return np.where(scores > threshold, -1, 0)
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""
Return anomaly probability as sigmoid of (score - eps) / eps.
"""
scores = self.anomaly_score(X)
# Map: score=0 → prob≈0, score=eps → prob≈0.5, score=2*eps → prob≈0.73
z = (scores - self.eps) / (self.eps + 1e-8)
return 1 / (1 + np.exp(-z))
# Usage
detector = DBSCANAnomalyDetector(eps=0.5, min_samples=10)
detector.fit(legitimate[:1000]) # fit on training set only
# Score new transactions - high score = anomaly
test_legit = legitimate[1000:1100]
test_fraud = fraud[:20]
X_test = np.vstack([test_legit, test_fraud])
y_test = np.array([0]*100 + [1]*20)
scores = detector.anomaly_score(X_test)
labels_pred = detector.predict(X_test)
from sklearn.metrics import roc_auc_score
print(f"ROC-AUC: {roc_auc_score(y_test, scores):.3f}")
print(f"Precision: {precision_score(y_test, labels_pred == -1):.3f}")
print(f"Recall: {recall_score(y_test, labels_pred == -1):.3f}")
OPTICS: Ordering Points to Identify Clustering Structure
OPTICS (Kriegel et al., 1999) extends DBSCAN to handle clusters of varying density - the core limitation of DBSCAN. Instead of committing to a single ε, OPTICS builds a reachability plot that encodes the hierarchical density structure at all scales simultaneously.
Core Concepts
Core distance of point : the distance to its MinPts-th nearest neighbor. If has fewer than MinPts neighbors within any radius, its core distance is undefined (it can never be a core point).
Reachability distance from core point to point :
This is the maximum of the distance between and , and the core distance of . It smooths over the density of the origin point.
OPTICS processes points in order of increasing reachability distance and produces a reachability plot. The plot has valleys (low reachability) corresponding to dense clusters and peaks (high reachability) corresponding to sparse regions between clusters. Clusters can be extracted by finding valleys at any horizontal threshold - this is equivalent to running DBSCAN at that ε level.
from sklearn.cluster import OPTICS
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Create data with three clusters of different densities
centers = [[0, 0], [5, 5], [10, 0]]
X_varied, _ = make_blobs(
n_samples=[100, 50, 200], # different sizes
centers=centers,
cluster_std=[0.3, 0.8, 0.5], # different densities
random_state=42
)
optics = OPTICS(
min_samples=5,
max_eps=np.inf, # no upper bound on ε (full hierarchy)
metric='euclidean',
cluster_method='xi', # extract clusters by xi-steep descent
xi=0.05 # minimum steepness threshold
)
optics.fit(X_varied)
labels_optics = optics.labels_
n_clusters = labels_optics.max() + 1
print(f"OPTICS found {n_clusters} clusters, "
f"{(labels_optics == -1).sum()} noise points")
# Reachability plot - the key visualization for OPTICS
reachability = optics.reachability_[optics.ordering_]
cluster_colors = ['steelblue', 'tomato', 'forestgreen', 'purple']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Reachability plot
for klass in range(n_clusters):
Xk = reachability[labels_optics[optics.ordering_] == klass]
axes[0].plot(Xk, '.', ms=4,
color=cluster_colors[klass % len(cluster_colors)])
Xn = reachability[labels_optics[optics.ordering_] == -1]
axes[0].plot(Xn, 'k.', ms=2, alpha=0.3, label='Noise')
axes[0].set_ylabel("Reachability Distance")
axes[0].set_title("OPTICS Reachability Plot\nValleys = clusters")
axes[0].legend()
# Cluster scatter
colors_optics = [cluster_colors[l % len(cluster_colors)]
if l != -1 else 'black'
for l in labels_optics]
axes[1].scatter(X_varied[:, 0], X_varied[:, 1],
c=colors_optics, s=20, alpha=0.7)
axes[1].set_title("OPTICS Cluster Result - handles varying density")
plt.tight_layout()
plt.show()
HDBSCAN: Hierarchical DBSCAN
HDBSCAN (McInnes & Healy, 2017) is the modern state-of-the-art for density-based clustering. It extends OPTICS with a rigorous mathematical framework for cluster extraction based on cluster stability.
How HDBSCAN Works
Step 1 - Transform the space: Compute the mutual reachability distance:
This transformation makes the space more uniform, reducing the impact of density variations.
Step 2 - Build a minimum spanning tree: Compute the minimum spanning tree (MST) of the complete graph with mutual reachability distances as edge weights. This is naively but with Prim's algorithm on a k-NN graph.
Step 3 - Build a cluster hierarchy: Sort the MST edges by decreasing weight. Remove edges one at a time. Each removal splits a component - this is equivalent to running DBSCAN at every possible ε simultaneously. The result is a dendrogram.
Step 4 - Extract stable clusters: Assign each cluster a stability score :
where (inverse epsilon), is when the cluster formed, and is when the point left the cluster (or the cluster split). Clusters with high stability are preserved; those whose children are more stable are replaced by their children.
This gives HDBSCAN a critical advantage: it automatically selects the scale at which to extract each cluster, enabling it to find clusters at different density levels simultaneously.
# pip install hdbscan
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Data with very different densities
X1, _ = make_blobs(n_samples=200, centers=[[0,0]], cluster_std=0.2, random_state=42)
X2, _ = make_blobs(n_samples=200, centers=[[5,5]], cluster_std=1.2, random_state=42)
X_varying = np.vstack([X1, X2])
# DBSCAN: must choose a single eps - will either miss X2 or merge X1 into noise
labels_dbscan = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_varying)
print(f"DBSCAN: {labels_dbscan.max()+1} clusters, "
f"{(labels_dbscan==-1).sum()} noise")
# HDBSCAN: handles both densities automatically
clusterer = hdbscan.HDBSCAN(
min_cluster_size=15, # smallest cluster to extract
min_samples=5, # density threshold (default = min_cluster_size)
cluster_selection_method='eom', # excess of mass (default, recommended)
metric='euclidean',
prediction_data=True # enables soft clustering and new-point prediction
)
clusterer.fit(X_varying)
labels_hdb = clusterer.labels_
print(f"HDBSCAN: {labels_hdb.max()+1} clusters, "
f"{(labels_hdb==-1).sum()} noise")
# HDBSCAN provides rich outputs beyond labels:
print(f"\nCluster membership probabilities (first 5):")
print(clusterer.probabilities_[:5].round(3)) # how confidently assigned
print(f"\nOutlier scores (first 5, higher = more anomalous):")
print(clusterer.outlier_scores_[:5].round(3))
# Top 10 most anomalous points
top_anomalies = np.argsort(clusterer.outlier_scores_)[-10:]
print(f"\nTop anomaly indices: {top_anomalies}")
# Predict new points (requires prediction_data=True)
X_new = np.random.randn(5, 2) * 2 + 2.5
new_labels, new_probs = hdbscan.approximate_predict(clusterer, X_new)
print(f"\nNew point labels: {new_labels}")
print(f"New point probs: {new_probs.round(3)}")
HDBSCAN Parameter Guide
# The main parameter is min_cluster_size
# Rule of thumb: set to the smallest cluster size you care about
# Small min_cluster_size (5-10): many small clusters, more noise
clusterer_fine = hdbscan.HDBSCAN(min_cluster_size=5)
clusterer_fine.fit(X_varying)
# Large min_cluster_size (50-100): fewer, larger clusters
clusterer_coarse = hdbscan.HDBSCAN(min_cluster_size=50)
clusterer_coarse.fit(X_varying)
# min_samples (optional): controls how conservative core point definition is
# Higher min_samples → fewer core points → more noise (more conservative)
clusterer_conservative = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=10)
clusterer_conservative.fit(X_varying)
# cluster_selection_method:
# 'eom' (excess of mass): default, finds compact clusters
# 'leaf': always takes the leaf clusters - produces many small clusters
clusterer_leaf = hdbscan.HDBSCAN(min_cluster_size=15,
cluster_selection_method='leaf')
clusterer_leaf.fit(X_varying)
for name, c in [('fine', clusterer_fine),
('coarse', clusterer_coarse),
('conservative', clusterer_conservative),
('leaf', clusterer_leaf)]:
n_cl = c.labels_.max() + 1
n_noise = (c.labels_ == -1).sum()
print(f"{name:14s}: {n_cl} clusters, {n_noise} noise")
Mean Shift Clustering
Mean Shift is a kernel density estimation approach to clustering that finds modes (local maxima) of the data density. Unlike DBSCAN, it requires no MinPts parameter - only a bandwidth (analogous to ε but for the kernel).
Algorithm
For each point :
-
Compute the weighted mean of all points within bandwidth : where is a kernel function (typically Gaussian or flat/uniform).
-
Update - shift the point toward the weighted mean.
-
Repeat until convergence. Each point converges to a local density maximum.
-
Points that converge to the same mode are assigned to the same cluster.
from sklearn.cluster import MeanShift, estimate_bandwidth
import numpy as np
from sklearn.datasets import make_blobs
X_ms, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)
# Estimate bandwidth using the rule of thumb: 0.3 quantile of pairwise distances
bandwidth = estimate_bandwidth(X_ms, quantile=0.2, n_samples=500)
print(f"Estimated bandwidth: {bandwidth:.3f}")
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_ms)
labels_ms = ms.labels_
centers_ms = ms.cluster_centers_
print(f"Clusters found: {len(np.unique(labels_ms))}")
print(f"Cluster centers:\n{centers_ms.round(2)}")
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.scatter(X_ms[:, 0], X_ms[:, 1], c=labels_ms, cmap='tab10',
s=20, alpha=0.6)
plt.scatter(centers_ms[:, 0], centers_ms[:, 1],
c='red', marker='X', s=200, zorder=5, label='Mode centers')
plt.title(f"Mean Shift Clustering - {len(np.unique(labels_ms))} clusters")
plt.legend()
plt.show()
Mean Shift advantages: it automatically determines K (one cluster per mode), handles arbitrary shapes, and naturally produces cluster centers that are actual data density peaks. Disadvantage: per iteration - slow for large datasets. Use HDBSCAN instead for large-scale work.
Algorithm Comparison
| Property | K-Means | DBSCAN | GMM | HDBSCAN |
|---|---|---|---|---|
| Specify K | Required | Not required | Required | Not required |
| Cluster shape | Spherical (Voronoi) | Arbitrary | Elliptical | Arbitrary |
| Varying density | Fails | Fails | Soft handling | Handles well |
| Outlier handling | None (all assigned) | Hard noise labels | Soft (low prob) | Soft outlier scores |
| Time complexity | ||||
| Soft assignments | No | No | Yes | Yes (probabilities) |
| Parameter sensitivity | Medium (K selection) | High (ε critical) | Medium | Low (robust) |
| Scalability | Excellent | Good | Moderate | Good |
| Interpretability | High (centroids) | Medium | Medium | Medium |
YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| StatQuest: DBSCAN | StatQuest with Josh Starmer | Clear visual walkthrough of core/border/noise classification |
| HDBSCAN Explained | Leland McInnes (author) | The algorithm's creator explains mutual reachability and stability |
| Clustering Algorithms Comparison | Normalized Nerd | Side-by-side comparison of K-means, DBSCAN, and GMM on various datasets |
| Anomaly Detection with DBSCAN | Python Engineer | Production anomaly detection patterns with sklearn |
| OPTICS Algorithm Explained | ritvikmath | Reachability plots and hierarchical cluster extraction |
Common Mistakes
:::danger Forgetting to Standardize Before DBSCAN
DBSCAN uses Euclidean distance. A feature with range 0–100,000 (income) will dominate over a feature with range 0–1 (probability). Always StandardScaler().fit_transform(X) before DBSCAN. This is not optional - it changes which points are considered neighbors and can completely change the cluster structure.
:::
:::danger Using DBSCAN for Uniformly Dense Spherical Data If your data genuinely has well-separated spherical clusters of similar density, K-means is a better choice - it is faster, easier to tune, and produces interpretable centroids. DBSCAN's advantages (arbitrary shapes, noise handling) are irrelevant in that case, and its sensitivity to ε becomes a liability. Choose the right tool for the data geometry. :::
:::warning DBSCAN Cannot Predict New Points
After fitting DBSCAN with fit_predict(), there is no predict() method - DBSCAN does not produce centroids or density models that generalize to new points. For production use where new points must be classified, you must: (1) extract core points after training, (2) build a nearest-neighbor index over core points, (3) classify new points by whether their nearest core point is within ε. This is exactly the DBSCANAnomalyDetector pattern above.
:::
:::warning DBSCAN Fails on Clusters of Varying Density If your clusters have dramatically different densities - an urban cluster with 1000 points/km² and a rural cluster with 2 points/km² - a single ε cannot simultaneously capture both. The urban cluster needs a small ε; the rural cluster needs a large one. DBSCAN will either label the rural points as noise (ε too small) or merge everything into one giant cluster (ε too large). Use HDBSCAN instead. :::
Interview Q&A
Q1: What are the three point types in DBSCAN and how are they defined?
Core points have at least MinPts neighbors within radius ε - including themselves - and form the dense interior of clusters. They drive cluster expansion: when a core point is processed, all points in its ε-neighborhood are added to the cluster. Border points are within distance ε of at least one core point but do not themselves have MinPts neighbors - they are on the edge of a cluster and are absorbed into it but cannot expand it further. Noise points (labeled −1) are neither core points nor within ε of any core point - they are isolated, low-density points that do not belong to any cluster. The critical asymmetry: directly density-reachable is not symmetric - is reachable from core point , but is not necessarily reachable from non-core .
Q2: How does DBSCAN handle varying cluster densities, and what algorithm solves this?
DBSCAN uses global parameters ε and MinPts, so a single ε cannot capture both dense and sparse clusters simultaneously. Setting ε small enough for a dense urban cluster means the sparse rural cluster's points won't have MinPts neighbors and will be labeled noise. Setting ε large enough for the sparse cluster will merge everything in the dense region into one giant component. DBSCAN fundamentally assumes uniform density within and between clusters. HDBSCAN solves this by computing mutual reachability distances, building a minimum spanning tree, and extracting clusters based on their stability across a full range of density thresholds. Each cluster is extracted at the scale where it is most stable - dense clusters at small scales, sparse clusters at large scales - all in a single run.
Q3: How do you choose ε for DBSCAN in practice?
Use the k-distance (k-NN distance) plot: compute the distance to the MinPts-th nearest neighbor for each point, sort these distances in descending order, and plot them. The curve has a sharp "elbow" where it transitions from high distances (sparse, noise region) to low distances (dense cluster region). Set ε at the elbow value. The elbow represents the natural density boundary in the data. If the elbow is ambiguous, grid-search over candidate ε values in the range suggested by the k-distance plot, score by silhouette on non-noise points, and apply constraints: reject candidates where the noise fraction exceeds 30–40% (too much noise means ε is too small) or where fewer than 2 clusters are found (ε is too large). Always standardize the data first - ε is in standardized distance units.
Q4: Why is DBSCAN naturally suited for anomaly detection while K-means is not?
K-means assigns every point to a cluster - there is no concept of an outlier. Points far from all centroids still get assigned to the nearest centroid, making them indistinguishable from normal cluster members without additional thresholding. DBSCAN explicitly labels low-density points as noise (−1), which is exactly the definition of an anomaly: a point that does not belong to any dense region. This makes DBSCAN noise labels directly usable as anomaly predictions with no additional thresholding logic. Furthermore, DBSCAN requires no labeled anomaly data - it is entirely unsupervised - making it applicable to novel anomaly types that have never been seen before. For production, extract core points after training and classify new points by their distance to the nearest core point: beyond ε → anomaly.
Q5: What is the time complexity of DBSCAN, and how do you scale it to millions of points?
With a spatial index (ball tree or k-d tree), DBSCAN runs in - each range query takes amortized, and there are queries. Without a spatial index, it is (brute-force pairwise distances). In high dimensions (), spatial indices degrade toward due to the curse of dimensionality. For production at scale: (1) reduce dimensions with PCA or UMAP before running DBSCAN, (2) use algorithm='ball_tree' in sklearn with leaf_size=30–50, (3) for , run DBSCAN on a representative sample (random or reservoir sampling), extract core points, and classify the full population by distance to core points - O(1) per new point using the prebuilt nearest-neighbor index, (4) for distributed scale, use HDBSCAN with approximate nearest neighbors (FAISS or ANNOY for neighbor graph construction).
Q6: Explain the HDBSCAN stability score and why it produces better clusters than OPTICS.
HDBSCAN assigns each cluster a stability score: , where . is the density threshold at which cluster split off from its parent, and is the threshold at which point fell out of (either left as noise or the cluster split). A cluster is "stable" if points remain in it across a wide range of density thresholds - wide "valley" in the reachability plot. HDBSCAN compares the stability of each cluster to the sum of its children's stabilities: if the children are collectively more stable, it splits the cluster; otherwise it keeps it as a flat cluster. This procedure automatically selects the right scale for each region of the data - unlike OPTICS, which produces the hierarchy but requires a manual cut to extract clusters.
Q7: You have a 50-dimensional dataset with 500,000 points. How would you approach density-based clustering?
First, apply dimensionality reduction: use UMAP (not PCA for non-linear structure) to reduce from 50D to 10–15D while preserving local topology. UMAP is and preserves the density structure needed by HDBSCAN. Then run HDBSCAN on the reduced data: it handles varying densities and the 500K scale is manageable in ~30 seconds with the fast C++ implementation. Set min_cluster_size based on the smallest meaningful cluster in your domain (e.g., if you care about groups of at least 100 users, set it to 100). Use prediction_data=True to enable fast inference on new points via hdbscan.approximate_predict(). Monitor cluster stability: if more than 40% of points are labeled noise, reduce min_cluster_size or re-examine the UMAP hyperparameters - the dimensionality reduction may not be preserving the density structure.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the DBSCAN Density Clustering demo on the EngineersOfAI Playground - no code required.
:::
