Master DBSCAN, OPTICS, HDBSCAN, and Mean Shift - density-based clustering algorithms that discover arbitrarily shaped clusters, handle varying densities, and identify anomalies without specifying the number of clusters.

How does HDBSCAN work in practice?

DBSCAN and Density-Based Clustering covers DBSCAN, HDBSCAN, OPTICS from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/unsupervised-learning/dbscan-and-density-methods

What is the difference between DBSCAN and OPTICS?

See the full breakdown at https://engineersofai.com/docs/ml/unsupervised-learning/dbscan-and-density-methods

DBSCAN and Density-Based Clustering

The Real Interview Moment

Your fraud detection team receives GPS coordinates for every card transaction - latitude, longitude, transaction amount, and time of day. You need to flag unusual transaction locations automatically, without knowing how many "home clusters" each user has. The clusters (cities where a user regularly transacts) have arbitrary shapes: urban sprawl, commute corridors, vacation patterns. They vary vastly in density - Manhattan has thousands of transactions in a square mile, rural Iowa has one per hundred square miles. And you expect noise: airport transactions, one-off hotel payments, road trips.

You bring K-means to the first team meeting. An engineer asks: "What K do we use per user?" You freeze. A user who lives in New York but travels to Chicago monthly needs K=2. A user who lives in one city their whole life needs K=1. A traveler needs K=8. K-means requires you to specify K in advance - but the number of legitimate clusters varies per user and is unknown. Worse, K-means will force every outlier transaction into a cluster, even the fraudulent midnight charge in Moscow.

DBSCAN handles all of this naturally. It identifies high-density regions as clusters without requiring K to be specified. It labels low-density isolated points as noise - directly interpretable as anomalies. It makes no assumptions about cluster shapes or sizes. And it extends naturally to HDBSCAN, which handles the variable-density problem between Manhattan and rural Iowa.

This lesson covers the full density-based clustering family - DBSCAN, OPTICS, HDBSCAN, and Mean Shift - with complete implementations, parameter selection strategies, and production anomaly detection patterns.

Why K-Means Fails Here: The Shape Problem

Before DBSCAN (1996), all major clustering algorithms assumed clusters had regular, convex geometry. K-means minimizes within-cluster sum of squares, which produces Voronoi-cell boundaries - always convex, always linear. The assignment rule is "nearest centroid," which means every cluster is a hyperball around its center.

This fails in three specific ways that matter in practice:

Non-convex shapes: Geographic clusters follow roads, coastlines, and city boundaries - crescent shapes, elongated corridors, concentric rings. K-means cuts through them with straight lines because it cannot represent non-convex cluster boundaries.

No noise handling: K-means assigns every point to a cluster. A transaction in the middle of the Pacific Ocean still gets assigned to "cluster 3" because that centroid is nearest. There is no concept of an outlier.

Uniform density assumption: K-means assumes clusters have roughly equal density because it uses a global ε-equivalent (the centroid distance) to decide membership. It cannot simultaneously capture both a dense urban cluster and a sparse rural cluster.

DBSCAN solves all three by redefining what a "cluster" means: not a region around a centroid, but a maximally connected dense region. Two points belong to the same cluster if there is a chain of dense neighborhoods connecting them - regardless of path shape.

Historical Context

1996: Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu publish "A density-based algorithm for discovering clusters in large spatial databases with noise" at KDD. The paper introduces DBSCAN and the three-way classification of points as core, border, and noise. It won the KDD 2014 Test of Time Award - rare recognition that an algorithm aged extremely well.

1999: The same group publishes OPTICS (Ordering Points To Identify the Clustering Structure), which addresses DBSCAN's inability to handle clusters of varying density by building an ordered reachability plot.

2017: Leland McInnes and John Healy publish HDBSCAN (Hierarchical DBSCAN), which extends OPTICS with a principled cluster extraction strategy based on stability. This is now the recommended default for density-based clustering in most applications.

The key insight in DBSCAN: instead of defining clusters by their central points (K-means, GMM), define them by density connectivity - a point belongs to a cluster if it is reachable through a chain of dense regions. This single shift in definition unlocks arbitrary cluster shapes, natural noise handling, and no requirement to specify K.

DBSCAN: Core Concepts

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions of points separated by sparse regions. It requires two parameters:

ε (epsilon): the neighborhood radius. A point's ε-neighborhood $N_\varepsilon(p)$ is the set of all points within distance ε from $p$ .
MinPts (min_samples): the minimum number of points required for a region to be considered "dense."

Three Point Types

Every point is classified as exactly one of three types:

Core point: A point $p$ is a core point if it has at least MinPts points within radius ε, including itself:

$|N_\varepsilon(p)| \geq \text{MinPts}$

Core points are the "interior" of clusters - they are surrounded by sufficient density to form the skeleton of a cluster.

Border point: A point $q$ is a border point if it is within distance ε of a core point (directly reachable from a core), but does not itself have MinPts neighbors. Border points are on the edge of clusters - they belong to a cluster but cannot propagate it further.

Noise point: A point that is neither a core point nor within distance ε of any core point. These are true outliers - DBSCAN labels them −1. They are isolated, low-density points that do not belong to any cluster.

Visualization: ε = 1.5, MinPts = 3

   Points:        Classification:

   * * *          C C C          (core points - 3+ neighbors each)
  * . * . *      C B C B C      (. = border, C = core)
   * * *          C C C

      *                N         (isolated → noise)

       * * *          C C C      (second cluster, density-connected)
      * * *          C C C

Density-Reachability and Connectivity

DBSCAN builds clusters through the concept of density-connectivity:

Directly density-reachable: $q$ is directly density-reachable from core point $p$ if $q \in N_\varepsilon(p)$ . Note: this is not symmetric - $p$ must be a core point, but $q$ need not be.

Density-reachable: $q$ is density-reachable from $p$ if there exists a chain $p_1, p_2, \ldots, p_n$ where $p_1 = p$ , $p_n = q$ , and each $p_{i+1}$ is directly density-reachable from core point $p_i$ .

Density-connected: $p$ and $q$ are density-connected if there exists a point $o$ such that both $p$ and $q$ are density-reachable from $o$ .

A cluster is a maximal set of density-connected points. This definition allows clusters of any shape - any connected dense region forms one cluster, regardless of its geometry. Two elongated arcs that share a bridge of dense points will be merged into one cluster.

The DBSCAN Algorithm

DBSCAN(X, ε, MinPts):
  cluster_id = 0
  labels = [-1] * len(X)   # -1 = unvisited / noise

  for each point p in X:
    if labels[p] != -1: continue   # already assigned

    neighbors = RangeQuery(X, p, ε)   # all points within ε of p

    if len(neighbors) < MinPts:
      labels[p] = -1   # mark as noise (may be relabeled later)
      continue

    # p is a core point - start new cluster
    cluster_id += 1
    labels[p] = cluster_id

    seed_set = neighbors - {p}   # points to expand from

    while seed_set is not empty:
      q = seed_set.pop()

      if labels[q] == -1:
        labels[q] = cluster_id   # relabel noise as border point

      if labels[q] != -1: continue  # already in a cluster

      labels[q] = cluster_id

      q_neighbors = RangeQuery(X, q, ε)
      if len(q_neighbors) >= MinPts:   # q is also a core point
        seed_set = seed_set ∪ q_neighbors   # expand from q

  return labels

The key property: when the algorithm reaches a core point, it floods all density-reachable points into the same cluster via the seed_set. Border points are absorbed when a core point's neighborhood includes them, but they do not themselves expand the cluster.

Time Complexity Analysis

The bottleneck is the RangeQuery - finding all points within distance ε of a given point.

Naive approach (O(n²)): For each of $n$ points, scan all $n$ points and check distance. Total: $O(n^2)$ .

With spatial index (O(n log n)): Build a ball tree or k-d tree on the data. Each range query runs in $O(\log n)$ amortized. Since we perform at most $n$ range queries, total cost is $O(n \log n)$ .

The spatial index is critical for performance. Sklearn's DBSCAN automatically selects ball_tree for low-dimensional data and brute force for high dimensions or small datasets.

High-dimensional degradation: Spatial indices lose effectiveness when $d > 20$ - in high dimensions, all points are approximately equidistant (curse of dimensionality) and the index provides little speedup. For high-dimensional data, apply dimensionality reduction (PCA, UMAP) before DBSCAN.

NumPy DBSCAN Implementation from Scratch

import numpy as np
from collections import deque

def range_query(X: np.ndarray, point_idx: int, eps: float) -> list:
    """
    Find all points within Euclidean distance eps of X[point_idx].
    O(n) naive scan - for production use a BallTree or KD-Tree.
    """
    point = X[point_idx]
    # Vectorized distance computation
    diffs = X - point                          # (n, d)
    sq_dists = np.sum(diffs ** 2, axis=1)      # (n,)
    return list(np.where(sq_dists <= eps ** 2)[0])


def dbscan_scratch(X: np.ndarray, eps: float,
                   min_samples: int) -> np.ndarray:
    """
    DBSCAN from scratch using the flooding algorithm.

    Parameters
    ----------
    X : (n, d) data array
    eps : neighborhood radius
    min_samples : minimum points for a core region (including self)

    Returns
    -------
    labels : (n,) array, -1 = noise, 0..K-1 = cluster ids
    """
    n = len(X)
    labels = np.full(n, -2, dtype=int)   # -2 = unvisited
    cluster_id = -1

    for i in range(n):
        if labels[i] != -2:
            continue   # already processed

        neighbors = range_query(X, i, eps)

        if len(neighbors) < min_samples:
            labels[i] = -1   # noise point (may be relabeled)
            continue

        # Point i is a core point - start new cluster
        cluster_id += 1
        labels[i] = cluster_id

        # BFS expansion
        seed_set = deque(neighbors)
        while seed_set:
            j = seed_set.popleft()

            if labels[j] == -1:
                # Was noise - relabel as border of this cluster
                labels[j] = cluster_id

            if labels[j] != -2:
                continue   # already in a cluster

            labels[j] = cluster_id
            j_neighbors = range_query(X, j, eps)

            if len(j_neighbors) >= min_samples:
                # j is also a core point - expand from it
                seed_set.extend(j_neighbors)

    # Remaining -2 (unvisited) should not exist - all points processed
    assert (labels == -2).sum() == 0, "Some points were not visited!"
    return labels


# Test on make_moons
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X_moons, y_true = make_moons(n_samples=300, noise=0.08, random_state=42)
X_scaled = StandardScaler().fit_transform(X_moons)

labels_scratch = dbscan_scratch(X_scaled, eps=0.3, min_samples=5)
n_clusters = labels_scratch.max() + 1
n_noise = (labels_scratch == -1).sum()
print(f"From scratch: {n_clusters} clusters, {n_noise} noise points")

# Verify against sklearn
from sklearn.cluster import DBSCAN
labels_sklearn = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
print(f"Sklearn:      {labels_sklearn.max()+1} clusters, "
      f"{(labels_sklearn==-1).sum()} noise points")

Sklearn DBSCAN with Parameter Tuning

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_blobs, make_circles
from sklearn.preprocessing import StandardScaler

# --- Example 1: Non-spherical clusters (K-means fails, DBSCAN wins) ---
X_moons, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5, metric='euclidean', algorithm='ball_tree')
labels = dbscan.fit_predict(X_moons)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise} ({100*n_noise/len(labels):.1f}%)")
print(f"Core sample count: {len(dbscan.core_sample_indices_)}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Moons
colors = ['red' if l == -1 else plt.cm.tab10(l/10) for l in labels]
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=colors, s=30, alpha=0.7)
axes[0].set_title(f"DBSCAN on Moons - {n_clusters} clusters, {n_noise} noise")

# Concentric circles
X_circles, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
labels_circles = DBSCAN(eps=0.15, min_samples=5).fit_predict(X_circles)
n_clusters_circles = len(set(labels_circles)) - (1 if -1 in labels_circles else 0)
colors_circles = ['red' if l == -1 else plt.cm.tab10(l/10) for l in labels_circles]
axes[1].scatter(X_circles[:, 0], X_circles[:, 1], c=colors_circles, s=30, alpha=0.7)
axes[1].set_title(f"DBSCAN on Circles - {n_clusters_circles} clusters")

# Blobs with outliers
X_blobs, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=42)
outliers = np.random.default_rng(42).uniform(-8, 8, size=(30, 2))
X_with_outliers = np.vstack([X_blobs, outliers])
labels_blobs = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_with_outliers)
n_noise_blobs = (labels_blobs == -1).sum()
colors_blobs = ['red' if l == -1 else plt.cm.tab10(l/10) for l in labels_blobs]
axes[2].scatter(X_with_outliers[:, 0], X_with_outliers[:, 1],
                c=colors_blobs, s=30, alpha=0.7)
axes[2].set_title(f"DBSCAN on Blobs+Outliers - {n_noise_blobs} noise (red)")

plt.suptitle("DBSCAN on Various Cluster Shapes", fontsize=14)
plt.tight_layout()
plt.show()

Choosing ε and MinPts: The k-Distance Graph

This is the central practical challenge with DBSCAN. Poor parameter choices lead to degenerate results: too-small ε makes everything noise; too-large ε merges all clusters into one.

Choosing MinPts

Rule of thumb: MinPts ≥ dimensionality + 1. For 2D data: MinPts = 3. For 10D data: MinPts ≥ 11. For noisy data, use higher values (10–20) to require denser regions.

The intuition: MinPts controls the minimum "resolution" of a cluster. Higher MinPts means you only identify very dense regions as cluster cores. This filters out low-density clusters but also makes the algorithm more robust to noise.

A secondary rule: MinPts = 2 always works (any pair of mutually ε-close points forms a cluster), but produces poor results. The practical recommendation is MinPts = 2 × dimensionality.

Choosing ε: The k-Distance Plot

The best approach for ε selection is the k-distance plot (also called the k-NN elbow plot):

For each point, compute the distance to its k-th nearest neighbor (where $k$ = MinPts − 1)
Sort these distances in descending order
Plot the sorted distances
Look for the "elbow" - the point where the sorted distances make a sharp bend

Points below the elbow have their k-th neighbor within a dense cluster. Points above the elbow are in sparse regions (noise). The ε value at the elbow is the natural density threshold.

from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

def plot_knn_distances(X: np.ndarray, min_samples: int = 5,
                       n_candidates: int = 5):
    """
    Plot the k-NN distance graph for ε selection.

    The elbow in this plot is a good starting value for ε.

    Parameters
    ----------
    X : (n, d) data array (should be standardized first)
    min_samples : the MinPts parameter you plan to use
    n_candidates : number of candidate ε values to suggest

    Returns
    -------
    k_distances : sorted k-NN distances
    suggested_eps : estimated elbow value
    """
    k = min_samples - 1   # k-th nearest neighbor distance

    nbrs = NearestNeighbors(n_neighbors=k + 1, algorithm='ball_tree')
    nbrs.fit(X)
    distances, _ = nbrs.kneighbors(X)

    # Distance to k-th nearest neighbor (col k = k-th neighbor, col 0 = self)
    k_distances = np.sort(distances[:, k])[::-1]   # descending

    # Estimate elbow via maximum curvature
    # Second derivative of sorted k-distances
    second_deriv = np.diff(np.diff(k_distances))
    elbow_idx = np.argmax(np.abs(second_deriv)) + 1
    suggested_eps = k_distances[elbow_idx]

    plt.figure(figsize=(10, 5))
    plt.plot(k_distances, lw=2, color='steelblue')
    plt.axvline(x=elbow_idx, color='orange', linestyle='--', alpha=0.7)
    plt.axhline(y=suggested_eps, color='red', linestyle='--',
                label=f'Suggested ε ≈ {suggested_eps:.3f}')
    plt.xlabel("Points (sorted by k-NN distance, descending)")
    plt.ylabel(f"Distance to {k}-th nearest neighbor")
    plt.title(f"k-Distance Plot (k={k}, MinPts={min_samples})\n"
              f"Elbow → suggested ε = {suggested_eps:.3f}")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

    # Suggest a range of candidates around the elbow
    candidates = np.percentile(k_distances, np.linspace(5, 25, n_candidates))
    print(f"Candidate ε values: {candidates.round(3)}")
    print(f"Elbow estimate: {suggested_eps:.4f}")

    return k_distances, suggested_eps


# Always standardize before running DBSCAN
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_moons)
k_dists, eps_candidate = plot_knn_distances(X_scaled, min_samples=5)

Silhouette-Based Grid Search for ε

When the k-distance plot gives an ambiguous elbow, grid-search over candidate ε values:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import numpy as np

def tune_dbscan(X: np.ndarray, min_samples: int = 5,
                eps_range: np.ndarray = None):
    """
    Grid search over ε values, scored by silhouette on non-noise points.
    """
    if eps_range is None:
        eps_range = np.linspace(0.1, 1.5, 30)

    results = []
    for eps in eps_range:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_frac = (labels == -1).mean()

        # Only score if we have at least 2 clusters and not too much noise
        if n_clusters < 2 or noise_frac > 0.5:
            results.append({'eps': eps, 'n_clusters': n_clusters,
                            'noise_frac': noise_frac, 'silhouette': np.nan})
            continue

        # Compute silhouette only on non-noise points
        mask = labels != -1
        if mask.sum() < 2 * n_clusters:  # too few points per cluster
            results.append({'eps': eps, 'n_clusters': n_clusters,
                            'noise_frac': noise_frac, 'silhouette': np.nan})
            continue

        sil = silhouette_score(X[mask], labels[mask])
        results.append({'eps': eps, 'n_clusters': n_clusters,
                        'noise_frac': noise_frac, 'silhouette': sil})

    import pandas as pd
    results_df = pd.DataFrame(results)
    valid = results_df.dropna(subset=['silhouette'])

    if len(valid) == 0:
        print("No valid ε found - try a different range or min_samples")
        return results_df, None

    best = valid.loc[valid['silhouette'].idxmax()]
    print(f"Best ε = {best['eps']:.3f}: "
          f"{int(best['n_clusters'])} clusters, "
          f"{best['noise_frac']:.1%} noise, "
          f"silhouette = {best['silhouette']:.4f}")
    return results_df, best['eps']

results, best_eps = tune_dbscan(X_scaled, min_samples=5,
                                eps_range=np.linspace(0.05, 0.8, 40))

DBSCAN as an Anomaly Detector

Points labeled −1 (noise) are anomalies by definition - they do not belong to any dense cluster. This makes DBSCAN a natural unsupervised anomaly detection tool, requiring no labeled anomaly data.

The workflow: fit DBSCAN on normal operational data. Any new point that falls outside all existing dense clusters (distance to nearest core point exceeds ε) is flagged as an anomaly.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Simulate transaction data: lat, lon, amount, hour
np.random.seed(42)
n_legit = 2000
n_fraud = 40

# Legitimate transactions cluster around home city (NYC area)
legitimate = np.column_stack([
    np.random.normal(40.73, 0.06, n_legit),    # lat: NYC
    np.random.normal(-73.99, 0.06, n_legit),   # lon: NYC
    np.random.exponential(60, n_legit),         # amount: typical purchases
    np.random.choice(range(8, 22), n_legit),   # hour: business hours
])

# Add a second cluster (user travels to Chicago monthly)
chicago_visits = np.column_stack([
    np.random.normal(41.88, 0.04, 100),
    np.random.normal(-87.63, 0.04, 100),
    np.random.exponential(80, 100),
    np.random.choice(range(8, 22), 100),
])
legitimate = np.vstack([legitimate, chicago_visits])
n_legit = len(legitimate)

# Fraudulent: random US locations, unusual amounts, late night
fraud = np.column_stack([
    np.random.uniform(30, 48, n_fraud),
    np.random.uniform(-120, -70, n_fraud),
    np.random.exponential(800, n_fraud),        # large unusual amounts
    np.random.choice(range(0, 5), n_fraud),    # late night: 12am-5am
])

X = np.vstack([legitimate, fraud])
y_true = np.array([0]*n_legit + [1]*n_fraud)   # 0=legit, 1=fraud

# Scale: all features to comparable range
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Run DBSCAN
db = DBSCAN(eps=0.5, min_samples=10, algorithm='ball_tree')
labels = db.fit_predict(X_scaled)

# Noise points (-1) are anomaly candidates
predicted_anomalies = (labels == -1).astype(int)

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
print(f"Precision: {precision_score(y_true, predicted_anomalies):.3f}")
print(f"Recall:    {recall_score(y_true, predicted_anomalies):.3f}")
print(f"F1:        {f1_score(y_true, predicted_anomalies):.3f}")
print(f"\nConfusion matrix:")
print(confusion_matrix(y_true, predicted_anomalies))

Production Anomaly Detection Pipeline

The key insight for production: fit DBSCAN offline to learn the "normal" cluster structure, then classify new points at inference time by their distance to existing core points - without re-running the full DBSCAN.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
import numpy as np

class DBSCANAnomalyDetector:
    """
    Unsupervised anomaly detector using DBSCAN.
    Fit on normal baseline data; score new observations by proximity to core points.
    """

    def __init__(self, eps: float = 0.5, min_samples: int = 10):
        self.eps = eps
        self.min_samples = min_samples
        self.scaler = StandardScaler()
        self._dbscan = DBSCAN(eps=eps, min_samples=min_samples,
                              algorithm='ball_tree')
        self._core_nn = None
        self.is_fitted = False

    def fit(self, X_normal: np.ndarray) -> 'DBSCANAnomalyDetector':
        """
        Learn the density structure of normal data.

        Parameters
        ----------
        X_normal : (n, d) array of normal (non-anomalous) observations
        """
        X_scaled = self.scaler.fit_transform(X_normal)
        self._dbscan.fit(X_scaled)

        # Extract core sample positions for fast inference
        core_indices = self._dbscan.core_sample_indices_
        if len(core_indices) == 0:
            raise ValueError(
                "No core points found - try increasing eps or decreasing min_samples"
            )

        self._X_core = X_scaled[core_indices]   # (n_core, d)
        self._core_nn = NearestNeighbors(
            n_neighbors=1, algorithm='ball_tree', metric='euclidean'
        )
        self._core_nn.fit(self._X_core)

        self.is_fitted = True
        print(f"Fitted: {len(X_normal)} training points, "
              f"{len(core_indices)} core points, "
              f"{int(self._dbscan.labels_.max()) + 1} clusters")
        return self

    def anomaly_score(self, X: np.ndarray) -> np.ndarray:
        """
        Return anomaly scores: distance to nearest core point.
        Higher score = more anomalous.
        Score > eps → anomaly; Score <= eps → normal.
        """
        assert self.is_fitted, "Call fit() first"
        X_scaled = self.scaler.transform(X)
        distances, _ = self._core_nn.kneighbors(X_scaled)
        return distances[:, 0]   # (n,)

    def predict(self, X: np.ndarray,
                threshold: float = None) -> np.ndarray:
        """
        Return predictions: -1 = anomaly, 0 = normal.
        Default threshold = eps (same as DBSCAN definition).
        """
        threshold = threshold or self.eps
        scores = self.anomaly_score(X)
        return np.where(scores > threshold, -1, 0)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Return anomaly probability as sigmoid of (score - eps) / eps.
        """
        scores = self.anomaly_score(X)
        # Map: score=0 → prob≈0, score=eps → prob≈0.5, score=2*eps → prob≈0.73
        z = (scores - self.eps) / (self.eps + 1e-8)
        return 1 / (1 + np.exp(-z))


# Usage
detector = DBSCANAnomalyDetector(eps=0.5, min_samples=10)
detector.fit(legitimate[:1000])   # fit on training set only

# Score new transactions - high score = anomaly
test_legit = legitimate[1000:1100]
test_fraud = fraud[:20]
X_test = np.vstack([test_legit, test_fraud])
y_test = np.array([0]*100 + [1]*20)

scores = detector.anomaly_score(X_test)
labels_pred = detector.predict(X_test)

from sklearn.metrics import roc_auc_score
print(f"ROC-AUC:   {roc_auc_score(y_test, scores):.3f}")
print(f"Precision: {precision_score(y_test, labels_pred == -1):.3f}")
print(f"Recall:    {recall_score(y_test, labels_pred == -1):.3f}")

OPTICS: Ordering Points to Identify Clustering Structure

OPTICS (Kriegel et al., 1999) extends DBSCAN to handle clusters of varying density - the core limitation of DBSCAN. Instead of committing to a single ε, OPTICS builds a reachability plot that encodes the hierarchical density structure at all scales simultaneously.

Core Concepts

Core distance of point $p$ : the distance to its MinPts-th nearest neighbor. If $p$ has fewer than MinPts neighbors within any radius, its core distance is undefined (it can never be a core point).

$\text{core-dist}(p) = d(p, N_{\text{MinPts}}(p))$

Reachability distance from core point $o$ to point $p$ :

$\text{reach-dist}(p, o) = \max(\text{core-dist}(o), d(o, p))$

This is the maximum of the distance between $o$ and $p$ , and the core distance of $o$ . It smooths over the density of the origin point.

OPTICS processes points in order of increasing reachability distance and produces a reachability plot. The plot has valleys (low reachability) corresponding to dense clusters and peaks (high reachability) corresponding to sparse regions between clusters. Clusters can be extracted by finding valleys at any horizontal threshold - this is equivalent to running DBSCAN at that ε level.

from sklearn.cluster import OPTICS
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create data with three clusters of different densities
centers = [[0, 0], [5, 5], [10, 0]]
X_varied, _ = make_blobs(
    n_samples=[100, 50, 200],    # different sizes
    centers=centers,
    cluster_std=[0.3, 0.8, 0.5],  # different densities
    random_state=42
)

optics = OPTICS(
    min_samples=5,
    max_eps=np.inf,      # no upper bound on ε (full hierarchy)
    metric='euclidean',
    cluster_method='xi',  # extract clusters by xi-steep descent
    xi=0.05              # minimum steepness threshold
)
optics.fit(X_varied)

labels_optics = optics.labels_
n_clusters = labels_optics.max() + 1
print(f"OPTICS found {n_clusters} clusters, "
      f"{(labels_optics == -1).sum()} noise points")

# Reachability plot - the key visualization for OPTICS
reachability = optics.reachability_[optics.ordering_]
cluster_colors = ['steelblue', 'tomato', 'forestgreen', 'purple']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Reachability plot
for klass in range(n_clusters):
    Xk = reachability[labels_optics[optics.ordering_] == klass]
    axes[0].plot(Xk, '.', ms=4,
                 color=cluster_colors[klass % len(cluster_colors)])
Xn = reachability[labels_optics[optics.ordering_] == -1]
axes[0].plot(Xn, 'k.', ms=2, alpha=0.3, label='Noise')
axes[0].set_ylabel("Reachability Distance")
axes[0].set_title("OPTICS Reachability Plot\nValleys = clusters")
axes[0].legend()

# Cluster scatter
colors_optics = [cluster_colors[l % len(cluster_colors)]
                 if l != -1 else 'black'
                 for l in labels_optics]
axes[1].scatter(X_varied[:, 0], X_varied[:, 1],
                c=colors_optics, s=20, alpha=0.7)
axes[1].set_title("OPTICS Cluster Result - handles varying density")

plt.tight_layout()
plt.show()

HDBSCAN: Hierarchical DBSCAN

HDBSCAN (McInnes & Healy, 2017) is the modern state-of-the-art for density-based clustering. It extends OPTICS with a rigorous mathematical framework for cluster extraction based on cluster stability.

How HDBSCAN Works

Step 1 - Transform the space: Compute the mutual reachability distance:

$d_{\text{mreach}}(a, b) = \max(\text{core-dist}(a), \text{core-dist}(b), d(a, b))$

This transformation makes the space more uniform, reducing the impact of density variations.

Step 2 - Build a minimum spanning tree: Compute the minimum spanning tree (MST) of the complete graph with mutual reachability distances as edge weights. This is $O(n^2)$ naively but $O(n \log n)$ with Prim's algorithm on a k-NN graph.

Step 3 - Build a cluster hierarchy: Sort the MST edges by decreasing weight. Remove edges one at a time. Each removal splits a component - this is equivalent to running DBSCAN at every possible ε simultaneously. The result is a dendrogram.

Step 4 - Extract stable clusters: Assign each cluster a stability score $S(C)$ :

$S(C) = \sum_{x \in C} \left(\lambda_{\text{death}}(x) - \lambda_{\text{birth}}(C)\right)$

where $\lambda = 1/\varepsilon$ (inverse epsilon), $\lambda_{\text{birth}}$ is when the cluster formed, and $\lambda_{\text{death}}$ is when the point left the cluster (or the cluster split). Clusters with high stability are preserved; those whose children are more stable are replaced by their children.

This gives HDBSCAN a critical advantage: it automatically selects the scale at which to extract each cluster, enabling it to find clusters at different density levels simultaneously.

# pip install hdbscan
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Data with very different densities
X1, _ = make_blobs(n_samples=200, centers=[[0,0]], cluster_std=0.2, random_state=42)
X2, _ = make_blobs(n_samples=200, centers=[[5,5]], cluster_std=1.2, random_state=42)
X_varying = np.vstack([X1, X2])

# DBSCAN: must choose a single eps - will either miss X2 or merge X1 into noise
labels_dbscan = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_varying)
print(f"DBSCAN:  {labels_dbscan.max()+1} clusters, "
      f"{(labels_dbscan==-1).sum()} noise")

# HDBSCAN: handles both densities automatically
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,    # smallest cluster to extract
    min_samples=5,          # density threshold (default = min_cluster_size)
    cluster_selection_method='eom',  # excess of mass (default, recommended)
    metric='euclidean',
    prediction_data=True    # enables soft clustering and new-point prediction
)
clusterer.fit(X_varying)
labels_hdb = clusterer.labels_
print(f"HDBSCAN: {labels_hdb.max()+1} clusters, "
      f"{(labels_hdb==-1).sum()} noise")

# HDBSCAN provides rich outputs beyond labels:
print(f"\nCluster membership probabilities (first 5):")
print(clusterer.probabilities_[:5].round(3))   # how confidently assigned

print(f"\nOutlier scores (first 5, higher = more anomalous):")
print(clusterer.outlier_scores_[:5].round(3))

# Top 10 most anomalous points
top_anomalies = np.argsort(clusterer.outlier_scores_)[-10:]
print(f"\nTop anomaly indices: {top_anomalies}")

# Predict new points (requires prediction_data=True)
X_new = np.random.randn(5, 2) * 2 + 2.5
new_labels, new_probs = hdbscan.approximate_predict(clusterer, X_new)
print(f"\nNew point labels: {new_labels}")
print(f"New point probs:  {new_probs.round(3)}")

HDBSCAN Parameter Guide

# The main parameter is min_cluster_size
# Rule of thumb: set to the smallest cluster size you care about

# Small min_cluster_size (5-10): many small clusters, more noise
clusterer_fine = hdbscan.HDBSCAN(min_cluster_size=5)
clusterer_fine.fit(X_varying)

# Large min_cluster_size (50-100): fewer, larger clusters
clusterer_coarse = hdbscan.HDBSCAN(min_cluster_size=50)
clusterer_coarse.fit(X_varying)

# min_samples (optional): controls how conservative core point definition is
# Higher min_samples → fewer core points → more noise (more conservative)
clusterer_conservative = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=10)
clusterer_conservative.fit(X_varying)

# cluster_selection_method:
# 'eom' (excess of mass): default, finds compact clusters
# 'leaf': always takes the leaf clusters - produces many small clusters
clusterer_leaf = hdbscan.HDBSCAN(min_cluster_size=15,
                                  cluster_selection_method='leaf')
clusterer_leaf.fit(X_varying)

for name, c in [('fine', clusterer_fine),
                ('coarse', clusterer_coarse),
                ('conservative', clusterer_conservative),
                ('leaf', clusterer_leaf)]:
    n_cl = c.labels_.max() + 1
    n_noise = (c.labels_ == -1).sum()
    print(f"{name:14s}: {n_cl} clusters, {n_noise} noise")

Mean Shift Clustering

Mean Shift is a kernel density estimation approach to clustering that finds modes (local maxima) of the data density. Unlike DBSCAN, it requires no MinPts parameter - only a bandwidth $h$ (analogous to ε but for the kernel).

Algorithm

For each point $x$ :

Compute the weighted mean of all points within bandwidth $h$ : $m(x) = \frac{\sum_{x_i \in N_h(x)} K\left(\frac{x - x_i}{h}\right) x_i}{\sum_{x_i \in N_h(x)} K\left(\frac{x - x_i}{h}\right)}$ where $K$ is a kernel function (typically Gaussian or flat/uniform).
Update $x \leftarrow m(x)$ - shift the point toward the weighted mean.
Repeat until convergence. Each point converges to a local density maximum.
Points that converge to the same mode are assigned to the same cluster.

from sklearn.cluster import MeanShift, estimate_bandwidth
import numpy as np
from sklearn.datasets import make_blobs

X_ms, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# Estimate bandwidth using the rule of thumb: 0.3 quantile of pairwise distances
bandwidth = estimate_bandwidth(X_ms, quantile=0.2, n_samples=500)
print(f"Estimated bandwidth: {bandwidth:.3f}")

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_ms)

labels_ms = ms.labels_
centers_ms = ms.cluster_centers_
print(f"Clusters found: {len(np.unique(labels_ms))}")
print(f"Cluster centers:\n{centers_ms.round(2)}")

import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.scatter(X_ms[:, 0], X_ms[:, 1], c=labels_ms, cmap='tab10',
            s=20, alpha=0.6)
plt.scatter(centers_ms[:, 0], centers_ms[:, 1],
            c='red', marker='X', s=200, zorder=5, label='Mode centers')
plt.title(f"Mean Shift Clustering - {len(np.unique(labels_ms))} clusters")
plt.legend()
plt.show()

Mean Shift advantages: it automatically determines K (one cluster per mode), handles arbitrary shapes, and naturally produces cluster centers that are actual data density peaks. Disadvantage: $O(n^2)$ per iteration - slow for large datasets. Use HDBSCAN instead for large-scale work.

Algorithm Comparison

Property	K-Means	DBSCAN	GMM	HDBSCAN
Specify K	Required	Not required	Required	Not required
Cluster shape	Spherical (Voronoi)	Arbitrary	Elliptical	Arbitrary
Varying density	Fails	Fails	Soft handling	Handles well
Outlier handling	None (all assigned)	Hard noise labels	Soft (low prob)	Soft outlier scores
Time complexity	$O(nKdT)$	$O(n \log n)$	$O(nKd^2T)$	$O(n \log n)$
Soft assignments	No	No	Yes	Yes (probabilities)
Parameter sensitivity	Medium (K selection)	High (ε critical)	Medium	Low (robust)
Scalability	Excellent	Good	Moderate	Good
Interpretability	High (centroids)	Medium	Medium	Medium

YouTube Resources

Title	Channel	Why Watch
StatQuest: DBSCAN	StatQuest with Josh Starmer	Clear visual walkthrough of core/border/noise classification
HDBSCAN Explained	Leland McInnes (author)	The algorithm's creator explains mutual reachability and stability
Clustering Algorithms Comparison	Normalized Nerd	Side-by-side comparison of K-means, DBSCAN, and GMM on various datasets
Anomaly Detection with DBSCAN	Python Engineer	Production anomaly detection patterns with sklearn
OPTICS Algorithm Explained	ritvikmath	Reachability plots and hierarchical cluster extraction

Common Mistakes

:::danger Forgetting to Standardize Before DBSCAN DBSCAN uses Euclidean distance. A feature with range 0–100,000 (income) will dominate over a feature with range 0–1 (probability). Always StandardScaler().fit_transform(X) before DBSCAN. This is not optional - it changes which points are considered neighbors and can completely change the cluster structure. :::

:::danger Using DBSCAN for Uniformly Dense Spherical Data If your data genuinely has well-separated spherical clusters of similar density, K-means is a better choice - it is faster, easier to tune, and produces interpretable centroids. DBSCAN's advantages (arbitrary shapes, noise handling) are irrelevant in that case, and its sensitivity to ε becomes a liability. Choose the right tool for the data geometry. :::

:::warning DBSCAN Cannot Predict New Points After fitting DBSCAN with fit_predict(), there is no predict() method - DBSCAN does not produce centroids or density models that generalize to new points. For production use where new points must be classified, you must: (1) extract core points after training, (2) build a nearest-neighbor index over core points, (3) classify new points by whether their nearest core point is within ε. This is exactly the DBSCANAnomalyDetector pattern above. :::

:::warning DBSCAN Fails on Clusters of Varying Density If your clusters have dramatically different densities - an urban cluster with 1000 points/km² and a rural cluster with 2 points/km² - a single ε cannot simultaneously capture both. The urban cluster needs a small ε; the rural cluster needs a large one. DBSCAN will either label the rural points as noise (ε too small) or merge everything into one giant cluster (ε too large). Use HDBSCAN instead. :::

Interview Q&A

Q1: What are the three point types in DBSCAN and how are they defined?

Core points have at least MinPts neighbors within radius ε - including themselves - and form the dense interior of clusters. They drive cluster expansion: when a core point is processed, all points in its ε-neighborhood are added to the cluster. Border points are within distance ε of at least one core point but do not themselves have MinPts neighbors - they are on the edge of a cluster and are absorbed into it but cannot expand it further. Noise points (labeled −1) are neither core points nor within ε of any core point - they are isolated, low-density points that do not belong to any cluster. The critical asymmetry: directly density-reachable is not symmetric - $q$ is reachable from core point $p$ , but $p$ is not necessarily reachable from non-core $q$ .

Q2: How does DBSCAN handle varying cluster densities, and what algorithm solves this?

DBSCAN uses global parameters ε and MinPts, so a single ε cannot capture both dense and sparse clusters simultaneously. Setting ε small enough for a dense urban cluster means the sparse rural cluster's points won't have MinPts neighbors and will be labeled noise. Setting ε large enough for the sparse cluster will merge everything in the dense region into one giant component. DBSCAN fundamentally assumes uniform density within and between clusters. HDBSCAN solves this by computing mutual reachability distances, building a minimum spanning tree, and extracting clusters based on their stability across a full range of density thresholds. Each cluster is extracted at the scale where it is most stable - dense clusters at small scales, sparse clusters at large scales - all in a single run.

Q3: How do you choose ε for DBSCAN in practice?

Use the k-distance (k-NN distance) plot: compute the distance to the MinPts-th nearest neighbor for each point, sort these distances in descending order, and plot them. The curve has a sharp "elbow" where it transitions from high distances (sparse, noise region) to low distances (dense cluster region). Set ε at the elbow value. The elbow represents the natural density boundary in the data. If the elbow is ambiguous, grid-search over candidate ε values in the range suggested by the k-distance plot, score by silhouette on non-noise points, and apply constraints: reject candidates where the noise fraction exceeds 30–40% (too much noise means ε is too small) or where fewer than 2 clusters are found (ε is too large). Always standardize the data first - ε is in standardized distance units.

Q4: Why is DBSCAN naturally suited for anomaly detection while K-means is not?

K-means assigns every point to a cluster - there is no concept of an outlier. Points far from all centroids still get assigned to the nearest centroid, making them indistinguishable from normal cluster members without additional thresholding. DBSCAN explicitly labels low-density points as noise (−1), which is exactly the definition of an anomaly: a point that does not belong to any dense region. This makes DBSCAN noise labels directly usable as anomaly predictions with no additional thresholding logic. Furthermore, DBSCAN requires no labeled anomaly data - it is entirely unsupervised - making it applicable to novel anomaly types that have never been seen before. For production, extract core points after training and classify new points by their distance to the nearest core point: beyond ε → anomaly.

Q5: What is the time complexity of DBSCAN, and how do you scale it to millions of points?

With a spatial index (ball tree or k-d tree), DBSCAN runs in $O(n \log n)$ - each range query takes $O(\log n)$ amortized, and there are $O(n)$ queries. Without a spatial index, it is $O(n^2)$ (brute-force pairwise distances). In high dimensions ( $d > 20$ ), spatial indices degrade toward $O(n^2)$ due to the curse of dimensionality. For production at scale: (1) reduce dimensions with PCA or UMAP before running DBSCAN, (2) use algorithm='ball_tree' in sklearn with leaf_size=30–50, (3) for $n > 500{,}000$ , run DBSCAN on a representative sample (random or reservoir sampling), extract core points, and classify the full population by distance to core points - O(1) per new point using the prebuilt nearest-neighbor index, (4) for distributed scale, use HDBSCAN with approximate nearest neighbors (FAISS or ANNOY for neighbor graph construction).

Q6: Explain the HDBSCAN stability score and why it produces better clusters than OPTICS.

HDBSCAN assigns each cluster a stability score: $S(C) = \sum_{x \in C} (\lambda_{\text{death}}(x) - \lambda_{\text{birth}}(C))$ , where $\lambda = 1/\varepsilon$ . $\lambda_{\text{birth}}$ is the density threshold at which cluster $C$ split off from its parent, and $\lambda_{\text{death}}(x)$ is the threshold at which point $x$ fell out of $C$ (either left as noise or the cluster split). A cluster is "stable" if points remain in it across a wide range of density thresholds - wide "valley" in the reachability plot. HDBSCAN compares the stability of each cluster to the sum of its children's stabilities: if the children are collectively more stable, it splits the cluster; otherwise it keeps it as a flat cluster. This procedure automatically selects the right scale for each region of the data - unlike OPTICS, which produces the hierarchy but requires a manual cut to extract clusters.

Q7: You have a 50-dimensional dataset with 500,000 points. How would you approach density-based clustering?

First, apply dimensionality reduction: use UMAP (not PCA for non-linear structure) to reduce from 50D to 10–15D while preserving local topology. UMAP is $O(n \log n)$ and preserves the density structure needed by HDBSCAN. Then run HDBSCAN on the reduced data: it handles varying densities and the 500K scale is manageable in ~30 seconds with the fast C++ implementation. Set min_cluster_size based on the smallest meaningful cluster in your domain (e.g., if you care about groups of at least 100 users, set it to 100). Use prediction_data=True to enable fast inference on new points via hdbscan.approximate_predict(). Monitor cluster stability: if more than 40% of points are labeled noise, reduce min_cluster_size or re-examine the UMAP hyperparameters - the dimensionality reduction may not be preserving the density structure.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the DBSCAN Density Clustering demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why K-Means Fails Here: The Shape Problem​

Historical Context​

DBSCAN: Core Concepts​

Three Point Types​

Density-Reachability and Connectivity​

The DBSCAN Algorithm​

Time Complexity Analysis​

NumPy DBSCAN Implementation from Scratch​

Sklearn DBSCAN with Parameter Tuning​

Choosing ε and MinPts: The k-Distance Graph​

Choosing MinPts​

Choosing ε: The k-Distance Plot​

Silhouette-Based Grid Search for ε​

DBSCAN as an Anomaly Detector​

Production Anomaly Detection Pipeline​

OPTICS: Ordering Points to Identify Clustering Structure​

Core Concepts​

HDBSCAN: Hierarchical DBSCAN​

How HDBSCAN Works​

HDBSCAN Parameter Guide​

Mean Shift Clustering​

Algorithm​

Algorithm Comparison​

YouTube Resources​

Common Mistakes​

Interview Q&A​