Algorithms like K-Means struggle when clusters are not nicely spherical or have varying densities. DBSCAN is a completely different kind of clustering algorithm that excels in these scenarios. It groups points together based on density.
The core idea is simple: A cluster is a dense region of points, separated from other dense regions by areas of lower point density.
Key Concepts and Parameters
DBSCAN's logic revolves around two simple parameters and three point classifications:
Parameters:
- eps (epsilon, ε): This is a radius. It defines the "neighborhood" around a data point.
- min_samples: The minimum number of data points (including the point itself) that must be within a point's eps radius for it to be considered a dense region.
Point Classifications:
- Core Point: A point that has at least min_samples within its eps neighborhood. These are the hearts of clusters.
- Border Point: A point that is reachable from a core point (i.e., within its eps neighborhood) but does not have min_samples in its own neighborhood. These are the edges of clusters.
- Noise Point (Outlier): A point that is neither a core point nor a border point. It lies alone in a sparse region.
[Image illustrating core, border, and noise points in DBSCAN]
The DBSCAN Algorithm
- The algorithm picks an arbitrary, unvisited data point.
- It checks if this point is a core point.
- If it is, a new cluster is started. This cluster is expanded by finding all connected core points and their respective border points. This process is called "density-reachability."
- If the chosen point is not a core point, it is temporarily labeled as noise. It might later be found by another cluster and become a border point.
- Repeat the process until all points have been visited.
The great advantages of DBSCAN are that it can find clusters of any shape and it naturally identifies outliers. Its main disadvantage is that it can struggle with clusters of varying densities.
Python
# Python code with scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Generate non-spherical data (two interleaving half circles)
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
# Apply DBSCAN
# Choosing eps and min_samples can require some experimentation
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)
# The cluster label -1 is assigned to noise points by scikit-learn
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='plasma', s=50)
plt.title(f'DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
# You can also check the number of clusters and noise points
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = list(clusters).count(-1)
print(f'Estimated number of clusters: {n_clusters}')
print(f'Estimated number of noise points: {n_noise}')