Unlike K-Means, Hierarchical Clustering doesn't require you to specify the number of clusters beforehand. Instead, it builds a multi-level hierarchy of clusters, which is often more informative.

There are two main approaches:

  1. Agglomerative (Bottom-Up): This is the most common approach. It starts with each data point as its own individual cluster and then iteratively merges the closest pair of clusters until only one cluster (containing all the data) remains.
  2. Divisive (Top-Down): This is less common. It starts with all data points in a single cluster and recursively splits them until each point is its own cluster.

We'll focus on the agglomerative method.

The Agglomerative Process

The algorithm is as follows:

  1. Start: Treat each data point as a single-point cluster.
  2. Merge: Find the two "closest" clusters and merge them into a single new cluster.
  3. Repeat: Repeat the merge step until all points are in a single, all-encompassing cluster.
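
A minimal from-scratch sketch of this loop, assuming a NumPy array X of points (the function name is illustrative, not from any library; it uses the minimum pairwise point distance as its notion of "closest", one of the linkage criteria discussed next):

import numpy as np

def agglomerative_merges(X):
    # Step 1: every point starts as its own single-point cluster
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Step 3: keep merging until one all-encompassing cluster remains
    while len(clusters) > 1:
        # Step 2: find the two closest clusters, here measured by the
        # minimum pairwise point distance (single linkage)
        best_a, best_b, best_d = None, None, float('inf')
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        merges.append((clusters[best_a], clusters[best_b], best_d))
        # Merge the winning pair into a single new cluster
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return merges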

But how do we define the "closest" clusters? This is determined by the linkage criterion:

  • Single Linkage: The distance between two clusters is the minimum distance between any point in one cluster and any point in the other. It tends to produce long, "chained" clusters and is sensitive to noise and outliers.
  • Complete Linkage: The distance is the maximum distance between any point in one cluster and any point in the other. It tends to produce compact clusters and is less susceptible to noise.
  • Average Linkage: The distance is the average of the distances between all pairs of points across the two clusters.
  • Ward's Linkage: The distance is the increase in the total within-cluster variance that would result from merging the two clusters. It's a popular, robust choice.
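
To make the first three criteria concrete, here is a small sketch that computes them directly from the pairwise distances between two toy clusters (the point values are made up for illustration; Ward's linkage is omitted because it is defined via variances rather than pairwise distances):

import numpy as np

# Two toy clusters of 2-D points (values chosen for illustration)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])

# Matrix of distances between every point in A and every point in B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

print("single  :", pairwise.min())   # closest pair        -> 3.0
print("complete:", pairwise.max())   # farthest pair       -> 5.0
print("average :", pairwise.mean())  # mean over all pairs -> 4.0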

Visualizing with a Dendrogram

The real power of hierarchical clustering comes from its visualization: the dendrogram. This tree-like diagram shows the entire history of merges.

How to read a dendrogram:

  • The y-axis represents the distance (or dissimilarity) between clusters.
  • The x-axis represents the individual data points.
  • Horizontal lines are merges. The height of the horizontal line indicates the distance at which the two clusters were merged.
  • Long vertical lines indicate a large gap in distance before the next merge, i.e., a cluster that remains distinct over a wide range of distances. These are good candidates for "natural" clusters.

You can choose the final number of clusters by "cutting" the dendrogram at a certain height. The number of vertical lines your horizontal cut intersects is the number of clusters you will get.
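
In library terms, cutting for a fixed number of clusters is what scikit-learn's AgglomerativeClustering does when you pass n_clusters. A minimal sketch, assuming toy blob data:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data: three well-separated blobs (parameters are illustrative)
X, _ = make_blobs(n_samples=25, centers=3, random_state=42)

# Build the merge hierarchy with Ward's linkage, then "cut" it
# so that exactly 3 clusters remain
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X)
print(labels)  # cluster assignment (0, 1, or 2) for each point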

Python


# Build and plot a hierarchical clustering dendrogram with SciPy
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Generate sample data
X, _ = make_blobs(n_samples=25, centers=3, n_features=2, cluster_std=1.0, random_state=42)

# Perform hierarchical/agglomerative clustering using Ward's linkage
# 'linkage' returns a matrix that encodes the merges
linked = linkage(X, method='ward')

# Plot the dendrogram
plt.figure(figsize=(12, 7))
dendrogram(linked,
           orientation='top',
           labels=list(range(1, len(X) + 1)),
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plt.xlabel('Data Point Index')
plt.ylabel('Distance')
plt.show()
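
To turn a dendrogram cut into flat cluster labels, SciPy provides fcluster, which cuts the same linkage matrix either at a chosen height or for a desired number of clusters. A short sketch continuing from the linked matrix above (the cut height of 10 is arbitrary here; in practice, pick it by inspecting the dendrogram):

from scipy.cluster.hierarchy import fcluster

# Undo every merge that happened above distance 10; the surviving
# sub-trees become the flat clusters
labels_by_height = fcluster(linked, t=10, criterion='distance')

# Alternatively, request a fixed number of flat clusters directly
labels_by_count = fcluster(linked, t=3, criterion='maxclust')

print(labels_by_height)  # 1-based cluster label for each of the 25 points
print(labels_by_count)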