Unsupervised learning is all about finding hidden patterns in data without pre-existing labels. Clustering is a primary task in this domain, and K-Means is the go-to algorithm for many. Its goal is to group similar data points together into a specified number of clusters, 'k'.

How K-Means Works

The algorithm is surprisingly intuitive. After a random initialization, it alternates between two steps, assignment and update, until it settles on final cluster assignments:

  1. Initialization: First, randomly select 'k' data points from your dataset to act as the initial centroids (the center of a cluster).
  2. Assignment Step: For each data point, calculate its distance (usually Euclidean distance) to every centroid. Assign the data point to the cluster of the nearest centroid.
  3. Update Step: After assigning all points, recalculate the position of each of the 'k' centroids. The new centroid position is the mean of all data points assigned to that cluster.
  4. Repeat: Keep repeating the Assignment and Update steps until the centroids no longer move significantly. At this point, the algorithm has converged, and you have your final clusters.
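The loop above can be sketched in a few lines of NumPy. This is an illustrative, unoptimized version of the algorithm (Lloyd's iteration), not production code; the empty-cluster guard is one simple choice among several:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: random init, then assign/update until converged."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: Euclidean distance from every point to every centroid,
        #    then the index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (if a cluster emptied out, keep its old centroid as a simple guard)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you would reach for a library implementation (which also uses smarter initialization such as k-means++), but the sketch makes the assign/update rhythm concrete.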

Choosing 'k': The Elbow Method

The biggest question in K-Means is: "How do I choose the right value for 'k'?" The elbow method is a popular heuristic to help with this.

It's based on a metric called inertia, or the Within-Cluster Sum of Squares (WCSS). This is the sum of squared distances of samples to their closest cluster center. A smaller inertia means the points are more tightly packed within their clusters.
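To make the definition concrete, here is inertia computed by hand on a tiny made-up 1-D example (the points, labels, and centroids below are illustrative values, not from any real dataset):

```python
import numpy as np

# Toy example: two clusters of 1-D points with their assigned centroids
points = np.array([[1.0], [1.2], [0.8], [8.0], [8.3]])
labels = np.array([0, 0, 0, 1, 1])
centroids = np.array([[1.0], [8.15]])

# Inertia (WCSS): sum of squared distances from each point
# to the centroid of its assigned cluster
inertia = sum(
    np.sum((points[labels == j] - centroids[j]) ** 2)
    for j in range(len(centroids))
)
print(inertia)  # 0.08 from cluster 0 + 0.045 from cluster 1 = 0.125
```

scikit-learn exposes this same quantity as the fitted model's `inertia_` attribute, which is what the elbow method plots.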

Here's how the method works:

  1. Run the K-Means algorithm for a range of 'k' values (e.g., from 1 to 10).
  2. For each 'k', record the inertia value.
  3. Plot 'k' versus the inertia.

You will see a curve that typically drops sharply at first and then flattens out. The point where the curve bends, looking like an "elbow," is considered the optimal value for 'k'. This is the point of diminishing returns, where adding another cluster doesn't significantly reduce the total within-cluster variance.

# Python code with scikit-learn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1. Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# 2. Use the elbow method to find the optimal k
inertia = []
K = range(1, 11)
for k in K:
    kmeans_model = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_model.fit(X)
    inertia.append(kmeans_model.inertia_)

# Plot the elbow
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('The Elbow Method')
plt.show()

# 3. From the plot, k=4 looks optimal. Let's build the final model.
final_model = KMeans(n_clusters=4, random_state=42, n_init=10)
y_kmeans = final_model.fit_predict(X)

# 4. Visualize the clusters
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = final_model.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering (k=4)')
plt.show()