Many modern datasets have dozens, hundreds, or even thousands of features (dimensions). Such high-dimensional data is impossible to visualize directly and, thanks to the "curse of dimensionality", often harder to model. Dimensionality reduction is the process of reducing the number of features while preserving as much of the important information as possible. It is a key tool both for data visualization and for improving model performance.

We'll look at three popular techniques.

1. Principal Component Analysis (PCA)

PCA is a classic, linear technique that is fast and widely used. Its goal is to find the "principal components" of the data. These are new, artificial axes that point in the directions of the maximum variance in the data.

  • The first principal component (PC1) is the direction that captures the most variance.
  • The second (PC2) is the direction, orthogonal (perpendicular) to PC1, that captures the most remaining variance.
  • And so on...

By projecting the data onto the first few principal components (usually 2 or 3), we get a lower-dimensional representation that retains most of the significant "spread", or global structure, of the data. It's excellent for de-noising and as a preprocessing step for other ML algorithms.
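As a quick, minimal sketch (using scikit-learn's PCA on the same digits dataset used later in this post; the variance figures depend entirely on the data and scaling), the explained_variance_ratio_ attribute reports how much of the total variance each component captures:

Python

# Minimal sketch: check how much variance the first two principal
# components capture (numbers vary with the dataset and the scaling).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # fraction of variance per component
print(pca.explained_variance_ratio_.sum())  # total variance retained in 2D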

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a modern, non-linear technique designed almost exclusively for visualization. Unlike PCA, its goal is not to preserve global variance but to preserve local similarities.

It works by modeling the probability that two points are neighbors in the high-dimensional space. It then tries to create a low-dimensional "map" where these neighborhood probabilities are as similar as possible. In short: points that are close together in high dimensions should be close together in the 2D plot.

Important Caveat: The global arrangement of clusters in a t-SNE plot, the distances between them, and the apparent cluster sizes are often meaningless. Its strength is in revealing local structure: which points belong together.
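To see why, here is a minimal sketch (the perplexity values are illustrative, not recommendations) that embeds the same data twice with different perplexity settings; the overall layout and the gaps between clusters can change substantially, even though the data has not:

Python

# Minimal sketch: the same data embedded with two perplexity values.
# The within-cluster structure stays similar, but the global arrangement
# of clusters (and the gaps between them) can differ a lot.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data

emb_low = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
emb_high = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X)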

3. Uniform Manifold Approximation and Projection (UMAP)

UMAP is a newer, powerful competitor to t-SNE. It is often much faster and tends to preserve more of the data's global structure while still excelling at showing local similarities.

Like t-SNE, it is a non-linear method great for visualization, but its speed and better balance of local/global structure have made it a popular choice for general-purpose dimensionality reduction as well. For many visualization tasks, UMAP is now the recommended starting point.
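As a minimal sketch (the parameter values shown are umap-learn's defaults, used here purely for illustration), the two knobs you will most often touch are n_neighbors, which trades off local against global structure, and min_dist, which controls how tightly points are packed in the embedding:

Python

# Minimal sketch of the two most commonly tuned UMAP parameters.
# Small n_neighbors -> emphasize local detail; larger -> more global layout.
# min_dist controls how tightly points may be packed in the embedding.
import umap  # pip install umap-learn
from sklearn.datasets import load_digits

X = load_digits().data

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X)
print(embedding.shape)  # (1797, 2)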

Comparison in Code

Let's see how these three techniques visualize scikit-learn's handwritten digits dataset (8×8 images, a small cousin of the classic MNIST).

Python


# You might need to install umap-learn: pip install umap-learn
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# It's good practice to scale data before these techniques
X_scaled = StandardScaler().fit_transform(X)

# --- Apply the algorithms ---
# PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# t-SNE (note: n_iter was renamed to max_iter in scikit-learn 1.5+)
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP (fixing random_state makes the result reproducible)
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

# --- Plotting the results ---
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))

# PCA plot
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='Spectral', s=5)
ax1.set_title('PCA')

# t-SNE plot
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='Spectral', s=5)
ax2.set_title('t-SNE')

# UMAP plot
ax3.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='Spectral', s=5)
ax3.set_title('UMAP')

plt.show()

You'll notice that PCA shows some separation but with significant overlap, while t-SNE and UMAP form tight, well-separated clusters for most digits.
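If you want a rough number rather than an eyeball comparison, one option (a sketch, not a standard benchmark; the silhouette score against the known digit labels is only a proxy for visual separation) is to score each 2D embedding produced by the script above:

Python

# Reuses X_pca, X_tsne, X_umap, and y from the script above.
# Higher silhouette score = the digit classes separate more cleanly.
from sklearn.metrics import silhouette_score

for name, emb in [('PCA', X_pca), ('t-SNE', X_tsne), ('UMAP', X_umap)]:
    print(f"{name}: {silhouette_score(emb, y):.3f}")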