Imagine you're building a model to predict house prices using two features: number_of_rooms (ranging from 1 to 10) and square_footage (ranging from 500 to 5000). Many machine learning algorithms will be biased toward the square_footage feature simply because its values are much larger.

Feature scaling is the process of transforming your data to put all features on a similar scale. This ensures that no single feature dominates the model's learning process just because of its magnitude. It's a crucial preprocessing step for many algorithms.

When is Scaling Necessary?

Scaling is essential for algorithms that are sensitive to the distance between data points or use gradient descent for optimization. This includes:

  • Distance-Based Algorithms: K-Nearest Neighbors (kNN), Support Vector Machines (SVM), and clustering algorithms such as K-Means.
  • Gradient-Based Algorithms: Linear Regression, Logistic Regression, Neural Networks.
  • Note: Tree-based models like Decision Trees and Random Forests are generally not sensitive to feature scaling.
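To see why this matters, here is a small sketch (with hypothetical house data, not from any real dataset) showing how the unscaled square_footage feature dominates Euclidean distance, the quantity kNN and K-Means rely on:

```python
import numpy as np

# Each point is [number_of_rooms, square_footage] (hypothetical values)
a = np.array([3.0, 1500.0])
b = np.array([3.0, 1600.0])   # same rooms, area differs by only 100 sq ft
c = np.array([8.0, 1500.0])   # five more rooms, same area

# Raw Euclidean distances: square_footage dominates completely
print(np.linalg.norm(a - b))  # 100.0
print(np.linalg.norm(a - c))  # 5.0

# After dividing by each feature's range (rooms: 1-10, footage: 500-5000),
# the large room difference is the one that matters
ranges = np.array([9.0, 4500.0])
print(np.linalg.norm((a - b) / ranges))  # ~0.022
print(np.linalg.norm((a - c) / ranges))  # ~0.556
```

Before scaling, point c (a very different house) looks closer to a than point b does, purely because square footage has larger raw values; after scaling, the ordering matches intuition.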

1. Standardization (Z-score Scaling)

Standardization rescales the data to have a mean of 0 and a standard deviation of 1.

The formula for each feature is:

z = (x − μ) / σ

Where μ is the mean of the feature and σ is its standard deviation.

  • Key Property: The resulting distribution will be centered at 0. It does not bound the data to a specific range (you can have values like -3.5 or 2.8).
  • Best For: A solid default choice; it is less affected by outliers than min-max normalization.

Python


from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample Data: [Age, Salary]
data = np.array([[25, 50000], [45, 120000], [30, 75000], [50, 150000]])

# Initialize and apply the scaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

print("Original Data:\n", data)
print("\nStandardized Data:\n", standardized_data)
print(f"\nMean: {standardized_data.mean(axis=0).round(2)}, Standard Deviation: {standardized_data.std(axis=0).round(2)}")
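One practical note: in a real pipeline, fit the scaler on the training data only and reuse its learned statistics on the test set; re-fitting on test data leaks information. A minimal sketch, using the same hypothetical Age/Salary data with an assumed train/test split:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical split: the first four rows are "training" data
X_train = np.array([[25, 50000], [45, 120000], [30, 75000], [50, 150000]], dtype=float)
X_test = np.array([[35, 90000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; never re-fit on test

print(X_test_scaled)
```

Only the training set ends up with exactly mean 0 and standard deviation 1; the test set is shifted and scaled by the training statistics, which is what the model will see in production.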

2. Normalization (Min-Max Scaling)

Normalization rescales the data to a fixed range, typically between 0 and 1.

The formula for each feature is:

X_norm = (X − X_min) / (X_max − X_min)

  • Key Property: All values in the transformed feature will be squeezed into the [0, 1] interval.
  • Best For: Useful for algorithms that expect data in a bounded range, such as neural networks. However, it is very sensitive to outliers: a single extreme value stretches the observed min-max range, compressing all other data points into a narrow sub-interval.

Python


from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample Data: [Age, Salary]
data = np.array([[25, 50000], [45, 120000], [30, 75000], [50, 150000]])

# Initialize and apply the scaler
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

print("Original Data:\n", data)
print("\nNormalized Data:\n", normalized_data)
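To make the outlier sensitivity concrete, here is a small sketch with a hypothetical salary column containing one extreme value:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Hypothetical salaries: three typical values plus one extreme outlier
salaries = np.array([[50000.0], [60000.0], [55000.0], [1000000.0]])

scaled = MinMaxScaler().fit_transform(salaries)
print(scaled.ravel())
# The three typical salaries are squeezed into roughly [0, 0.011],
# while the outlier alone spans the rest of the [0, 1] range.
```

The typical values become nearly indistinguishable after scaling, which is why standardization (or an outlier-robust alternative) is often preferred when extreme values are present.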