Most machine learning algorithms, from linear regression to neural networks, operate on numerical data. They can't understand text categories like "Red," "Green," "Blue," or "New York," "London," "Tokyo." The process of converting this categorical data into numbers is called categorical encoding, and it's a critical step in feature engineering.

1. One-Hot Encoding

This is the most common and straightforward method, especially for nominal categories (where there is no intrinsic order).

The Idea: Create a new binary (0 or 1) column for each unique category. For each row, a '1' is placed in the column corresponding to its category, and '0's are placed everywhere else.

  • Pros:
      • Doesn't assume any order or ranking among the categories.
      • Keeps all of the original information.
  • Cons:
      • Can lead to the "curse of dimensionality" if a feature has many unique categories (e.g., a 'ZIP Code' column). This creates a huge number of new features, which can slow down training and hurt model performance.

Python


import pandas as pd

# Sample Data
df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'city': ['NYC', 'London', 'Tokyo', 'London', 'Tokyo']
})

# Use pandas get_dummies for one-hot encoding
one_hot_encoded_df = pd.get_dummies(df, columns=['color', 'city'], prefix=['color', 'city'])

print("Original DataFrame:")
print(df)
print("\nOne-Hot Encoded DataFrame:")
print(one_hot_encoded_df)
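One practical wrinkle with `get_dummies` is that separate train and test splits can produce different dummy columns if a category is missing from one of them. A minimal sketch (with made-up data) of aligning the test columns to the training columns:

```python
import pandas as pd

# Hypothetical train/test splits where 'Green' never appears in the test set
train = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue']})
test = pd.DataFrame({'color': ['Blue', 'Red', 'Red']})

train_encoded = pd.get_dummies(train, columns=['color'])

# Reindex the test dummies to the training columns:
# categories missing from test become all-zero columns
test_encoded = pd.get_dummies(test, columns=['color']).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(test_encoded)
```

Without the `reindex` step, a model fitted on the training columns would reject the test frame because its column set differs.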

2. Frequency / Count Encoding

This technique offers a simple alternative when a feature has too many unique values for one-hot encoding to be practical.

The Idea: Replace each category with the number of times it appears in the dataset (its frequency or count).

  • Pros:
      • Very simple and fast to implement.
      • Doesn't increase the number of features.
  • Cons:
      • Can lose valuable information if the frequency doesn't correlate with the target variable.
      • Different categories are assigned the same number if they happen to have the same frequency, making them indistinguishable to the model.

Python


import pandas as pd

df = pd.DataFrame({'browser': ['Chrome', 'Firefox', 'Chrome', 'Safari', 'Chrome', 'Firefox']})

# Calculate the frequency of each category
frequency_map = df['browser'].value_counts().to_dict()

# Map the frequencies back to the original column
df['browser_freq_encoded'] = df['browser'].map(frequency_map)

print(frequency_map)
print(df)
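In a train/test setting, the frequency map should be fitted on the training data only and then applied to new data, with some fallback for categories never seen during training. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data; 'Edge' only appears at prediction time
train = pd.DataFrame({'browser': ['Chrome', 'Firefox', 'Chrome', 'Safari']})
test = pd.DataFrame({'browser': ['Chrome', 'Edge']})

# Fit the frequency map on the training data only
frequency_map = train['browser'].value_counts().to_dict()

# Apply it to new data; unseen categories map to NaN, which we fill with 0
test['browser_freq_encoded'] = test['browser'].map(frequency_map).fillna(0)

print(test)
```

Filling unseen categories with 0 is one common choice; another is to fill with 1, treating a never-seen category like a very rare one.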

3. Target Encoding (Mean Encoding)

This is a powerful and advanced technique, often used in competitive machine learning.

The Idea: Replace each category with the average value of the target variable for that category. For example, if you're predicting customer churn, you could replace the category "New York" with the average churn rate of all customers from New York.

  • Pros:
      • Creates a very powerful and predictive feature that directly captures information from the target.
      • Doesn't increase the feature space.
  • Cons:
      • High risk of target leakage and overfitting. The model sees information about the target variable during training, which can lead it to perform unrealistically well on the training data but fail on new data.
      • Requires careful implementation (e.g., using a separate validation fold or adding smoothing) to prevent overfitting.

This method should be used with caution, especially by beginners.
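To make the smoothing idea concrete, here is a minimal sketch of smoothed target encoding on made-up churn data. The smoothing weight `m` is an arbitrary illustrative choice, and a real implementation would also compute the encoding out-of-fold to avoid the leakage discussed above:

```python
import pandas as pd

# Hypothetical sample data: city vs. a binary churn target
df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'London', 'London', 'London', 'Tokyo'],
    'churned': [1, 0, 1, 1, 0, 0]
})

# Global mean of the target, used as a prior for smoothing
global_mean = df['churned'].mean()

# Per-category mean and count
stats = df.groupby('city')['churned'].agg(['mean', 'count'])

# Smoothed target encoding: blend each category's mean with the global mean.
# 'm' controls how strongly rare categories are pulled toward the prior.
m = 2
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['city_target_encoded'] = df['city'].map(smoothed)
print(df)
```

Notice how 'Tokyo', which appears only once, is pulled strongly toward the global mean, while the better-represented cities stay closer to their own averages. That is exactly the behavior that makes smoothing a defense against overfitting on rare categories.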

Quiz