The features you start with are not always enough. Sometimes, the most predictive information is hidden in the relationships between features. Creating new features from existing ones is a hallmark of effective machine learning. This is especially important for linear models (like Linear and Logistic Regression), which, by themselves, can only capture linear patterns.
1. Interaction Features
An interaction feature captures the combined effect of two or more features. You typically create one by multiplying two features together.
Example: Imagine you're predicting how much a user will spend on an e-commerce site. You have the features time_on_site and number_of_items_in_cart.
- A user who spends a lot of time on the site might spend more.
- A user who has many items in their cart might spend more.
But what about a user who spends a lot of time on the site AND has many items in their cart? This interaction could be much more predictive than either feature alone. We can create a new feature: interaction = time_on_site * number_of_items_in_cart.
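As a minimal sketch (the feature names and values here are hypothetical, chosen to match the e-commerce example above), the interaction is just the element-wise product of the two columns:

```python
import numpy as np

# Hypothetical data: time on site (minutes) and items currently in the cart.
time_on_site = np.array([5.0, 30.0, 12.0])
items_in_cart = np.array([1, 4, 2])

# The interaction feature is the element-wise product of the two features.
# A user with high time AND a full cart gets a disproportionately large value.
interaction = time_on_site * items_in_cart

print(interaction)  # the products: 5.0, 120.0, 24.0
```

This new column can be appended to the feature matrix and fed to the model alongside the originals.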
2. Polynomial Features
Linear models can only fit a straight line to the data. But what if the data has a curve?
By creating polynomial features, we can help a linear model fit non-linear data. We create new features that are powers of the original features (e.g., x², x³) and combinations of them (e.g., x₁·x₂). Adding a feature like x² to a simple linear regression model (y = β₁x + β₀) turns it into a quadratic model (y = β₂x² + β₁x + β₀), allowing it to fit a curve.
```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Sample data
X = np.arange(6).reshape(3, 2)
print("Original X:\n", X)

# Create polynomial features of degree 2.
# For input columns [a, b], this generates [1, a, b, a^2, a*b, b^2].
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(X)
print("\nPolynomial Features (degree=2):\n", poly_features)
```
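To see the payoff, here is a sketch (on synthetic, noise-free quadratic data, so the near-perfect fit is expected) of a linear model fitting a curve once degree-2 features are added:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data: y = x^2, which a straight line cannot fit.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = (X ** 2).ravel()

# Expanding X with degree-2 features lets LinearRegression model the curve.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.score(X, y))  # R^2 close to 1.0 on this exact quadratic
```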
3. Feature Crosses
A feature cross is a specific type of interaction feature that is created by crossing two or more categorical features.
Example: Suppose you are building a model for a food delivery app and have the features time_of_day (with categories 'Morning', 'Afternoon', 'Night') and day_of_week (with categories 'Weekday', 'Weekend').
People's ordering habits might be very different on a 'Weekend' AND at 'Night' compared to a 'Weekday' at 'Night'. By creating a feature cross, you create a new feature like day_and_time with categories like 'Weekday-Night' and 'Weekend-Night', allowing the model to learn a specific weight for that unique combination. This is a very powerful technique for recommendation systems and ad-click prediction models.
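A common way to build the cross is simple string concatenation of the two categorical columns, followed by one-hot encoding. A minimal sketch with hypothetical data matching the delivery-app example:

```python
import pandas as pd

# Hypothetical orders with two categorical features.
df = pd.DataFrame({
    "time_of_day": ["Morning", "Night", "Night"],
    "day_of_week": ["Weekday", "Weekday", "Weekend"],
})

# Cross the two categories into a single combined category.
df["day_and_time"] = df["day_of_week"] + "-" + df["time_of_day"]

# One-hot encode the crossed column so the model learns a weight
# for each unique combination, e.g. 'Weekend-Night'.
crossed = pd.get_dummies(df["day_and_time"])
print(list(crossed.columns))
```

Only combinations that actually occur in the data get a column, which keeps the encoding smaller than the full Cartesian product of the two categories.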
Caution: Creating too many of these new features can quickly lead to an explosion in dimensionality and cause your model to overfit. It's a balance between capturing complex relationships and keeping the model simple.