Feature extraction vs selection
- Extraction: create new features (e.g., datetime → day/month/hour, text → TF-IDF); see the sketch after this list.
- Selection: pick a subset of existing features that are predictive (remove noisy or redundant features).
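As a quick sketch of extraction, expanding a datetime column into calendar parts with pandas (the DataFrame and "timestamp" column name are hypothetical):

import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2024-01-05 08:30", "2024-06-17 22:15"])})

# Split the datetime into separate calendar features
df["day"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour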
Polynomial features & interactions
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_numeric)
Useful when relationships are non-linear, but keep the degree low: the number of generated features grows combinatorially with degree.
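To make the growth concrete, a small sketch (the 100×10 input shape is illustrative): 10 columns at degree 2 already become 65 (10 linear + 10 squared + 45 pairwise interactions).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.random.rand(100, 10)              # 10 numeric features
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
print(poly_demo.fit_transform(X_demo).shape)  # (100, 65)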
Domain knowledge
- Often the best features come from domain insight (ratios, aggregated statistics, counts).
- Example features: price_per_sqft, days_since_last_purchase, avg_session_time (two of these are sketched below).
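A hedged pandas sketch of how such features might be derived (table and column names are hypothetical):

import pandas as pd

# Ratio feature from a listings table
listings = pd.DataFrame({"price": [300_000, 450_000], "sqft": [1200, 1500]})
listings["price_per_sqft"] = listings["price"] / listings["sqft"]

# Recency feature from a purchase log, aggregated per customer
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "purchase_date": pd.to_datetime(["2024-01-10", "2024-03-02", "2024-02-20"]),
})
last_purchase = purchases.groupby("customer_id")["purchase_date"].max()
days_since_last_purchase = (pd.Timestamp.today() - last_purchase).dt.days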
Dimensionality reduction
- PCA: linear dimensionality reduction; projects data onto the directions of maximum variance, preserving global structure.
from sklearn.decomposition import PCA

X_pca = PCA(n_components=10).fit_transform(X_scaled)
- t-SNE: non-linear, for visualization of high-dimensional embeddings (good for plots, not for downstream models).
- UMAP: faster than t-SNE and better at preserving global structure alongside local neighborhoods; good for visualization and sometimes as input to clustering.
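A minimal sketch of both, reusing X_scaled from the PCA snippet (umap-learn is a separate install; the parameters shown are common starting points, not tuned values):

from sklearn.manifold import TSNE
import umap  # pip install umap-learn

# t-SNE: 2-D embedding for plotting, not as model input
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_scaled)

# UMAP: same call pattern; embeddings are sometimes reused for clustering
X_umap = umap.UMAP(n_components=2, n_neighbors=15).fit_transform(X_scaled)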
Feature selection tools
- Filter methods: correlation threshold, mutual information.
- Wrapper methods: recursive feature elimination (RFE).
- Embedded methods: feature importance from tree models, L1 regularization.
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
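For comparison, hedged sketches of the wrapper and embedded approaches listed above (the estimators and thresholds are illustrative choices, not prescribed ones):

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Wrapper: RFE refits the model, dropping the weakest features each round
X_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20).fit_transform(X, y)

# Embedded: keep features whose tree importance exceeds the median importance
sfm = SelectFromModel(RandomForestClassifier(n_estimators=200), threshold="median")
X_embedded = sfm.fit_transform(X, y)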