Feature extraction vs selection

  • Extraction: create new features from raw ones (e.g., datetime → day/month/hour, text → TF-IDF); a short sketch follows this list.
  • Selection: pick a subset of existing features that are predictive (remove noisy or redundant features).
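
For instance, datetime extraction is a few lines of pandas (a minimal sketch; the df and signup_time names are hypothetical):

import pandas as pd
df = pd.DataFrame({"signup_time": pd.to_datetime(["2024-01-15 08:30", "2024-06-02 21:10"])})
df["day"] = df["signup_time"].dt.day      # extraction: derive new columns from one raw column
df["month"] = df["signup_time"].dt.month
df["hour"] = df["signup_time"].dt.hour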

Polynomial features & interactions

from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds squares and pairwise products; include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_numeric)

Useful when relationships are non-linear, but keep the degree low: with n input features, degree 2 already yields n + n(n+1)/2 output features, and the count grows combinatorially from there.
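
A quick check of that blow-up (a minimal sketch with random data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.random.rand(100, 10)  # 10 input features
for degree in (2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    print(degree, poly.fit_transform(X).shape[1])  # 65 features at degree 2, 285 at degree 3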

Domain knowledge

  • Often the best features come from domain insight (ratios, aggregated statistics, counts).
  • Example features: price_per_sqft, days_since_last_purchase, avg_session_time (sketched below).
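
A pandas sketch of such derived features; the column names (price, sqft, last_purchase, session_time) are hypothetical:

import pandas as pd
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "price": [300_000, 450_000, 200_000],
    "sqft": [1500, 1800, 1000],
    "last_purchase": pd.to_datetime(["2024-05-01", "2024-06-15", "2024-03-20"]),
    "session_time": [12.5, 8.0, 20.1],
})
df["price_per_sqft"] = df["price"] / df["sqft"]  # ratio
df["days_since_last_purchase"] = (pd.Timestamp.now() - df["last_purchase"]).dt.days
df["avg_session_time"] = df.groupby("user_id")["session_time"].transform("mean")  # per-user aggregate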

Dimensionality reduction

  • PCA: linear dimensionality reduction; projects onto the directions of maximum variance, preserving global structure.
from sklearn.decomposition import PCA
X_pca = PCA(n_components=10).fit_transform(X_scaled)  # fit_transform returns the projected data, not the fitted PCA object
  • t-SNE: non-linear, for visualizing high-dimensional embeddings (good for plots, not as input to downstream models).
  • UMAP: non-linear and faster than t-SNE; preserves local structure and more of the global structure; good for visualization and sometimes for clustering (sketch below).
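
A minimal visualization sketch; t-SNE ships with scikit-learn, while UMAP assumes the third-party umap-learn package:

from sklearn.manifold import TSNE
import umap  # pip install umap-learn
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_scaled)  # 2-D coordinates for plotting only
X_umap = umap.UMAP(n_components=2, n_neighbors=15).fit_transform(X_scaled)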

Feature selection tools

  • Filter methods: correlation threshold, mutual information.
  • Wrapper methods: recursive feature elimination (RFE).
  • Embedded methods: feature importance from tree models, L1 regularization.
from sklearn.feature_selection import SelectKBest, f_classif
# Filter method: keep the 20 features with the highest ANOVA F-score against the target
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
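
The wrapper and embedded bullets map to code along these lines; a hedged sketch assuming a classification task with X and y as above:

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
# Wrapper: recursively drop the weakest features according to the model's coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
# Embedded: keep features assigned non-zero weights under L1 regularization
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)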