Feature engineering is the process of creating new features (input variables) from existing ones to improve the performance of a machine learning model. Raw data, such as a timestamp or a block of text, is often not directly usable by a model. Extracting meaningful features, such as the day of the week from a timestamp, makes the underlying patterns more accessible to it.
Feature Extraction from Dates
Pandas has strong support for time series data. The first step is to make sure your date column has a datetime dtype rather than being stored as strings. You can then use the .dt accessor to extract components such as the year, month, and day of the week.
Python
import pandas as pd
df = pd.DataFrame({'purchase_date': ['2025-01-15', '2025-07-20', '2026-03-01']})
# 1. Convert to datetime objects
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
print("DataFrame with correct datetime type:")
df.info()
# 2. Use the .dt accessor to extract features
df['year'] = df['purchase_date'].dt.year
df['month'] = df['purchase_date'].dt.month
df['day_of_week'] = df['purchase_date'].dt.dayofweek # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6])
print("\nDataFrame with new date features:")
print(df)
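Real date columns often contain malformed entries, and some models benefit from features beyond the raw calendar fields. The following is a minimal sketch under stated assumptions: the sample data (including the 'not a date' entry) is hypothetical, NumPy is assumed to be installed, and the sine/cosine encoding of the month is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'not a date' stands in for a malformed entry
raw = pd.DataFrame(
    {'purchase_date': ['2025-01-15', '2025-07-20', 'not a date', '2026-03-01']}
)

# errors='coerce' turns unparseable values into NaT instead of raising an error
raw['purchase_date'] = pd.to_datetime(raw['purchase_date'], errors='coerce')

# Drop the rows that could not be parsed before deriving features
dates = raw.dropna(subset=['purchase_date']).copy()

# Additional calendar features from the .dt accessor
dates['quarter'] = dates['purchase_date'].dt.quarter
dates['day_name'] = dates['purchase_date'].dt.day_name()

# Cyclical encoding: places December next to January on a circle
month = dates['purchase_date'].dt.month
dates['month_sin'] = np.sin(2 * np.pi * month / 12)
dates['month_cos'] = np.cos(2 * np.pi * month / 12)

# Days elapsed since the earliest purchase in the data
first_purchase = dates['purchase_date'].min()
dates['days_since_first'] = (dates['purchase_date'] - first_purchase).dt.days

print(dates)
```

Whether to drop, flag, or impute the NaT rows depends on the dataset; dropping them here simply keeps the sketch short.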
Feature Extraction from Text
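For columns that hold strings (typically the object dtype), Pandas provides the .str accessor. It offers vectorized versions of most of Python's built-in string methods, applied to an entire Series at once, and it passes missing values through as NaN instead of raising an error.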
Python
import pandas as pd
data = {'product_name': ['Laptop Pro 15in', 'Wireless Mouse', 'USB-C Cable']}
df_text = pd.DataFrame(data)
# Get the length of each product name
df_text['name_length'] = df_text['product_name'].str.len()
# Convert to lowercase
df_text['name_lower'] = df_text['product_name'].str.lower()
# Check whether the name contains a substring (case-insensitive, literal match rather than regex)
df_text['is_cable'] = df_text['product_name'].str.contains('Cable', case=False, regex=False)
# Split the string and get the first word
df_text['first_word'] = df_text['product_name'].str.split().str[0]
print("DataFrame with new text features:")
print(df_text)
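When part of a name carries structured information, such as the screen size in 'Laptop Pro 15in', a regular expression can pull it out as a numeric feature. This is a sketch, not a general-purpose parser: the pattern assumes sizes are written as digits followed by "in", and the None entry is a hypothetical missing value added to show how the .str methods handle it.

```python
import pandas as pd

# Same product names as above, plus a hypothetical missing value
names = pd.Series(
    ['Laptop Pro 15in', 'Wireless Mouse', 'USB-C Cable', None],
    name='product_name',
)
features = pd.DataFrame({'product_name': names})

# Capture the digits that precede "in"; rows without a match become NaN
size_text = names.str.extract(r'(\d+)in', expand=False)
features['screen_inches'] = pd.to_numeric(size_text)

# Number of whitespace-separated words in each name
features['word_count'] = names.str.split().str.len()

# na=False makes missing names count as "not a cable" instead of NaN
features['is_cable'] = names.str.contains('cable', case=False, regex=False, na=False)

print(features)
```

Passing na=False keeps is_cable a clean boolean column, which matters if it is later used as a filter mask.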
These simple transformations can be powerful. For example, name_length could correlate with product complexity, and is_cable could help categorize products. For more advanced text analysis, you would typically move on to libraries such as NLTK (tokenization, stemming) or scikit-learn (TF-IDF vectorization), or to word embeddings from a dedicated NLP library.