Feature Extraction from Dates & Text

Feature engineering is the process of creating new features (input variables) from existing ones to improve the performance of a machine learning model. Raw data, such as a timestamp or a block of text, is often not directly usable by a model. Extracting features such as the day of the week or the length of a string exposes the underlying patterns in a form the model can use.

Feature Extraction from Dates

Pandas has strong support for time series data. The first step is to make sure your date column has a datetime dtype rather than object (string). Then you can use the .dt accessor to extract components such as the year, month, and day of the week.

Python

import pandas as pd

df = pd.DataFrame({'purchase_date': ['2025-01-15', '2025-07-20', '2026-03-01']})

# 1. Convert to datetime objects
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
print("DataFrame with correct datetime type:")
df.info()

# 2. Use the .dt accessor to extract features
df['year'] = df['purchase_date'].dt.year
df['month'] = df['purchase_date'].dt.month
df['day_of_week'] = df['purchase_date'].dt.dayofweek # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6])

print("\nDataFrame with new date features:")
print(df)
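
Real date columns often contain values that cannot be parsed, and the .dt accessor offers more than the year, month, and weekday. The following is a minimal sketch, not part of the original example: the sample values are hypothetical, and it assumes you want unparseable entries converted to NaT (via errors='coerce') rather than raising an error.

```python
import pandas as pd

# Hypothetical sample data; the last value is deliberately malformed.
raw = pd.Series(['2025-01-15', '2025-07-20', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising.
dates = pd.to_datetime(raw, errors='coerce')

# A few more features available through the .dt accessor.
features = pd.DataFrame({
    'purchase_date': dates,
    'quarter': dates.dt.quarter,                # 1-4, NaN where the date is missing
    'day_name': dates.dt.day_name(),            # e.g. 'Wednesday'
    'is_month_start': dates.dt.is_month_start,  # True on the 1st of a month
    'is_missing': dates.isna(),                 # flag rows that failed to parse
})

print(features)
```

Keeping an explicit is_missing flag lets a downstream model distinguish "no valid date" from any real calendar value.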

Feature Extraction from Text

For columns that hold strings (the object or string dtype), Pandas provides the .str accessor. It exposes vectorized versions of most of Python's built-in string methods, so you can apply them to an entire Series at once.

Python

import pandas as pd

data = {'product_name': ['Laptop Pro 15in', 'Wireless Mouse', 'USB-C Cable']}
df_text = pd.DataFrame(data)

# Get the length of each product name
df_text['name_length'] = df_text['product_name'].str.len()

# Convert to lowercase
df_text['name_lower'] = df_text['product_name'].str.lower()

# Check if the name contains a substring
df_text['is_cable'] = df_text['product_name'].str.contains('Cable', case=False)

# Split the string and get the first word
df_text['first_word'] = df_text['product_name'].str.split().str[0]

print("DataFrame with new text features:")
print(df_text)

These simple transformations can be surprisingly useful. For example, name_length might correlate with product complexity, and is_cable could help categorize products. For more advanced text analysis, you would typically move on to libraries such as NLTK or scikit-learn for techniques like TF-IDF or word embeddings.
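
The .str accessor also supports regular expressions, which helps when a feature is embedded inside a longer string. The sketch below reuses the product names from the example above. The column names word_count and size_inches are illustrative, and the pattern assumes sizes are written as digits followed by "in", which may not hold for real product data.

```python
import pandas as pd

df_text = pd.DataFrame(
    {'product_name': ['Laptop Pro 15in', 'Wireless Mouse', 'USB-C Cable']}
)

# Count whitespace-separated words in each name.
df_text['word_count'] = df_text['product_name'].str.split().str.len()

# Capture digits immediately followed by "in" (assumed size format).
# expand=False returns a Series; rows without a match become NaN.
size = df_text['product_name'].str.extract(r'(\d+)in\b', expand=False)

# The captured text is still a string, so convert it to a number.
df_text['size_inches'] = pd.to_numeric(size)

print(df_text)
```

Rows with no match stay as NaN, so you can decide separately whether to fill, flag, or drop them.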
