GroupBy, Pivot Tables & Aggregation Patterns

Aggregation is the process of summarizing data. For example, calculating the total sales per region or the average score per class. Pandas provides two incredibly powerful tools for this: groupby() and pivot_table().

The Split-Apply-Combine Strategy with groupby()

The groupby() operation follows a three-step process coined by Hadley Wickham:

Split: Break the data into groups based on some criteria (e.g., group all rows with the same country).
Apply: Apply a function to each group independently (e.g., calculate the mean of the population for each country group).
Combine: Combine the results into a new data structure.

Python

import pandas as pd

data = {
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [70000, 50000, 80000, 75000, 82000, 55000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# 1. Split the data by 'Department'
grouped = df.groupby('Department')

# 2. Apply an aggregation function (e.g., mean) and Combine
avg_salary_by_dept = grouped['Salary'].mean()
print("\nAverage salary by department:")
print(avg_salary_by_dept)

The grouped object itself is a DataFrameGroupBy object. The magic happens when you apply an aggregation function to it. You can apply many functions: .sum(), .count(), .max(), .min(), etc.

You can also apply multiple aggregations at once using the .agg() method.

Python

# Apply multiple aggregations
aggregations = grouped['Salary'].agg(['mean', 'sum', 'count'])
print("\nMultiple aggregations by department:")
print(aggregations)

Reshaping Data with pivot_table()

A pivot table is a data summarization tool that rotates or "pivots" data to present it from a different perspective. It's fantastic for summarizing data across two or more categorical variables.

A pivot_table() has four key arguments:

values: The column to be aggregated.
index: The column whose unique values will become the rows of the new table.
columns: The column whose unique values will become the columns of the new table.
aggfunc: The aggregation function to apply (default is mean).

Python

data_sales = {
    'Date': pd.to_datetime(['2025-01-05', '2025-01-05', '2025-01-06', '2025-01-06']),
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [100, 150, 120, 180]
}
df_sales = pd.DataFrame(data_sales)
df_sales['Weekday'] = df_sales['Date'].dt.day_name()
print("\nSales Data:")
print(df_sales)

# Create a pivot table to see sales by region for each weekday
pivot = pd.pivot_table(
    df_sales,
    values='Sales',
    index='Region',
    columns='Weekday',
    aggfunc='sum'
)
print("\nPivot Table (Total Sales by Region and Weekday):")
print(pivot)

LearnCodePro

GroupBy, Pivot Tables & Aggregation Patterns

The Split-Apply-Combine Strategy with groupby()

Reshaping Data with pivot_table()

NumPy Basics: Arrays, Broadcasting & Vectorization

Pandas — Series & DataFrame Basics

Pandas — Indexing, Selection & Boolean Masks

Data Cleaning — Missing Values, Duplicates & Outliers

Feature Extraction from Dates & Text

Quick Navigation

This Series

Topics in Data Science, Machine Learning & AI

Categories

Learn More

Want to Track Your Progress?

The Split-Apply-Combine Strategy with groupby()

Reshaping Data with pivot_table()

More in Data Wrangling & EDA (Pandas / NumPy)

NumPy Basics: Arrays, Broadcasting & Vectorization

Pandas — Series & DataFrame Basics

Pandas — Indexing, Selection & Boolean Masks

Data Cleaning — Missing Values, Duplicates & Outliers

Feature Extraction from Dates & Text

Quick Navigation

This Series

Topics in Data Science, Machine Learning & AI

Categories

Learn More

Want to Track Your Progress?