Aggregation is the process of summarizing data. For example, calculating the total sales per region or the average score per class. Pandas provides two incredibly powerful tools for this: groupby() and pivot_table().
The Split-Apply-Combine Strategy with groupby()
The groupby() operation follows a three-step process coined by Hadley Wickham:
- Split: Break the data into groups based on some criteria (e.g., group all rows with the same country).
- Apply: Apply a function to each group independently (e.g., calculate the mean of the population for each country group).
- Combine: Combine the results into a new data structure.
Python
import pandas as pd
data = {
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [70000, 50000, 80000, 75000, 82000, 55000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# 1. Split the data by 'Department'
grouped = df.groupby('Department')
# 2. Apply an aggregation function (e.g., mean) and Combine
avg_salary_by_dept = grouped['Salary'].mean()
print("\nAverage salary by department:")
print(avg_salary_by_dept)
The grouped object itself is a DataFrameGroupBy object. The magic happens when you apply an aggregation function to it. You can apply many functions: .sum(), .count(), .max(), .min(), etc.
You can also apply multiple aggregations at once using the .agg() method.
Python
# Apply multiple aggregations
aggregations = grouped['Salary'].agg(['mean', 'sum', 'count'])
print("\nMultiple aggregations by department:")
print(aggregations)
Reshaping Data with pivot_table()
A pivot table is a data summarization tool that rotates or "pivots" data to present it from a different perspective. It's fantastic for summarizing data across two or more categorical variables.
A pivot_table() has four key arguments:
- values: The column to be aggregated.
- index: The column whose unique values will become the rows of the new table.
- columns: The column whose unique values will become the columns of the new table.
- aggfunc: The aggregation function to apply (default is mean).
Python
data_sales = {
    'Date': pd.to_datetime(['2025-01-05', '2025-01-05', '2025-01-06', '2025-01-06']),
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [100, 150, 120, 180]
}
df_sales = pd.DataFrame(data_sales)
df_sales['Weekday'] = df_sales['Date'].dt.day_name()
print("\nSales Data:")
print(df_sales)
# Create a pivot table to see sales by region for each weekday
pivot = pd.pivot_table(
    df_sales,
    values='Sales',
    index='Region',
    columns='Weekday',
    aggfunc='sum'
)
print("\nPivot Table (Total Sales by Region and Weekday):")
print(pivot)