Aggregation is the process of summarizing data. For example, calculating the total sales per region or the average score per class. Pandas provides two incredibly powerful tools for this: groupby() and pivot_table().

The Split-Apply-Combine Strategy with groupby()

The groupby() operation follows a three-step process coined by Hadley Wickham:

  1. Split: Break the data into groups based on some criteria (e.g., group all rows with the same country).
  2. Apply: Apply a function to each group independently (e.g., calculate the mean of the population for each country group).
  3. Combine: Combine the results into a new data structure.

Python


import pandas as pd

data = {
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [70000, 50000, 80000, 75000, 82000, 55000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# 1. Split the data by 'Department'
grouped = df.groupby('Department')

# 2. Apply an aggregation function (e.g., mean) and Combine
avg_salary_by_dept = grouped['Salary'].mean()
print("\nAverage salary by department:")
print(avg_salary_by_dept)

The grouped object itself is a DataFrameGroupBy object. The magic happens when you apply an aggregation function to it. You can apply many functions: .sum(), .count(), .max(), .min(), etc.

You can also apply multiple aggregations at once using the .agg() method.

Python


# Apply multiple aggregations
aggregations = grouped['Salary'].agg(['mean', 'sum', 'count'])
print("\nMultiple aggregations by department:")
print(aggregations)

Reshaping Data with pivot_table()

A pivot table is a data summarization tool that rotates or "pivots" data to present it from a different perspective. It's fantastic for summarizing data across two or more categorical variables.

A pivot_table() has four key arguments:

  • values: The column to be aggregated.
  • index: The column whose unique values will become the rows of the new table.
  • columns: The column whose unique values will become the columns of the new table.
  • aggfunc: The aggregation function to apply (default is mean).

Python


data_sales = {
    'Date': pd.to_datetime(['2025-01-05', '2025-01-05', '2025-01-06', '2025-01-06']),
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [100, 150, 120, 180]
}
df_sales = pd.DataFrame(data_sales)
df_sales['Weekday'] = df_sales['Date'].dt.day_name()
print("\nSales Data:")
print(df_sales)

# Create a pivot table to see sales by region for each weekday
pivot = pd.pivot_table(
    df_sales,
    values='Sales',
    index='Region',
    columns='Weekday',
    aggfunc='sum'
)
print("\nPivot Table (Total Sales by Region and Weekday):")
print(pivot)