Data cleaning (often grouped under data wrangling) is the process of detecting and correcting or removing corrupt, inaccurate, or incomplete records in a dataset. It's often said that data scientists spend around 80% of their time cleaning data. Let's cover three of the most common tasks.

1. Handling Missing Values

Real-world data is often incomplete; Pandas represents missing values as NaN (Not a Number). Your first step is to identify where and how much data is missing.

Python


import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Check for missing values (returns a boolean DataFrame)
print("\nIs Null Matrix:")
print(df.isnull())

# Count missing values in each column
print("\nMissing values per column:")
print(df.isnull().sum())
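
If you want the share of missing data rather than raw counts, one small sketch: chaining .mean() onto .isnull() treats each True as 1, giving the fraction of missing values per column.

Python


# Fraction of missing values per column (True counts as 1, False as 0)
print("\nFraction missing per column:")
print(df.isnull().mean())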

Strategy 1: Dropping Missing Values

Use .dropna() to remove rows or columns containing NaN. This is simple but can result in significant data loss if not used carefully.

Python


# Drop any row with at least one NaN
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with any NaN:")
print(df_dropped_rows)

# Drop any column with at least one NaN
df_dropped_cols = df.dropna(axis='columns') # or axis=1
print("\nDataFrame after dropping columns with any NaN:")
print(df_dropped_cols)
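
To limit that data loss, .dropna() also accepts a subset argument (only consider certain columns) and a thresh argument (keep rows with at least that many non-null values). A quick sketch using the same df as above:

Python


# Drop rows only if column 'A' is NaN
print("\nDrop rows where 'A' is NaN:")
print(df.dropna(subset=['A']))

# Keep rows that have at least 2 non-null values
print("\nKeep rows with at least 2 non-null values:")
print(df.dropna(thresh=2))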

Strategy 2: Filling Missing Values

Use .fillna() to replace NaN with a specific value. This is often a better approach because it keeps the rest of the row. Common strategies include filling with the mean, median, or mode of the column, or with a constant value such as 0.

Python


# Fill all NaNs with 0
df_filled_zero = df.fillna(0)
print("\nDataFrame after filling NaN with 0:")
print(df_filled_zero)

# Fill NaNs in column 'B' with the mean of column 'B'
mean_b = df['B'].mean()
df['B'] = df['B'].fillna(mean_b)
print("\nDataFrame after filling column B with its mean:")
print(df)
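
The same pattern works for the median or mode mentioned above. A short sketch on a fresh copy of the data (column 'B' in df has already been filled with its mean):

Python


df_fresh = pd.DataFrame(data)

# Fill column 'B' with its median (more robust to outliers than the mean)
df_fresh['B'] = df_fresh['B'].fillna(df_fresh['B'].median())

# Fill column 'A' with its mode (most frequent value); .mode() returns a Series
df_fresh['A'] = df_fresh['A'].fillna(df_fresh['A'].mode()[0])
print("\nDataFrame after median/mode filling:")
print(df_fresh)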

2. Handling Duplicates

Duplicate rows can skew your analysis. Pandas makes it easy to find and remove them.

Python


data_dups = {'team': ['A', 'B', 'C', 'B'], 'points': [10, 12, 15, 12]}
df_dups = pd.DataFrame(data_dups)
print("\nDataFrame with duplicates:")
print(df_dups)

# Find duplicate rows
print("\nAre rows duplicated?")
print(df_dups.duplicated())

# Remove duplicate rows
df_no_dups = df_dups.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_dups)
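
By default, .duplicated() and .drop_duplicates() compare entire rows. Both also accept a subset argument (compare only certain columns) and a keep argument ('first', 'last', or False). A brief sketch reusing df_dups:

Python


# Treat rows as duplicates based on 'team' alone, keeping the last occurrence
print("\nDeduplicated on 'team', keeping the last row:")
print(df_dups.drop_duplicates(subset=['team'], keep='last'))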

3. Identifying Outliers

Outliers are data points that differ significantly from other observations. They can be caused by measurement errors or represent genuine, rare occurrences. A simple way to identify them is the Interquartile Range (IQR) method:

  1. Calculate the 1st quartile (Q1, 25th percentile) and 3rd quartile (Q3, 75th percentile).
  2. Calculate the IQR: IQR = Q3 − Q1.
  3. Define the outlier boundaries:
  • Lower Bound: Q1 − 1.5 × IQR
  • Upper Bound: Q3 + 1.5 × IQR
  4. Any data point outside these bounds is a potential outlier.

Python


data_outliers = pd.Series([1, 10, 12, 14, 15, 16, 18, 20, 22, 100])
Q1 = data_outliers.quantile(0.25)
Q3 = data_outliers.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"\nQ1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

# Find the outliers
outliers = data_outliers[(data_outliers < lower_bound) | (data_outliers > upper_bound)]
print(f"Detected outliers: {outliers.tolist()}")