Count Duplicates In Pandas DataFrame In Python

While I was analyzing sales data for a US retail chain, I needed to identify duplicate transactions that had been accidentally entered twice. This is when I realized how important it is to know how to count duplicates in a Pandas DataFrame.

Duplicate data can significantly impact your analysis results, leading to incorrect conclusions and decisions.

In this tutorial, I will share four simple methods to count duplicates in a Pandas DataFrame, along with practical examples.

Table of Contents

Method to Count Duplicates in Pandas DataFrame in Python

Now, I will explain the methods to count duplicates in a pandas DataFrame in Python.

Method 1: Use value_counts() to Count Duplicate Values

Python value_counts() method is one of the simplest ways to count occurrences of each value in a column or series.

Let’s say we have a dataset of customer orders from different states:

import pandas as pd

# Sample data of customer orders from different states
data = {
    'customer_id': [101, 102, 103, 104, 105, 101, 106, 102, 107],
    'state': ['NY', 'CA', 'TX', 'NY', 'FL', 'NY', 'CA', 'CA', 'TX'],
    'amount': [150, 200, 75, 300, 250, 150, 100, 200, 175]
}

df = pd.DataFrame(data)
print(df)

Output:

   customer_id state  amount
0          101    NY     150
1          102    CA     200
2          103    TX      75
3          104    NY     300
4          105    FL     250
5          101    NY     150
6          106    CA     100
7          102    CA     200
8          107    TX     175

You can refer to the screenshot below to see the output.

To count duplicate customer IDs:

# Count occurrences of each customer_id
customer_counts = df['customer_id'].value_counts()
print("\nCustomer ID counts:")
print(customer_counts)

# Find customer_ids that appear more than once
duplicate_customers = customer_counts[customer_counts > 1]
print("\nDuplicate customers:")
print(duplicate_customers)

This gives us the count of each customer ID and shows which customers have made multiple orders.

Read Pandas Iterrows Update Value in Python

Method 2: Use duplicated() with sum() to Count Total Duplicates

When you want to know the total number of duplicate rows in your DataFrame, the duplicated() method in Python combined with sum() is very effective.

# Count total number of duplicate rows
duplicate_rows_count = df.duplicated().sum()
print(f"\nTotal number of duplicate rows: {duplicate_rows_count}")

# Count duplicates based on specific columns
duplicate_customer_states = df.duplicated(subset=['customer_id', 'state']).sum()
print(f"Duplicates based on customer_id and state: {duplicate_customer_states}")

Output:

Total number of duplicate rows: 2
Duplicates based on customer_id and state: 2

You can refer to the screenshot below to see the output.

The duplicated() method returns a Boolean Series where True indicates a duplicate row, and sum() counts the True values.

Method 3: Use groupby() with size() to Count Duplicates by Group

Sometimes you need to know how many times each combination of values appears in your dataset. Python groupby() method with size() is perfect for this scenario.

# Group by customer_id and state, then count occurrences
grouped_counts = df.groupby(['customer_id', 'state']).size().reset_index(name='count')
print("\nCounts by customer_id and state:")
print(grouped_counts)

# Filter to show only duplicates (count > 1)
duplicate_groups = grouped_counts[grouped_counts['count'] > 1]
print("\nDuplicate combinations:")
print(duplicate_groups)

Output:

Counts by customer_id and state:
   customer_id state  count
0          101    NY      2
1          102    CA      2
2          103    TX      1
3          104    NY      1
4          105    FL      1
5          106    CA      1
6          107    TX      1

Duplicate combinations:
   customer_id state  count
0          101    NY      2
1          102    CA      2

You can refer to the screenshot below to see the output.

This approach is especially useful when you want to know which specific combinations of values are duplicated.

Check out Pandas Iterrows in Python

Method 4: Use nunique() vs count() to Identify Columns with Duplicates

To identify which columns contain duplicate values, we can compare the number of unique values to the total count.

# Get total count and unique count for each column
total_counts = df.count()
unique_counts = df.nunique()
duplicate_info = pd.DataFrame({
    'total_count': total_counts,
    'unique_count': unique_counts,
    'duplicate_count': total_counts - unique_counts
})

print("\nDuplicate information by column:")
print(duplicate_info)

This gives you a quick overview of which columns have the most duplicate values.

Practical Example: Analyze E-commerce Order Data

Let’s use a more realistic example with e-commerce order data:

# Sample e-commerce order data
orders_data = {
    'order_id': ['A1001', 'A1002', 'A1003', 'A1004', 'A1005', 'A1003', 'A1006', 'A1002', 'A1007'],
    'customer_id': [201, 202, 203, 204, 205, 203, 206, 202, 207],
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Tablet', 'Mouse', 'Phone', 'Headphones'],
    'state': ['CA', 'NY', 'TX', 'FL', 'CA', 'TX', 'WA', 'NY', 'CA'],
    'amount': [1200, 800, 300, 250, 75, 300, 25, 800, 100]
}

orders_df = pd.DataFrame(orders_data)

# Count duplicate orders
duplicate_orders = orders_df.duplicated(subset=['order_id']).sum()
print(f"Number of duplicate order IDs: {duplicate_orders}")

# Find products that were ordered multiple times
product_counts = orders_df['product'].value_counts()
duplicate_products = product_counts[product_counts > 1]
print("\nProducts ordered multiple times:")
print(duplicate_products)

# Find customer IDs that placed multiple orders
customers_with_multiple_orders = orders_df.groupby('customer_id').size()
print("\nNumber of orders per customer:")
print(customers_with_multiple_orders)

This example helps identify duplicate orders, popular products, and customers who placed multiple orders, valuable insights for an e-commerce business.

Read Pandas Replace Multiple Values in Python

Tips for Handling Duplicates in Pandas

After counting duplicates, you might want to take action. Here are some useful tips:

To remove duplicates, use df.drop_duplicates():

# Remove all duplicate rows
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean_subset = df.drop_duplicates(subset=['customer_id', 'state'])

To keep specific occurrences of duplicates:

# Keep first occurrence
df_first = df.drop_duplicates(keep='first')

# Keep last occurrence
df_last = df.drop_duplicates(keep='last')

# Remove all duplicates (keep none)
df_none = df.drop_duplicates(keep=False)

For large datasets, consider using more efficient methods:

# For very large datasets, this can be more efficient
duplicate_count = len(df) - len(df.drop_duplicates())

Working with duplicates is a common task in data analysis, and Pandas provides multiple efficient ways to identify and count them.

The methods that I explained in this tutorial are: using value_counts(), duplicated() with sum(), groupby() with size(), and using unique() vs count() to identify columns with duplicates.

Count Duplicates in Pandas DataFrame in Python

Method to Count Duplicates in Pandas DataFrame in Python

Method 1: Use value_counts() to Count Duplicate Values

Method 2: Use duplicated() with sum() to Count Total Duplicates

Method 3: Use groupby() with size() to Count Duplicates by Group

Method 4: Use nunique() vs count() to Identify Columns with Duplicates

Practical Example: Analyze E-commerce Order Data

Tips for Handling Duplicates in Pandas

51 PYTHON PROGRAMS PDF FREE

Aspiring to be a Python developer?

Let’s be friends