How to count duplicates in Pandas dataframe in Python [4 Methods]

Do you want to count the duplicates in a dataframe? In this Pandas tutorial, I will show you how to count duplicates in a Pandas dataframe in Python using different methods, with some illustrative examples.

To count duplicates in a Pandas DataFrame in Python, one can utilize methods like df.duplicated() for marking duplicates, groupby() combined with size() for counting occurrences, df.pivot_table() for aggregating duplicate counts, and a custom function leveraging collections.Counter for more tailored duplicate detection.

Count duplicates in Pandas dataframe in Python

Four different methods can help us count duplicates in Pandas dataframe in Python, which are as follows:

  1. df.duplicated() function
  2. Using groupby() with size() function
  3. df.pivot_table() function
  4. Custom function using collections.Counter

Let’s see them one by one using some illustrative examples:

1. Pandas count duplicates using df.duplicated() function

The df.duplicated() method in Pandas identifies duplicate rows in a DataFrame. It returns a Boolean Series where True marks a row that repeats an earlier row. By default, the first occurrence of each row is marked False and only the later copies are marked True. Because True counts as 1, calling sum() on that Series returns the number of duplicate rows in the Pandas dataframe.

Here is the full code to count duplicates in Pandas dataframe in Python:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)
duplicates = df.duplicated()
print(df)
num_duplicates = duplicates.sum()
print(f"Number of duplicate rows: {num_duplicates}")

Output:

    Name  Age
0  Alice   25
1    Bob   30
2  Alice   25
3  David   35
4    Bob   30
Number of duplicate rows: 2
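
The count above depends on how df.duplicated() is asked to mark rows. As a minimal sketch, the code below reuses the same data plus one extra row ('Alice', 40) that I added only for illustration, and compares the default with the keep and subset parameters. The variable names are my own.

```python
import pandas as pd

df = pd.DataFrame({
    # The last row ("Alice", 40) is a hypothetical addition for illustration.
    "Name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "Age": [25, 30, 25, 35, 30, 40],
})

# Default (keep="first"): the first occurrence is not flagged, only later copies.
extra_copies = int(df.duplicated().sum())

# keep=False flags every row that has at least one identical row elsewhere.
rows_in_duplicate_groups = int(df.duplicated(keep=False).sum())

# subset= restricts the comparison to the listed columns.
repeated_names = int(df.duplicated(subset=["Name"]).sum())

print(extra_copies, rows_in_duplicate_groups, repeated_names)
```

Which number is the "duplicate count" depends on the question being asked: surplus copies to drop, rows involved in any duplication, or repeats of a key column.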

2. Count duplicates in Pandas dataframe using groupby() with size() function

The groupby() function groups the DataFrame by all or selected columns, and applying size() to the result counts the rows in each group. This gives the number of times each unique combination of values occurs, so any combination with a count greater than 1 is duplicated.

Let’s take an example to count duplicates in Pandas dataframe in Python using groupby() with size() function:

import pandas as pd

data = {'Order_ID': [1001, 1002, 1003, 1004, 1001, 1002],
        'State': ['California', 'Texas', 'New York', 'Florida', 'California', 'Texas'],
        'Amount': [250, 150, 200, 300, 250, 150]}
df = pd.DataFrame(data)
dup_count = df.groupby(['Order_ID', 'State', 'Amount']).size().reset_index(name='Count')
print(dup_count)

Output:

   Order_ID       State  Amount  Count
0      1001  California     250      2
1      1002       Texas     150      2
2      1003    New York     200      1
3      1004     Florida     300      1
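
The table above lists every combination, including those that occur only once. As a small sketch on the same data, the code below keeps only the repeated combinations and reduces them to a single number. The names repeated and surplus_rows are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "Order_ID": [1001, 1002, 1003, 1004, 1001, 1002],
    "State": ["California", "Texas", "New York", "Florida", "California", "Texas"],
    "Amount": [250, 150, 200, 300, 250, 150],
})

# Group by every column and count the rows in each group.
counts = df.groupby(list(df.columns)).size().reset_index(name="Count")

# Keep only the combinations that occur more than once.
repeated = counts[counts["Count"] > 1]

# Each repeated combination contributes (Count - 1) surplus rows.
surplus_rows = int((repeated["Count"] - 1).sum())

print(repeated)
print(surplus_rows)
```

Note that groupby() leaves out groups whose keys contain NaN by default. In pandas 1.1 and later, passing dropna=False to groupby() keeps them.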

3. Count duplicate rows Pandas dataframe using df.pivot_table() function

df.pivot_table() is a versatile function used to reshape data. When used for counting duplicates with aggfunc='size', it aggregates the rows by the index columns and counts how often each unique combination occurs. This method is useful for seeing the count of duplicates for specific combinations of column values.

Here is how we can use the df.pivot_table() function to count duplicates in Pandas dataframe in Python:

import pandas as pd

data = {'Flight_Number': ['AA101', 'DL202', 'UA303', 'AA101', 'DL202'],
        'From_City': ['Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'New York'],
        'To_City': ['Chicago', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)
pivot_table = df.pivot_table(index=['Flight_Number', 'From_City', 'To_City'], aggfunc='size').reset_index(name='Count')
print(pivot_table)

Output:

  Flight_Number    From_City      To_City  Count
0         AA101  Los Angeles      Chicago      2
1         DL202     New York  Los Angeles      2
2         UA303      Chicago     New York      1
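
The index list does not have to contain every column. As a sketch, the code below counts repeated routes regardless of flight number. The extra flight 'AA105' is a hypothetical row I added so that the route count differs from the full-row count.

```python
import pandas as pd

df = pd.DataFrame({
    # "AA105" is a hypothetical extra flight on an existing route.
    "Flight_Number": ["AA101", "DL202", "UA303", "AA101", "DL202", "AA105"],
    "From_City": ["Los Angeles", "New York", "Chicago",
                  "Los Angeles", "New York", "Los Angeles"],
    "To_City": ["Chicago", "Los Angeles", "New York",
                "Chicago", "Los Angeles", "Chicago"],
})

# Count rows per (From_City, To_City) pair; Flight_Number is ignored.
route_counts = (
    df.pivot_table(index=["From_City", "To_City"], aggfunc="size")
      .reset_index(name="Count")
)

print(route_counts)
```

Here the Los Angeles to Chicago route is counted three times even though only two of those rows are identical across all columns.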

4. Pandas count duplicate rows using a custom function with collections.Counter

We can also use Counter from the collections module. This custom method converts each DataFrame row to a tuple and counts how often each tuple occurs. Because the counting happens in plain Python, it is easy to adjust what counts as a duplicate, for example by normalizing values before the rows are counted.

For instance:

import pandas as pd
from collections import Counter

data = {'Employee_ID': [1, 2, 1, 3],
        'Name': ['John', 'Emma', 'John', 'Oliver'],
        'State': ['California', 'Texas', 'California', 'Florida']}
df = pd.DataFrame(data)


def count_duplicates(df):
    # Convert each row to a hashable tuple so Counter can tally it
    rows = [tuple(row) for row in df.to_numpy()]
    return Counter(rows)

duplicate_counts = count_duplicates(df)
print(duplicate_counts)

Output:

Counter({(1, 'John', 'California'): 2, (2, 'Emma', 'Texas'): 1, (3, 'Oliver', 'Florida'): 1})
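
The Counter above reports every row, including unique ones. As a rough sketch on the same data, the code below reduces it to the number of surplus rows and to the repeated rows only. It uses itertuples(index=False, name=None) instead of to_numpy() to build the tuples, which is my substitution and not part of the function above.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [1, 2, 1, 3],
    "Name": ["John", "Emma", "John", "Oliver"],
    "State": ["California", "Texas", "California", "Florida"],
})

# itertuples(index=False, name=None) yields one plain tuple per row.
row_counts = Counter(df.itertuples(index=False, name=None))

# Rows beyond the first occurrence of each distinct tuple.
surplus_rows = sum(count - 1 for count in row_counts.values())

# Only the row values that appear more than once.
repeated = {row: count for row, count in row_counts.items() if count > 1}

print(surplus_rows)
print(repeated)
```

One caveat: rows that contain NaN may not be treated as equal by Counter, whereas df.duplicated() does match them, so the two approaches can disagree on data with missing values.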

Conclusion

Here, I have explained four effective methods to count duplicates in Pandas dataframe in Python, including using df.duplicated(), groupby() with size(), df.pivot_table(), and a custom function using collections.Counter.

Each method suits a different need: df.duplicated() gives a single total, groupby() with size() and df.pivot_table() give a count per combination of values, and collections.Counter gives full control in plain Python.
