Pandas find duplicates in Python [5 Examples]

Do you want to find duplicates in a DataFrame? In this Python tutorial, I will show you how to find duplicates with Pandas in Python, using different scenarios with examples.

To find duplicates in Pandas effectively, we use the duplicated() method with its various parameters. This approach covers detecting duplicate rows across all columns, identifying the last occurrences of duplicates, focusing on single or multiple columns, and sorting values to make duplicates easier to inspect.

Pandas find duplicates in Python

In Pandas, duplicates are rows in a DataFrame that have identical values in all columns or a specified subset of columns. Duplicates can arise during data collection, merging datasets, or as a result of data entry errors.

To find duplicate rows in a DataFrame, Pandas provides the duplicated() method. This method returns a Boolean Series indicating whether each row is a duplicate or not.

The syntax is:

DataFrame.duplicated(subset=None, keep='first')

Here,

  • subset: Column names to consider for identifying duplicates. By default, all columns are used.
  • keep: Determines which duplicates to mark.
    • ‘first’: Mark all duplicates except the first occurrence as True.
    • ‘last’: Mark all duplicates except the last occurrence as True.
    • False: Mark all duplicates as True.

For example:

import pandas as pd

data = {
    'Employee ID': ['001', '002', '003', '001', '004', '002'],
    'Name': ['John Doe', 'Jane Smith', 'Alice Jones', 'John Doe', 'Bob Brown', 'Jane Smith'],
    'Department': ['HR', 'Marketing', 'IT', 'HR', 'IT', 'Marketing'],
    'State': ['NY', 'CA', 'TX', 'NY', 'NY', 'CA']
}
df = pd.DataFrame(data)
# Mark a row as True if an identical row appeared earlier (keep='first' is the default)
duplicates = df.duplicated()
print(duplicates)

Output: the duplicated() function returns True for every repeated occurrence of a row after its first appearance.

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
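
To see how the keep parameter changes the result, here is a minimal sketch that reuses the employee DataFrame from above and simply compares the three options (only the comparison itself is new; the data is unchanged):

import pandas as pd

data = {
    'Employee ID': ['001', '002', '003', '001', '004', '002'],
    'Name': ['John Doe', 'Jane Smith', 'Alice Jones', 'John Doe', 'Bob Brown', 'Jane Smith'],
    'Department': ['HR', 'Marketing', 'IT', 'HR', 'IT', 'Marketing'],
    'State': ['NY', 'CA', 'TX', 'NY', 'NY', 'CA']
}
df = pd.DataFrame(data)

# keep='first' (default): later occurrences are flagged as duplicates
print(df.duplicated(keep='first'))

# keep='last': earlier occurrences are flagged, the last one is kept
print(df.duplicated(keep='last'))

# keep=False: every occurrence of a duplicated row is flagged
print(df.duplicated(keep=False))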

Let’s walk through some scenarios to find duplicate rows in a DataFrame.

1. Pandas find duplicate rows based on all columns

To find duplicate rows based on all columns in a DataFrame, we can use the Pandas duplicated() method. With keep=False, every occurrence of a duplicated row is marked as True, so filtering the DataFrame with this mask returns all duplicate rows.

import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 101, 104],
    'State': ['CA', 'NY', 'TX', 'CA', 'FL'],
    'Amount': [200, 150, 300, 200, 150]
})
# keep=False flags every occurrence of a duplicated row, so the mask selects them all
duplicates = df[df.duplicated(keep=False)]
print(duplicates)

Output:

   OrderID State  Amount
0      101    CA     200
3      101    CA     200
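
If you only need to know how many duplicate rows exist rather than list them, a minimal sketch is to sum the Boolean Series returned by duplicated() (same OrderID DataFrame as above):

import pandas as pd

df = pd.DataFrame({
    'OrderID': [101, 102, 103, 101, 104],
    'State': ['CA', 'NY', 'TX', 'CA', 'FL'],
    'Amount': [200, 150, 300, 200, 150]
})

# Each repeat after the first occurrence counts once
print(df.duplicated().sum())            # 1

# Every row that belongs to a duplicate group counts
print(df.duplicated(keep=False).sum())  # 2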

2. Python find duplicates in a dataframe keeping the last occurrence

If we want to keep the last occurrence of each duplicate row and flag the earlier ones instead, we set the keep parameter to ‘last‘ in the Pandas duplicated() function in Python. Filtering with this mask returns every duplicate row except its final occurrence.

Here is the code that demonstrates this:

import pandas as pd

df = pd.DataFrame({
    'EmployeeID': [123, 124, 125, 123, 126],
    'Office': ['Seattle', 'Boston', 'Seattle', 'Seattle', 'Boston'],
    'Role': ['Engineer', 'Manager', 'Engineer', 'Engineer', 'Manager']
})
# keep='last' flags the earlier occurrences, so the last one stays unflagged
last_duplicates = df[df.duplicated(keep='last')]
print(last_duplicates)

Output:

   EmployeeID   Office      Role
0         123  Seattle  Engineer
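
If the goal is to actually remove the earlier copies and keep only the last occurrence of each row, a related sketch uses drop_duplicates() with the same keep='last' argument (same EmployeeID DataFrame as above; the deduped variable name is just illustrative):

import pandas as pd

df = pd.DataFrame({
    'EmployeeID': [123, 124, 125, 123, 126],
    'Office': ['Seattle', 'Boston', 'Seattle', 'Seattle', 'Boston'],
    'Role': ['Engineer', 'Manager', 'Engineer', 'Engineer', 'Manager']
})

# Drop duplicate rows, keeping only the last occurrence of each
deduped = df.drop_duplicates(keep='last')
print(deduped)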

3. Find duplicate values in a column Pandas

To find duplicates based on a single column, apply the duplicated() method to that specific column.

import pandas as pd

df = pd.DataFrame({
    'FlightNumber': ['AA101', 'AA102', 'AA101', 'AA103', 'AA102'],
    'Destination': ['New York', 'Chicago', 'New York', 'Miami', 'Chicago']
})
# Check for duplicates in the FlightNumber column only
flight_duplicates = df[df['FlightNumber'].duplicated(keep=False)]
print(flight_duplicates)

Output:

  FlightNumber Destination
0        AA101    New York
1        AA102     Chicago
2        AA101    New York
4        AA102     Chicago
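
If you also want to know how often each duplicated value occurs, a small sketch uses value_counts() on the same column (same DataFrame as above; the counts variable name is just illustrative):

import pandas as pd

df = pd.DataFrame({
    'FlightNumber': ['AA101', 'AA102', 'AA101', 'AA103', 'AA102'],
    'Destination': ['New York', 'Chicago', 'New York', 'Miami', 'Chicago']
})

# Count how many times each flight number appears
counts = df['FlightNumber'].value_counts()

# Keep only the values that occur more than once
print(counts[counts > 1])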

4. Python find duplicates using multiple columns

For identifying duplicates based on multiple columns, pass the column names to the subset parameter of the duplicated() function in Python.

import pandas as pd

df = pd.DataFrame({
    'ApplicantID': [1001, 1002, 1003, 1001, 1004],
    'University': ['Harvard', 'MIT', 'Harvard', 'Harvard', 'MIT'],
    'Major': ['CS', 'Physics', 'CS', 'CS', 'Maths']
})
# Only the listed columns are compared when looking for duplicates
uni_duplicates = df[df.duplicated(subset=['University', 'Major'], keep=False)]
print(uni_duplicates)

Output:

   ApplicantID University Major
0         1001    Harvard    CS
2         1003    Harvard    CS
3         1001    Harvard    CS
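
To remove these duplicates instead of just listing them, a minimal sketch passes the same subset to drop_duplicates() (same ApplicantID DataFrame as above; the unique_uni variable name is just illustrative):

import pandas as pd

df = pd.DataFrame({
    'ApplicantID': [1001, 1002, 1003, 1001, 1004],
    'University': ['Harvard', 'MIT', 'Harvard', 'Harvard', 'MIT'],
    'Major': ['CS', 'Physics', 'CS', 'CS', 'Maths']
})

# Keep only the first row for each (University, Major) combination
unique_uni = df.drop_duplicates(subset=['University', 'Major'], keep='first')
print(unique_uni)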

5. How to find duplicates in a Python dataframe using sort_values()

Sorting values before finding duplicates does not change which rows are flagged, but it places the duplicate rows next to each other, which makes them much easier to inspect in the output.

import pandas as pd

df = pd.DataFrame({
    'ProductID': [2001, 2002, 2003, 2001, 2002],
    'State': ['TX', 'CA', 'NY', 'TX', 'CA'],
    'Sales': [500, 600, 500, 500, 600]
})
# Sort first so duplicate rows end up next to each other, then flag them all
sorted_df = df.sort_values(by=['State', 'ProductID'])
sorted_duplicates = sorted_df[sorted_df.duplicated(keep=False)]
print(sorted_duplicates)

Output:

   ProductID State  Sales
1       2002    CA    600
4       2002    CA    600
0       2001    TX    500
3       2001    TX    500
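
Another way to inspect duplicate groups is to aggregate them; here is a sketch, assuming a simple groupby() with size() over all three columns (same ProductID DataFrame as above; the group_sizes variable name is just illustrative):

import pandas as pd

df = pd.DataFrame({
    'ProductID': [2001, 2002, 2003, 2001, 2002],
    'State': ['TX', 'CA', 'NY', 'TX', 'CA'],
    'Sales': [500, 600, 500, 500, 600]
})

# Count rows for each unique combination of all three columns
group_sizes = df.groupby(['ProductID', 'State', 'Sales']).size()

# Combinations that appear more than once are duplicates
print(group_sizes[group_sizes > 1])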

Conclusion

Here, I have explained how to find duplicates in Python using the Pandas df.duplicated() function, with practical examples like selecting duplicate rows based on all columns, keeping the last occurrence of duplicates, focusing on single and multiple columns, and leveraging sorted values.

These techniques demonstrate the flexibility and power of Pandas in handling duplicate data, providing essential tools for efficient data analysis and cleaning in diverse scenarios.
