How to Find Duplicates in Python DataFrame

In this Python Pandas tutorial, we will learn how to find duplicates in a Python DataFrame using Pandas. We will also cover the following topics:

  • How to identify duplicates in Python DataFrame
  • How to find duplicate values in Python DataFrame
  • How to find duplicates in a column in Python DataFrame
  • How to Count duplicate rows in Pandas DataFrame

How to Find Duplicates in Python DataFrame

  • In this program, we will discuss how to find duplicates in a Pandas DataFrame.
  • To do this, we can use the built-in Pandas method DataFrame.duplicated().
  • DataFrame.duplicated() returns a boolean Series with one entry per row. With the default settings, an entry is True only when that row repeats an earlier row.

Syntax:

Here is the syntax of the DataFrame.duplicated() method:

DataFrame.duplicated(subset=None, keep='first')

  • It takes two parameters:
    • subset: A column label or a sequence of column labels to consider when checking for duplicates. The default is None, which means all columns are compared.
    • keep: Specifies which occurrence, if any, is left unmarked. It accepts 'first', 'last', or False, and defaults to 'first'. With 'first', every duplicate except the first occurrence is marked True; with 'last', every duplicate except the last occurrence is marked True; with False, all duplicates are marked True.
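
To make the effect of the keep parameter concrete, here is a minimal sketch. The one-column DataFrame and the names letters, first_mask, last_mask, and all_mask are made up for illustration and are not part of the examples in this tutorial.

```python
import pandas as pd

# Hypothetical sample data: the value 'a' appears at positions 0, 2 and 3.
letters = pd.DataFrame({'letter': ['a', 'b', 'a', 'a']})

# keep='first' (default): every repeat after the first occurrence is True
first_mask = letters.duplicated(keep='first')

# keep='last': every repeat before the last occurrence is True
last_mask = letters.duplicated(keep='last')

# keep=False: every row that has at least one copy is True
all_mask = letters.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Note that False here is the Python boolean, not the string 'False'.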

Example:

Let’s look at a few examples based on this method.

Source Code:

import pandas as pd

new_list = [('Australia', 9, 'Germany'),
            ('China', 14, 'France'),
            ('Paris', 77, 'switzerland'),
            ('Australia', 9, 'Germany'),
            ('China', 88, 'Russia'),
            ('Germany', 77, 'Bangladesh')]

result = pd.DataFrame(new_list, columns=['Country_name', 'Value', 'new_count'])
new_output = result[result.duplicated()]
print("Duplicated values", new_output)

In the above code, we first create a DataFrame from the list ‘new_list’ and a list of column names. Calling result.duplicated() with no arguments compares all columns, and indexing the DataFrame with the resulting boolean Series selects the duplicate rows. Here only the row at index 3 ('Australia', 9, 'Germany') is returned, because it repeats the row at index 0.


Another example to find duplicates in Python DataFrame

In this example, we want to select duplicate rows based on a selected column. To perform this task, we again use the DataFrame.duplicated() method. In this program, we first create a list of tuples, build a DataFrame from it, and then pass the column name as the subset argument of duplicated().

Source Code:

import pandas as pd

student_info = [('George', 78, 'Australia'),
			('Micheal', 189, 'Germany'),
			('Oliva', 140, 'Malaysia'),
			('James', 95, 'Uganda'),
			('James', 95, 'Uganda'),
			('Oliva', 140, 'Malaysia'),
			('Elijah', 391, 'Japan'),
			('Chris', 167, 'China')
			]

df = pd.DataFrame(student_info,
				columns = ['Student_name', 'Student_id', 'Student_city'])


new_duplicate = df[df.duplicated('Student_city')]

print("Duplicate values in City :")
print(new_duplicate)

In the above code, printing ‘new_duplicate’ displays the rows whose Student_city value has already appeared in an earlier row. Here those are the rows at index 4 ('James', 95, 'Uganda') and index 5 ('Oliva', 140, 'Malaysia').
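
The subset parameter also accepts a list of column labels, in which case a row counts as a duplicate only when all of the listed columns match an earlier row. The sketch below assumes a small, made-up table that reuses the column names from the example above; the names students and by_pair are illustrative.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share both name and id, row 1 shares only the name.
students = pd.DataFrame(
    [('James', 95, 'Uganda'),
     ('James', 96, 'Uganda'),
     ('James', 95, 'Uganda')],
    columns=['Student_name', 'Student_id', 'Student_city'],
)

# A row is flagged only if the (Student_name, Student_id) pair was seen before.
by_pair = students[students.duplicated(subset=['Student_name', 'Student_id'])]

print(by_pair.index.tolist())  # [2]
```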


How to identify duplicates in Python DataFrame

  • Here we can see how to identify duplicate values in a Pandas DataFrame by using Python.
  • In the Pandas library, the DataFrame class provides the DataFrame.duplicated() method to identify duplicate rows based on all columns or on selected columns. It returns a boolean Series in which duplicate rows are marked True.

Example:

Let’s take an example and check how to identify duplicate rows in a Python DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Employee_name': ['George', 'John', 'Micheal', 'Potter', 'James', 'Oliva'],
    'Languages': ['Ruby', 'Sql', 'Mongodb', 'Ruby', 'Sql', 'Python'],
})
print("Existing DataFrame")
print(df)
print("Identify duplicate values:")
print(df.duplicated())

In the above example, we apply df.duplicated() to the whole DataFrame. The method marks a row as True only if every column matches an earlier row, and False otherwise. Although ‘Ruby’ and ‘Sql’ each appear twice in the Languages column, no complete row is repeated because every Employee_name is different, so the output is False for all six rows. To flag repeats in a single column, pass that column as the subset argument.
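
As a minimal sketch of the difference between checking whole rows and checking one column, the same DataFrame can be tested both ways. The variable names full_row_mask and language_mask are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee_name': ['George', 'John', 'Micheal', 'Potter', 'James', 'Oliva'],
    'Languages': ['Ruby', 'Sql', 'Mongodb', 'Ruby', 'Sql', 'Python'],
})

# Compare all columns: no complete row is repeated, so every entry is False.
full_row_mask = df.duplicated()

# Compare only the Languages column: the second 'Ruby' and second 'Sql' are True.
language_mask = df.duplicated(subset='Languages')

print(full_row_mask.tolist())  # [False, False, False, False, False, False]
print(language_mask.tolist())  # [False, False, False, True, True, False]
```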

Another example to identify duplicate rows in a Pandas DataFrame

In this example, we will select duplicate rows based on all columns. To do this task, we pass keep='last' as an argument. With this setting, every duplicate except the last occurrence is marked ‘True’.

Source Code:

import pandas as pd

employee_name = [('Chris', 178, 'Australia'),
			('Hemsworth', 987, 'Newzealand'),
			('George', 145, 'Switzerland'),
			('Micheal',668, 'Malaysia'),
			('Elijah', 402, 'England'),
			('Elijah',402, 'England'),
			('William',389, 'Russia'),
			('Hayden', 995, 'France')
			]


df = pd.DataFrame(employee_name,
				columns = ['emp_name', 'emp_id', 'emp_city'])

new_val = df[df.duplicated(keep = 'last')]

print("Duplicate Rows :")
print(new_val)

In the above code, we first import the Pandas library and create a list of tuples that holds the row values. We then build a DataFrame from it and call duplicated() with keep='last'. Printing ‘new_val’ displays the row at index 4 ('Elijah', 402, 'England'): it is the earlier of the two identical rows, and the last occurrence at index 5 is left unmarked.
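
When the goal is to inspect every copy of a repeated row rather than all but one, keep=False can be used as the filter. The following is a sketch on a shortened, made-up version of the employee table; the names employees and all_copies are illustrative.

```python
import pandas as pd

# Hypothetical, shortened employee table with one repeated row.
employees = pd.DataFrame(
    [('Elijah', 402, 'England'),
     ('William', 389, 'Russia'),
     ('Elijah', 402, 'England')],
    columns=['emp_name', 'emp_id', 'emp_city'],
)

# keep=False marks both copies, so neither occurrence is hidden.
all_copies = employees[employees.duplicated(keep=False)]

# Sorting places identical rows next to each other for easier review.
all_copies = all_copies.sort_values(['emp_name', 'emp_id'])

print(all_copies.index.tolist())  # [0, 2]
```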


How to find duplicate values in Python DataFrame

  • Let us see how to find duplicate values in a Python DataFrame.
  • Now we want to check whether a DataFrame contains any duplicate rows. To do this, we can combine df.loc[] with the df.duplicated() method.
  • df.loc[] selects rows and columns by label or by a boolean mask, and DataFrame.duplicated() produces such a mask, with True for each duplicate row.

Source Code:

import pandas as pd

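df = pd.DataFrame(data=[[6, 9], [18, 77], [6, 9], [26, 51], [119, 783]],
                  columns=['val1', 'val2'])
# Boolean mask: True for rows that repeat an earlier (val1, val2) pair
new_val = df.duplicated(subset=['val1', 'val2'], keep='first')
# Select the flagged rows; the mask can be passed to loc directly
new_output = df.loc[new_val]
print(new_output)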

In the above code, we first create a DataFrame with the columns ‘val1’ and ‘val2’. We then select the duplicate rows from it by passing the result of df.duplicated() to df.loc[]. The output contains only the row at index 2 (6, 9), which repeats the row at index 0.
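
If only a yes/no answer or a count is needed, the boolean Series returned by duplicated() can be reduced with any() or sum(). This is a minimal sketch on the same data; the names mask, has_duplicates, and duplicate_count are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[6, 9], [18, 77], [6, 9], [26, 51], [119, 783]],
                  columns=['val1', 'val2'])

mask = df.duplicated()

# True if at least one row repeats an earlier row
has_duplicates = bool(mask.any())

# Number of rows flagged as duplicates (True counts as 1)
duplicate_count = int(mask.sum())

print(has_duplicates)   # True
print(duplicate_count)  # 1
```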


How to find duplicates in a column in Python DataFrame

  • In this program, we will discuss how to find duplicates in a specific column of a Pandas DataFrame.
  • By passing the column name to the DataFrame.duplicated() method, we can find the rows whose value in that column has already appeared.

Example:

Let’s take an example and check how to find duplicate values in a column

Source Code:

import pandas as pd

Country_name = [('Uganda', 318),
			('Newzealand', 113),
			('France',189),
			('Australia', 788),
			('Australia', 788),
			('Russia', 467),
			('France', 189),
			('Paris', 654)
			]

df = pd.DataFrame(Country_name,
				columns = ['Count_name', 'Count_id'])

new_val = df[df.duplicated('Count_id')]
print("Duplicate Values")
print(new_val)

The output contains the rows at index 4 ('Australia', 788) and index 6 ('France', 189), because their Count_id values already appear in earlier rows.
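
A single column is a Series, and Series also has a duplicated() method, so the check can be run on the column directly. The sketch below uses a shortened version of the table above; the names id_mask and repeated_ids are illustrative.

```python
import pandas as pd

# Shortened version of the country table used above.
df = pd.DataFrame(
    [('Uganda', 318),
     ('France', 189),
     ('Australia', 788),
     ('Australia', 788),
     ('France', 189)],
    columns=['Count_name', 'Count_id'],
)

# Series.duplicated() flags values that already appeared earlier in the column.
id_mask = df['Count_id'].duplicated()

# Distinct ids that occur more than once, in order of their first repeat.
repeated_ids = df.loc[id_mask, 'Count_id'].unique().tolist()

print(id_mask.tolist())  # [False, False, False, True, True]
print(repeated_ids)      # [788, 189]
```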


How to Count duplicate rows in Pandas DataFrame

  • Let us see how to count duplicate rows in a Pandas DataFrame.
  • We can perform this task with df.pivot_table(). Unlike DataFrame.pivot(), which raises an error when an index/column pair is repeated, pivot_table() aggregates the rows that share the same index values.
  • With aggfunc='size', pivot_table() counts how many rows share each value of the chosen column, so any count greater than 1 indicates a duplicate in that column.

Source Code:

import pandas as pd

df = pd.DataFrame({'Student_name' : ['James', 'Potter', 'James',
							'William', 'Oliva'],
				'Student_desgination' : ['Python developer', 'Tester', 'Tester', 'Q.a assurance', 'Coder'],
				'City' : ['Germany', 'Australia', 'Germany',
								'Russia', 'France']})

new_val = df.pivot_table(index = ['Student_desgination'], aggfunc ='size')

print(new_val)

In the above code, we first import the Pandas module and then create a DataFrame from a dictionary, whose keys become the column names and whose values become the column values. pivot_table() then counts the rows for each value of Student_desgination. In the output, ‘Tester’ has a count of 2 and every other designation has a count of 1.
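
An alternative to pivot_table(), not used in the example above, is groupby() followed by size(), which gives the same per-value counts and can then be filtered to the values that occur more than once. This is a sketch on the same data; the names counts and repeated are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student_name': ['James', 'Potter', 'James', 'William', 'Oliva'],
    'Student_desgination': ['Python developer', 'Tester', 'Tester',
                            'Q.a assurance', 'Coder'],
    'City': ['Germany', 'Australia', 'Germany', 'Russia', 'France'],
})

# Number of rows for each designation
counts = df.groupby('Student_desgination').size()

# Keep only the designations that appear more than once
repeated = counts[counts > 1]

print(repeated)
```

Here repeated holds a single entry, ‘Tester’ with a count of 2.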


In this Python Pandas tutorial, we have learned how to find duplicates in a Python DataFrame using Pandas. We have also covered the following topics:

  • How to identify duplicates in Python DataFrame
  • How to find duplicate values in Python DataFrame
  • How to find duplicates in a column in Python DataFrame
  • How to Count duplicate rows in Pandas DataFrame