In this Python Pandas tutorial, we will learn how to find duplicates in a Python DataFrame using Pandas. We will also cover these topics.
- How to identify duplicates in Python DataFrame
- How to find duplicate values in Python DataFrame
- How to find duplicates in a column in Python DataFrame
- How to Count duplicate rows in Pandas DataFrame
How to Find Duplicates in Python DataFrame
- In this program, we will discuss how to find duplicates in a Pandas DataFrame.
- To do this task, we can use the built-in Pandas method DataFrame.duplicated() to find duplicate rows in a DataFrame.
- The DataFrame.duplicated() method returns a boolean Series with one entry per row. An entry is True when the row repeats an earlier row and False otherwise.
Syntax:
Here is the syntax of the DataFrame.duplicated() method
DataFrame.duplicated(subset=None, keep='first')
- It takes the following parameters
- subset: This parameter takes a column label or a list of column labels to use when checking for duplicates. By default its value is None, which means all columns are compared.
- keep: This parameter specifies which occurrences are marked as duplicates. It accepts three values: 'first' (mark all duplicates except the first occurrence), 'last' (mark all duplicates except the last occurrence), and False (mark every occurrence). By default, it takes 'first'.
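The effect of the three keep values can be sketched on a tiny DataFrame. The data and the variable names below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: the value 'a' appears three times
letters = pd.DataFrame({'letter': ['a', 'b', 'a', 'a']})

# Mark every repeat except the first occurrence (the default)
first_mask = letters.duplicated(keep='first')
# Mark every repeat except the last occurrence
last_mask = letters.duplicated(keep='last')
# Mark every row that has at least one duplicate
all_mask = letters.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Note that False is the boolean value, not the string 'False'.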
Example:
Let's work through a few examples based on this method
Source Code:
import pandas as pd
new_list = [('Australia', 9, 'Germany'),
('China', 14, 'France'), ('Paris', 77, 'switzerland'),
('Australia',9, 'Germany'), ('China', 88, 'Russia'),
('Germany', 77, 'Bangladesh')]
result = pd.DataFrame(new_list, columns=['Country_name', 'Value', 'new_count'])
# Keep only the rows flagged as duplicates (all columns are compared)
new_output = result[result.duplicated()]
print("Duplicated values", new_output)
In the above code, we first created a DataFrame object from the list 'new_list' and a list of column names. We then called result.duplicated() without arguments, which compares all columns, and used the resulting boolean Series to select the duplicate rows. Only the second ('Australia', 9, 'Germany') row is returned, because by default the first occurrence is not marked as a duplicate.
Another example to find duplicates in Python DataFrame
In this example, we want to select duplicate rows based on selected columns only. To perform this task, we can pass a column label (or a list of column labels) as the subset parameter of the DataFrame.duplicated() method. In this program, we first create a list of tuples, then build a DataFrame from it, and finally check for duplicates in the 'Student_city' column.
Source Code:
import pandas as pd
student_info = [('George', 78, 'Australia'),
('Micheal', 189, 'Germany'),
('Oliva', 140, 'Malaysia'),
('James', 95, 'Uganda'),
('James', 95, 'Uganda'),
('Oliva', 140, 'Malaysia'),
('Elijah', 391, 'Japan'),
('Chris', 167, 'China')
]
df = pd.DataFrame(student_info,
columns = ['Student_name', 'Student_id', 'Student_city'])
new_duplicate = df[df.duplicated('Student_city')]
print("Duplicate values in City :")
print(new_duplicate)
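In the above code, once you print 'new_duplicate', the output displays the rows whose 'Student_city' value has already appeared in an earlier row. Here these are the second ('James', 95, 'Uganda') row and the second ('Oliva', 140, 'Malaysia') row.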
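The subset parameter also accepts a list of labels, in which case a row counts as a duplicate only when all of the listed columns match. The following is a minimal sketch using hypothetical data, not the list from the example above.

```python
import pandas as pd

# Hypothetical data: three students share a city,
# but only two of them share both name and city
students = pd.DataFrame({
    'Student_name': ['James', 'James', 'Oliva', 'Chris'],
    'Student_id': [95, 96, 140, 167],
    'Student_city': ['Uganda', 'Uganda', 'Uganda', 'China'],
})

# Duplicates judged on a single column
by_city = students[students.duplicated(subset='Student_city')]

# Duplicates judged on a combination of columns
by_name_and_city = students[
    students.duplicated(subset=['Student_name', 'Student_city'])
]

print(by_city)
print(by_name_and_city)
```

Checking on 'Student_city' alone flags the rows at index 1 and 2, while checking on both columns flags only the row at index 1.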
How to identify duplicates in Python DataFrame
- Here we can see how to identify duplicate values in a Pandas DataFrame by using Python.
- In the Pandas library, the DataFrame class provides the DataFrame.duplicated() method to identify duplicate rows based on all or selected columns. It always returns a boolean Series in which duplicate rows are marked True.
Example:
Let’s take an example and check how to identify duplicate row values in Python DataFrame
import pandas as pd
df = pd.DataFrame({'Employee_name': ['George','John', 'Micheal', 'Potter','James','Oliva'],'Languages': ['Ruby','Sql','Mongodb','Ruby','Sql','Python']})
print("Existing DataFrame")
print(df)
print("Identify duplicate values:")
print(df.duplicated())
In the above example, we created a DataFrame and then applied the df.duplicated() method. For each row, it returns True if the row repeats an earlier row and False otherwise. Note that df.duplicated() compares all columns by default: although 'Ruby' and 'Sql' each appear twice in the 'Languages' column, every 'Employee_name' is different, so no complete row is repeated and every value in the output is False.
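To flag the repeated languages in the DataFrame above, one option is to restrict the check to the 'Languages' column with subset. This is a sketch that reuses the same data; the variable names all_columns and languages_only are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee_name': ['George', 'John', 'Micheal', 'Potter', 'James', 'Oliva'],
    'Languages': ['Ruby', 'Sql', 'Mongodb', 'Ruby', 'Sql', 'Python'],
})

# Comparing all columns: no complete row is repeated
all_columns = df.duplicated()

# Comparing only 'Languages': the second 'Ruby' and second 'Sql' are flagged
languages_only = df.duplicated(subset='Languages')

print(all_columns.tolist())     # [False, False, False, False, False, False]
print(languages_only.tolist())  # [False, False, False, True, True, False]
```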
Another example to identify duplicate row values in Pandas DataFrame
In this example, we will select duplicate rows based on all columns. To do this task we will pass keep='last' as an argument. With this setting, all duplicates except their last occurrence are marked True.
Source Code:
import pandas as pd
employee_name = [('Chris', 178, 'Australia'),
('Hemsworth', 987, 'Newzealand'),
('George', 145, 'Switzerland'),
('Micheal',668, 'Malaysia'),
('Elijah', 402, 'England'),
('Elijah',402, 'England'),
('William',389, 'Russia'),
('Hayden', 995, 'France')
]
df = pd.DataFrame(employee_name,
columns = ['emp_name', 'emp_id', 'emp_city'])
new_val = df[df.duplicated(keep = 'last')]
print("Duplicate Rows :")
print(new_val)
In the above code, we first imported the Pandas library and then created a list of tuples holding the row values. Next, we created a DataFrame object and passed keep='last' as an argument to df.duplicated(). Once you print 'new_val', the output displays the first ('Elijah', 402, 'England') row, because with keep='last' only the last occurrence is left unmarked.
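If every copy of a repeated row should be shown, rather than all copies but one, keep=False can be passed instead. Below is a small sketch using a shortened, illustrative version of the employee list above.

```python
import pandas as pd

# Shortened version of the employee data, for illustration only
employees = pd.DataFrame(
    [('George', 145, 'Switzerland'),
     ('Elijah', 402, 'England'),
     ('Elijah', 402, 'England'),
     ('William', 389, 'Russia')],
    columns=['emp_name', 'emp_id', 'emp_city'],
)

# keep='last' leaves the last occurrence unmarked
except_last = employees[employees.duplicated(keep='last')]

# keep=False marks every occurrence of a repeated row
every_copy = employees[employees.duplicated(keep=False)]

print(except_last)
print(every_copy)
```

Here except_last holds the row at index 1, and every_copy holds the rows at index 1 and 2.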
How to find duplicate values in Python DataFrame
- Let us see how to find duplicate values in Python DataFrame.
- Now we want to check whether a DataFrame contains any duplicate elements. To do this task we can combine the df.loc indexer with the df.duplicated() method.
- In Pandas, loc is an indexer (used with square brackets, not called like a function) that retrieves a group of rows and columns by label or by a boolean array. The DataFrame.duplicated() method produces such a boolean array, marking the duplicate rows in the DataFrame.
Source Code:
import pandas as pd
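df = pd.DataFrame(data=[[6, 9], [18, 77], [6, 9], [26, 51], [119, 783]], columns=['val1', 'val2'])
# Boolean mask: True for rows that repeat an earlier row
new_val = df.duplicated(subset=['val1', 'val2'], keep='first')
# A boolean Series can be passed to loc directly
new_output = df.loc[new_val]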
print(new_output)
In the above code, we first created a DataFrame object with two columns. We then used the df.duplicated() method to build a boolean mask and df.loc to select the rows where the mask is True. The output is the second [6, 9] row.
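Because the mask is an ordinary boolean Series, it can also be inverted with ~ to select the rows that are not duplicates, or combined with a column label in loc. A brief sketch on the same data follows; the names mask, duplicates, non_duplicates, and duplicate_val1 are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[6, 9], [18, 77], [6, 9], [26, 51], [119, 783]],
                  columns=['val1', 'val2'])

mask = df.duplicated(subset=['val1', 'val2'], keep='first')

# Rows flagged as duplicates
duplicates = df.loc[mask]

# Rows not flagged, selected by inverting the mask
non_duplicates = df.loc[~mask]

# loc also accepts a column label alongside the row mask
duplicate_val1 = df.loc[mask, 'val1']

print(duplicates)
print(non_duplicates)
print(duplicate_val1.tolist())  # [6]
```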
How to find duplicates in a column in Python DataFrame
- In this program, we will discuss how to find duplicates in a specific column of a Pandas DataFrame.
- By passing the column label to the DataFrame.duplicated() method, we can find duplicate values in that column.
Example:
Let's take an example and check how to find duplicate values in a column
Source Code:
import pandas as pd
Country_name = [('Uganda', 318),
('Newzealand', 113),
('France',189),
('Australia', 788),
('Australia', 788),
('Russia', 467),
('France', 189),
('Paris', 654)
]
df = pd.DataFrame(Country_name,
columns = ['Count_name', 'Count_id'])
new_val = df[df.duplicated('Count_id')]
print("Duplicate Values")
print(new_val)
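The example above passes a column label to DataFrame.duplicated(). As an alternative sketch, the column can be selected first and checked with Series.duplicated(), which is convenient for listing the values that repeat. The names id_mask and repeated_values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [('Uganda', 318), ('Newzealand', 113), ('France', 189),
     ('Australia', 788), ('Australia', 788), ('Russia', 467),
     ('France', 189), ('Paris', 654)],
    columns=['Count_name', 'Count_id'],
)

# Series.duplicated() works on a single column
id_mask = df['Count_id'].duplicated()

# Distinct values that occur more than once in the column
repeated_values = df.loc[id_mask, 'Count_id'].unique()

print(df[id_mask])
print(sorted(repeated_values.tolist()))  # [189, 788]
```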
How to Count duplicate rows in Pandas DataFrame
- Let us see how to count duplicate rows in Pandas DataFrame.
- By using df.pivot_table() we can perform this task. Unlike DataFrame.pivot(), which raises an error when the index and column pairs contain duplicate entries, pivot_table() aggregates duplicate entries with the function given in aggfunc.
- With aggfunc='size', pivot_table() counts how many rows share each value of the index column, so it can be used to count the duplicates in a single column.
Source Code:
import pandas as pd
df = pd.DataFrame({'Student_name' : ['James', 'Potter', 'James',
'William', 'Oliva'],
'Student_desgination' : ['Python developer', 'Tester', 'Tester', 'Q.a assurance', 'Coder'],
'City' : ['Germany', 'Australia', 'Germany',
'Russia', 'France']})
new_val = df.pivot_table(index = ['Student_desgination'], aggfunc ='size')
print(new_val)
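In the above code, we first import the Pandas module and then create a DataFrame object from a dictionary whose keys become the column names. pivot_table() then returns the number of rows for each 'Student_desgination' value: 'Tester' has a count of 2, and every other value has a count of 1.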
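pivot_table() is one way to obtain these counts. The sketch below shows a few other common approaches on the same data: summing the boolean Series from duplicated(), and counting with groupby() or value_counts(). The variable names are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Student_name': ['James', 'Potter', 'James', 'William', 'Oliva'],
    'Student_desgination': ['Python developer', 'Tester', 'Tester',
                            'Q.a assurance', 'Coder'],
    'City': ['Germany', 'Australia', 'Germany', 'Russia', 'France'],
})

# True counts as 1, so sum() gives the number of rows flagged as duplicates
full_row_duplicates = int(df.duplicated().sum())
designation_duplicates = int(df.duplicated(subset='Student_desgination').sum())

# Occurrences of each value, similar to pivot_table(aggfunc='size')
counts_by_group = df.groupby('Student_desgination').size()
counts_by_value = df['Student_desgination'].value_counts()

print(full_row_duplicates)     # 0
print(designation_duplicates)  # 1
print(counts_by_group)
print(counts_by_value)
```

No complete row is repeated in this data, so the first count is 0, while the 'Student_desgination' column contains one repeated value.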
In this Python Pandas tutorial, we have learned how to find duplicates in a Python DataFrame using Pandas. We have also covered these topics.
- How to identify duplicates in Python DataFrame
- How to find duplicate values in Python DataFrame
- How to find duplicates in a column in Python DataFrame
- How to Count duplicate rows in Pandas DataFrame
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, and Scikit-Learn, working for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc.