Pandas drop_duplicates() function in Python [6 Examples]

In this Python article, I will explain the Pandas drop_duplicates() function: its syntax, the parameters it accepts, and some illustrative examples.

To effectively manage and clean datasets in Python, the Pandas drop_duplicates() function is indispensable. It offers diverse options like removing duplicate rows based on all or a subset of columns, retaining the first or last occurrence of duplicates, and even removing all duplicate entries.

Pandas drop_duplicates() function in Python

The Pandas drop_duplicates() function is a powerful tool for removing duplicate rows from a DataFrame. It is especially useful in data preprocessing, where we need to ensure that the dataset does not contain redundant information.

The basic syntax of the Pandas drop_duplicates() function in Python is:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters required:

  1. subset: Column label or sequence of labels to consider when identifying duplicates. By default, all columns are used.
  2. keep: Determines which duplicates (if any) to keep.
    • 'first': (default) Drop duplicates except for the first occurrence.
    • 'last': Drop duplicates except for the last occurrence.
    • False: Drop all duplicates.
  3. inplace: If True, performs the operation in place and returns None.
  4. ignore_index: If True, the resulting axis will be labeled 0, 1, …, n - 1.
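These parameters can be combined in a single call. Here is a minimal sketch, using a small hypothetical DataFrame, that deduplicates on one column, keeps the last occurrence, and resets the index:

```python
import pandas as pd

# Hypothetical data: 'Bob' appears twice with different scores
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob'],
    'Score': [85, 70, 90]
})

# Keep only the last row per Name and relabel the index 0, 1, ...
result = df.drop_duplicates(subset=['Name'], keep='last', ignore_index=True)
print(result)
#     Name  Score
# 0  Alice     85
# 1    Bob     90
```

Note that keep='last' retains Bob's second row (Score 90), and ignore_index=True renumbers the surviving rows from 0.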

drop_duplicates in Python Pandas use cases

Below is a detailed explanation of the drop_duplicates() function and several examples to illustrate its use.

1. Pandas drop duplicates function in Python

The simplest use of the Pandas drop_duplicates() function in Python is to remove duplicate rows from a DataFrame based on all columns.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Eve', 'Bob', 'Eve'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'Event': ['Concert', 'Sports', 'Concert', 'Theatre', 'Sports', 'Theatre']
}
df = pd.DataFrame(data)

df_basic = df.drop_duplicates()
print("Basic Usage:\n", df_basic, "\n")

Output:

Basic Usage:
     Name         City    Event
0  Alice     New York  Concert
1    Bob  Los Angeles   Sports
3    Eve      Chicago  Theatre


2. drop duplicates Pandas in Python with subset of columns

Here, the Pandas drop_duplicates() function in Python is used with the subset parameter to remove duplicates based on specific columns.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Eve', 'Bob', 'Eve'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'Event': ['Concert', 'Sports', 'Concert', 'Theatre', 'Sports', 'Theatre']
}
df = pd.DataFrame(data)

df_subset = df.drop_duplicates(subset=['Name', 'City'])
print("Subset of Columns:\n", df_subset)

Output:

Subset of Columns:
     Name         City    Event
0  Alice     New York  Concert
1    Bob  Los Angeles   Sports
3    Eve      Chicago  Theatre


3. drop_duplicates with keeping the last duplicate

Here, the Pandas drop_duplicates() function in Python is configured with keep='last' to retain the last occurrence of each duplicate row, instead of the default first occurrence.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Eve', 'Bob', 'Eve'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'Event': ['Concert', 'Sports', 'Concert', 'Theatre', 'Sports', 'Theatre']
}
df = pd.DataFrame(data)

df_last = df.drop_duplicates(keep='last')
print("Keeping the Last Duplicate:\n", df_last)

Output:

Keeping the Last Duplicate:
     Name         City    Event
2  Alice     New York  Concert
4    Bob  Los Angeles   Sports
5    Eve      Chicago  Theatre


4. Python pandas drop duplicates removing all duplicates

Calling the Pandas drop_duplicates() function in Python with keep=False removes every instance of a duplicated row, leaving only rows that were unique to begin with. In this example every row appears twice, so the result is empty.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Eve', 'Bob', 'Eve'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'Event': ['Concert', 'Sports', 'Concert', 'Theatre', 'Sports', 'Theatre']
}
df = pd.DataFrame(data)

df_all = df.drop_duplicates(keep=False)
print("Removing All Duplicates:\n", df_all)

Output:

Removing All Duplicates:
 Empty DataFrame
Columns: [Name, City, Event]
Index: []
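keep=False can also be combined with the subset parameter. A brief sketch, using hypothetical data, that drops every row whose 'Name' appears more than once:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol'],
    'Score': [85, 70, 90, 60]
})

# keep=False drops all rows for any Name that occurs more than once
unique_names = df.drop_duplicates(subset=['Name'], keep=False)
print(unique_names)
#     Name  Score
# 1    Bob     70
# 3  Carol     60
```

Here 'Alice' occurs twice, so both of her rows are removed, while the names that appear exactly once survive.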


5. Pandas drop duplicates with inplace removal

Setting the inplace=True parameter makes the Pandas drop_duplicates() function in Python modify the original DataFrame directly and return None.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Eve', 'Bob', 'Eve'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'Event': ['Concert', 'Sports', 'Concert', 'Theatre', 'Sports', 'Theatre']
}
df = pd.DataFrame(data)

df_inplace = df.copy()  # Creating a copy to preserve original df
df_inplace.drop_duplicates(inplace=True)
print("Inplace Removal:\n", df_inplace)

Output:

Inplace Removal:
     Name         City    Event
0  Alice     New York  Concert
1    Bob  Los Angeles   Sports
3    Eve      Chicago  Theatre


6. Python pandas drop duplicates with ignore_index

To reset the index of the DataFrame after duplicate removal, we can pass ignore_index=True, which relabels the result with a sequential index 0, 1, …, n - 1.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Eve', 'Bob', 'Eve'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
    'Event': ['Concert', 'Sports', 'Concert', 'Theatre', 'Sports', 'Theatre']
}
df = pd.DataFrame(data)

df_ignore_index = df.drop_duplicates(ignore_index=True)
print("Ignoring the Index:\n", df_ignore_index)

Output:

Ignoring the Index:
     Name         City    Event
0  Alice     New York  Concert
1    Bob  Los Angeles   Sports
2    Eve      Chicago  Theatre


Conclusion

Here, I have explained how the Pandas drop_duplicates() function in Python is a versatile tool for data cleaning. Through examples like basic usage, a subset of columns, keeping the last duplicate, removing all duplicates, inplace removal, and ignoring the index, its capacity to efficiently handle, modify, and streamline datasets has been demonstrated, highlighting its importance in data analysis and manipulation.

