Pandas drop_duplicates(): Remove Duplicate Rows

While working on a data analysis project, I had to clean a large dataset containing many duplicate entries. Handling those duplicates manually would have been time-consuming and error-prone, so I needed a simple, reliable solution.

In this guide, I will cover several ways you can use pandas’ drop_duplicates() function to efficiently remove duplicate rows in Python (with examples for different scenarios).

So let's get started.

pandas drop_duplicates()

The drop_duplicates() function in pandas is a useful method that allows you to remove duplicate rows from a DataFrame. It’s one of those functions I use almost daily in my data cleaning workflows.

This function returns a new DataFrame with duplicate rows removed, keeping only the first occurrence by default (though this behavior can be changed).

Basic Usage of drop_duplicates()

Let’s start with a simple example to understand how drop_duplicates() works:

import pandas as pd

# Create a sample DataFrame with duplicates
data = {
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [28, 32, 28, 45, 32],
    'City': ['New York', 'Boston', 'New York', 'Chicago', 'Boston']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

Output:

Original DataFrame:
    Name  Age      City
0   John   28  New York
1  Sarah   32    Boston
2   John   28  New York
3   Mike   45   Chicago
4  Sarah   32    Boston

DataFrame after removing duplicates:
    Name  Age      City
0   John   28  New York
1  Sarah   32    Boston
3   Mike   45   Chicago

In this example, we have duplicate entries for John and Sarah. When we call drop_duplicates() without any parameters, it considers all columns and keeps only the first occurrence of each duplicate row.
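Before deleting anything, it is often worth inspecting which rows pandas considers duplicates. The companion method duplicated() returns a boolean mask (True for rows that repeat an earlier row), which you can use to preview the rows drop_duplicates() would remove:

```python
import pandas as pd

data = {
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [28, 32, 28, 45, 32],
    'City': ['New York', 'Boston', 'New York', 'Chicago', 'Boston']
}
df = pd.DataFrame(data)

# duplicated() marks rows that repeat an earlier row;
# the first occurrence stays False
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, True]

# The mask lets you review the duplicates before dropping them
print(df[mask])
```

This is handy for a sanity check on real data: if the mask flags far more rows than you expect, your duplicate criteria probably need a subset (covered next).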

1. Remove Duplicates Based on Specific Columns

Often, you’ll want to identify duplicates based on only certain columns. For example, in a customer database, you might consider entries duplicates if they have the same name and email, even if other information differs.

Here’s how to do that:

import pandas as pd

# Sample DataFrame with fewer rows
data = {
    'Name': ['Alice', 'Bob', 'Alice'],
    'Email': ['alice@example.com', 'bob@example.com', 'alice@example.com'],
    'Purchase': [100, 200, 150]
}

df = pd.DataFrame(data)

# Drop duplicates based on Name and Email
df_unique = df.drop_duplicates(subset=['Name', 'Email'])

print(df_unique) 

Output:

    Name              Email  Purchase
0  Alice  alice@example.com       100
1    Bob    bob@example.com       200

In this example, rows count as duplicates only if they have the same values in both the ‘Name’ and ‘Email’ columns. Notice that the second Alice row is dropped even though its Purchase value (150) differs from the first: the Purchase column is simply not part of the comparison.

2. Control Which Duplicate to Keep

By default, drop_duplicates() keeps the first occurrence of each duplicate row. You can change this behavior with the ‘keep’ parameter:

  • keep=’first’: Keep the first occurrence (default)
  • keep=’last’: Keep the last occurrence
  • keep=False: Drop all duplicates

Here’s an example:

import pandas as pd

# Sample data with duplicates
data = {
    'Item': ['Pen', 'Pencil', 'Pen', 'Eraser'],
    'Price': [1.2, 0.5, 1.3, 0.8]
}

df = pd.DataFrame(data)

# Keep first occurrence (default)
print("Keep first occurrence:")
print(df.drop_duplicates(subset=['Item'], keep='first'))

# Keep last occurrence
print("\nKeep last occurrence:")
print(df.drop_duplicates(subset=['Item'], keep='last'))

# Drop all duplicates
print("\nDrop all duplicates:")
print(df.drop_duplicates(subset=['Item'], keep=False))

Output:

Keep first occurrence:
     Item  Price
0     Pen    1.2
1  Pencil    0.5
3  Eraser    0.8

Keep last occurrence:
     Item  Price
1  Pencil    0.5
2     Pen    1.3
3  Eraser    0.8

Drop all duplicates:
     Item  Price
1  Pencil    0.5
3  Eraser    0.8

The ‘keep=False’ option is particularly useful when you want to identify and remove all instances of products or items that have duplicates, leaving only those that are truly unique.

3. Modify the Original DataFrame

If you want to modify the original DataFrame instead of creating a new one, you can use the ‘inplace’ parameter:

import pandas as pd

# Sample data
data = {
    'State': ['California', 'New York', 'California', 'Texas', 'New York'],
    'Population': [39.5, 19.3, 39.5, 29.0, 19.3]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Remove duplicates in-place
df.drop_duplicates(inplace=True)
print("\nDataFrame after in-place duplicate removal:")
print(df)

This approach avoids keeping two separate DataFrame variables around, which is convenient for large datasets. (Note that pandas may still build an intermediate result internally, so ‘inplace=True’ is mostly a convenience rather than a guaranteed memory saving.)
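One related detail: after dropping duplicates, the surviving rows keep their original index labels, which leaves gaps. If you want a clean 0..n-1 index, drop_duplicates() accepts an ignore_index parameter (available since pandas 1.0) so you don't need a separate reset_index() call:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['California', 'California', 'New York'],
    'Population': [39.5, 39.5, 19.3]
})

# Default: surviving rows keep their original labels,
# leaving a gap where row 1 was dropped
default_idx = df.drop_duplicates().index.tolist()
print(default_idx)  # [0, 2]

# ignore_index=True relabels the result 0..n-1
clean_idx = df.drop_duplicates(ignore_index=True).index.tolist()
print(clean_idx)  # [0, 1]
```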

4. Handle Duplicates with Missing Values (NaN)

An important note: although NaN != NaN in ordinary comparisons, pandas treats two NaN (Not a Number) values as equal for duplicate detection. This means rows with NaN in the same position can be identified as duplicates:

import pandas as pd
import numpy as np

# Sample data with NaN values
data = {
    'ID': [1, 2, 3, 4, 5],
    'Value': [10, np.nan, 20, np.nan, 30]
}

df = pd.DataFrame(data)
print("Original DataFrame with NaN values:")
print(df)

# Remove duplicates
df_no_duplicates = df.drop_duplicates(subset=['Value'])
print("\nAfter removing duplicates based on Value column:")
print(df_no_duplicates)

In this example, only one of the rows with NaN in the ‘Value’ column will be kept.
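You can verify this with duplicated(): the second NaN row is flagged as a duplicate of the first. If that is not what you want (say, each NaN represents a genuinely unknown, distinct value), one workaround is to deduplicate only the non-NaN rows and append the NaN rows back. This is a sketch of that approach, not built-in drop_duplicates() behavior:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Value': [10, np.nan, 20, np.nan, 30]})

# The second NaN row (index 3) is flagged as a duplicate of the first
dup_mask = df.duplicated(subset=['Value'])
print(dup_mask.tolist())  # [False, False, False, True, False]

# Workaround: deduplicate non-NaN rows, then put the NaN rows back
non_nan = df[df['Value'].notna()].drop_duplicates(subset=['Value'])
result = pd.concat([non_nan, df[df['Value'].isna()]]).sort_index()
print(result['ID'].tolist())  # [1, 2, 3, 4, 5]
```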

Practical Example: Clean US Customer Data

Let’s look at a more realistic example using a customer dataset from a US-based e-commerce company:

import pandas as pd

# Sample US customer data with duplicates
data = {
    'Customer_ID': [1001, 1002, 1003, 1001, 1004, 1005, 1003],
    'Name': ['John Davis', 'Mary Wilson', 'Robert Brown', 'John Davis', 
             'Jennifer Smith', 'Michael Johnson', 'Robert Brown'],
    'Email': ['john.d@example.com', 'mary.w@example.com', 'rob.b@example.com',
              'john.d@example.com', 'jen.s@example.com', 'mike.j@example.com',
              'robert.b@different.com'],
    'State': ['California', 'New York', 'Texas', 'California', 
              'Florida', 'Illinois', 'Texas'],
    'Purchase_Amount': [120.50, 85.20, 200.00, 75.30, 150.00, 95.80, 65.40]
}

customers_df = pd.DataFrame(data)
print("Original customer dataset:")
print(customers_df)

# 1. Remove exact duplicates
exact_duplicates_removed = customers_df.drop_duplicates()
print("\nAfter removing exact duplicates:")
print(exact_duplicates_removed)

# 2. Remove duplicates based on Customer_ID and Email
id_email_duplicates_removed = customers_df.drop_duplicates(subset=['Customer_ID', 'Email'])
print("\nAfter removing duplicates based on Customer_ID and Email:")
print(id_email_duplicates_removed)

# 3. Keep the record with the highest purchase amount for each customer
# Sort by Purchase_Amount in descending order first
customers_df_sorted = customers_df.sort_values('Purchase_Amount', ascending=False)
highest_purchases = customers_df_sorted.drop_duplicates(subset=['Customer_ID'], keep='first')
print("\nKeeping only the highest purchase amount for each customer:")
print(highest_purchases)

This example shows how you might clean customer data by:

  1. Removing exact duplicates across all columns
  2. Removing duplicates based on key identifiers (Customer_ID and Email)
  3. Keeping only the highest-value purchase for each customer
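For step 3, an alternative to sort_values() followed by drop_duplicates() is groupby() with idxmax(), which selects the row label of the largest Purchase_Amount within each Customer_ID. Both approaches give the same rows; this sketch uses a trimmed-down version of the data:

```python
import pandas as pd

customers = pd.DataFrame({
    'Customer_ID': [1001, 1002, 1001],
    'Purchase_Amount': [120.50, 85.20, 75.30]
})

# idxmax() returns the index label of the max value per group
best_idx = customers.groupby('Customer_ID')['Purchase_Amount'].idxmax()
highest = customers.loc[best_idx].sort_index()
print(highest['Purchase_Amount'].tolist())  # [120.5, 85.2]
```

The groupby version states the intent ("best row per customer") more directly, while the sort-then-drop version can be faster when you are already sorting for other reasons.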

Performance Considerations

For very large datasets, drop_duplicates() can be computationally expensive. Here are some tips to improve performance:

  1. Pass only the columns you actually need to the ‘subset’ parameter, so fewer values have to be compared
  2. Drop unused columns before deduplicating if you only need a few of them in the result
  3. Check the duplicate count first with df.duplicated().sum(); if it's zero, you can skip the operation entirely

I've found that for datasets with millions of rows, it pays to check which columns actually contain duplicates before running drop_duplicates() across everything.
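A quick way to gauge the duplicate load before deduplicating is duplicated().sum(), which counts duplicate rows without materializing a new DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 1, 3, 2],
    'Value': [10, 20, 10, 30, 25]
})

# Count duplicates across all columns vs. just the ID column
n_full = df.duplicated().sum()             # only (1, 10) repeats exactly
n_id = df.duplicated(subset=['ID']).sum()  # IDs 1 and 2 both repeat
print(n_full, n_id)  # 1 2

# Only do the (more expensive) drop when there is something to drop
if n_id:
    df = df.drop_duplicates(subset=['ID'])
```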

Combine with Other Pandas Functions

The drop_duplicates() function becomes even more useful when combined with other pandas functions. Here’s a quick example that demonstrates a common data cleaning workflow:

import pandas as pd

# Load data (imagine this is a large CSV file)
# df = pd.read_csv('large_dataset.csv')

# For demonstration, we'll create a sample DataFrame
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 2],
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'B'],
    'Value': [100, 200, 300, 400, 500, 150, 250]
})

# Complete data cleaning workflow
clean_df = (df
    # First, fill any missing values
    .fillna({'Category': 'Unknown', 'Value': 0})
    # Then normalize text columns to lowercase
    .assign(Category=lambda x: x['Category'].str.lower())
    # Now remove duplicates based on ID
    .drop_duplicates(subset=['ID'], keep='last')
    # Finally, sort by ID
    .sort_values('ID')
    # Reset index
    .reset_index(drop=True)
)

print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
print(clean_df)

This pattern of chaining pandas operations keeps the whole cleaning workflow in one expression, avoids scattering intermediate variables, and makes your code more readable.

The drop_duplicates() function is an essential tool in any data scientist’s toolkit. It offers a simple yet flexible way to handle duplicate data in pandas DataFrames.

Remember that for large datasets, performance can be a consideration, so it’s often best to be selective about which columns you use to identify duplicates.
