How to Find Duplicates in Python Pandas

Finding and handling duplicate values is a common task when working with data in Python. Recently, I was analyzing a customer dataset for a US e-commerce company and needed to identify duplicate customer records that were skewing our analytics.

The pandas library makes this process easy with several built-in methods.

In this guide, I will show you how to find duplicates in pandas using various approaches I’ve refined over my years of data analysis work.

Let's get started.

Method 1: Use duplicated() to Identify Duplicate Rows

The most direct way to find duplicates in pandas is the duplicated() method, which returns a Boolean Series indicating which rows are duplicates.

Here’s how I use it when analyzing customer data:

import pandas as pd

# Sample US customer data
data = {
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith', 
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor'],
    'state': ['California', 'New York', 'Texas', 'California', 
              'Florida', 'Illinois', 'Illinois']
}

# Create DataFrame
df = pd.DataFrame(data)

# Find duplicate rows
duplicates = df.duplicated()

print("Duplicate rows:")
print(df[duplicates])

Output:

Duplicate rows:
   customer_id           name       state
3          101     John Smith  California
6          105  Robert Taylor    Illinois

By default, duplicated() marks the second and subsequent occurrences of rows as duplicates (returns True).

If you want to see all duplicate rows (including the first occurrence), you can use:

# Find all duplicate rows (including first occurrences)
all_duplicates = df[df.duplicated(keep=False)]
print("\nAll duplicate rows:")
print(all_duplicates)
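The keep parameter also accepts 'last', which flags the earlier occurrences as duplicates instead. This is handy when you want to keep the most recent record for each customer. A quick sketch reusing the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith',
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor'],
    'state': ['California', 'New York', 'Texas', 'California',
              'Florida', 'Illinois', 'Illinois']
})

# keep='last' marks earlier occurrences as duplicates,
# so the final copy of each row is the one treated as "kept"
earlier_copies = df[df.duplicated(keep='last')]
print(earlier_copies)  # rows 0 and 5
```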

Method 2: Find Duplicates in Specific Columns

Sometimes, you only need to check specific columns for duplicates. For example, I often need to find customers with identical names but different IDs.

Here’s how I handle that:

# Find duplicates based on name column only
name_duplicates = df[df.duplicated(subset=['name'], keep=False)]
print("Rows with duplicate names:")
print(name_duplicates)

# Find duplicates based on multiple columns
location_duplicates = df[df.duplicated(subset=['name', 'state'], keep=False)]
print("\nRows with duplicate name and state combinations:")
print(location_duplicates)

Output:

Rows with duplicate names:
   customer_id           name       state
0          101     John Smith  California
3          101     John Smith  California
5          105  Robert Taylor    Illinois
6          105  Robert Taylor    Illinois

Rows with duplicate name and state combinations:
   customer_id           name       state
0          101     John Smith  California
3          101     John Smith  California
5          105  Robert Taylor    Illinois
6          105  Robert Taylor    Illinois

This approach is particularly useful when working with customer databases where certain fields, like email or phone number, should be unique.
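When a field is supposed to be unique, a quick sanity check before hunting for the offending rows is to compare the column length against its number of distinct values. A minimal sketch using the same sample customer IDs:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith',
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor']
})

# is_unique is True only when no value in the column repeats
print(df['customer_id'].is_unique)  # False

# len() minus nunique() tells you how many duplicate keys exist
extra_ids = len(df) - df['customer_id'].nunique()
print(extra_ids)  # 2
```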

Method 3: Count Duplicates with value_counts()

The value_counts() method is a quick and effective way to count how many times each value appears in a column, making it ideal for identifying duplicates.

# Count occurrences of each customer name
name_counts = df['name'].value_counts()
print("Count of each name:")
print(name_counts)

# Filter to show only duplicated names (reusing name_counts from above)
duplicate_names = name_counts[name_counts > 1]
print("\nDuplicated names and their counts:")
print(duplicate_names)

Output:

Count of each name:
name
John Smith       2
Robert Taylor    2
Sarah Jones      1
Michael Brown    1
Lisa Garcia      1
Name: count, dtype: int64

Duplicated names and their counts:
name
John Smith       2
Robert Taylor    2
Name: count, dtype: int64

This approach is especially useful for highlighting repeated entries and understanding the frequency distribution of values in your dataset.
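Because duplicated() also returns a Boolean Series, summing it gives a quick total of duplicate rows without listing them. A short sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith',
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor'],
    'state': ['California', 'New York', 'Texas', 'California',
              'Florida', 'Illinois', 'Illinois']
})

# Total fully duplicated rows (second and later occurrences only)
total_dupes = df.duplicated().sum()
print(total_dupes)  # 2

# Number of distinct names that appear more than once
name_counts = df['name'].value_counts()
repeated_names = (name_counts > 1).sum()
print(repeated_names)  # 2
```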

Method 4: Use drop_duplicates() to View Unique Rows

You can use the drop_duplicates() method in pandas to view unique rows or unique combinations of specific columns in your DataFrame.

# Get unique rows
unique_customers = df.drop_duplicates()
print("Unique customer records:")
print(unique_customers)

# Get unique combinations of name and state
unique_name_state = df.drop_duplicates(subset=['name', 'state'])
print("\nUnique name and state combinations:")
print(unique_name_state)

This method helps remove duplicate entries and analyze distinct records or attribute combinations efficiently.
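drop_duplicates() accepts the same keep parameter as duplicated(). A sketch of the two variants I reach for most, using the sample df from Method 1:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith',
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor'],
    'state': ['California', 'New York', 'Texas', 'California',
              'Florida', 'Illinois', 'Illinois']
})

# keep='last' retains the most recent record per customer_id
latest = df.drop_duplicates(subset=['customer_id'], keep='last')
print(len(latest))  # 5

# keep=False drops every row that has a duplicate anywhere
strictly_unique = df.drop_duplicates(keep=False)
print(len(strictly_unique))  # 3
```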

Method 5: Find Duplicates with groupby()

The groupby() method in pandas lets you identify duplicate groups based on specific column combinations with greater control and flexibility.

# Group by columns and find groups with more than one record
duplicate_groups = df.groupby(['name', 'state']).filter(lambda x: len(x) > 1)
print("Groups with duplicates:")
print(duplicate_groups)

This approach is useful for detecting repeated patterns or records across groups, making it ideal for deeper data analysis.
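A related pattern I use when I want the counts rather than the rows themselves is groupby() with size(). A sketch on the same sample df:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith',
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor'],
    'state': ['California', 'New York', 'Texas', 'California',
              'Florida', 'Illinois', 'Illinois']
})

# Size of each (name, state) group
group_sizes = df.groupby(['name', 'state']).size()

# Keep only groups that occur more than once
dup_groups = group_sizes[group_sizes > 1]
print(dup_groups)
```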

Method 6: Visualize Duplicates

Visualizing duplicates using a bar chart can help quickly identify which values or groups appear most frequently in your dataset.

import matplotlib.pyplot as plt

# Count records per state
duplicate_analysis = df['state'].value_counts().reset_index()
duplicate_analysis.columns = ['State', 'Count']

# Sort by count
duplicate_analysis = duplicate_analysis.sort_values('Count', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
plt.bar(duplicate_analysis['State'], duplicate_analysis['Count'])
plt.title('Records by State - Potential Duplicates')
plt.xlabel('State')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This method makes it easier to detect patterns or anomalies in large datasets by providing a clear visual representation of duplicate distributions.

Practical Example: Clean Customer Email Data

Here’s a real-world scenario I encountered when cleaning a customer database:

# Sample customer email data
email_data = {
    'customer_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 
              'john.doe@example.com', 'JANE.SMITH@EXAMPLE.COM', 
              'mike.brown@example.com', 'john.doe@example.net']
}

email_df = pd.DataFrame(email_data)

# Standardize emails (lowercase)
email_df['email_standardized'] = email_df['email'].str.lower()

# Find duplicates after standardization
email_duplicates = email_df[email_df.duplicated(subset=['email_standardized'], keep=False)]
print("Email duplicates after standardization:")
print(email_duplicates)

# Get counts of duplicate emails
duplicate_email_counts = email_df['email_standardized'].value_counts()
print("\nEmail counts:")
print(duplicate_email_counts[duplicate_email_counts > 1])

This approach helped me identify duplicate customers who had registered multiple times with slight variations in their email addresses.
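Once the duplicates are identified, keeping only each customer's first registration is a single drop_duplicates() call. A sketch continuing from the email example (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

email_df = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'email': ['john.doe@example.com', 'jane.smith@example.com',
              'john.doe@example.com', 'JANE.SMITH@EXAMPLE.COM',
              'mike.brown@example.com', 'john.doe@example.net']
})

# Standardize before deduplicating, or case variants slip through
email_df['email_standardized'] = email_df['email'].str.lower()

# Keep the earliest registration for each standardized email
clean_df = email_df.drop_duplicates(subset=['email_standardized'], keep='first')
print(clean_df[['customer_id', 'email']])  # 4 rows remain
```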

Advanced Duplicate Detection with Fuzzy Matching

Sometimes, duplicates aren’t exact matches. For customer names with typos or variations, I use fuzzy matching:

# This requires the fuzzywuzzy package (since renamed to thefuzz;
# the old name still installs):
# pip install fuzzywuzzy[speedup]
from fuzzywuzzy import process

# Sample data with name variations
names_data = {
    'customer_id': [201, 202, 203, 204, 205],
    'name': ['Michael Johnson', 'Mike Johnson', 'Michelle Williams', 
             'Michelle Wiliams', 'John Kennedy']
}

names_df = pd.DataFrame(names_data)

# Function to find similar names (reports each pair once)
def find_similar_names(name_series, threshold=80):
    unique_names = list(name_series.unique())
    results = []

    for i, name in enumerate(unique_names):
        # Compare only against later names so (A, B) and (B, A)
        # are not both reported
        matches = process.extract(name, unique_names[i + 1:], limit=5)
        for match_name, score in matches:
            if score >= threshold:
                results.append((name, match_name, score))

    return pd.DataFrame(results, columns=['Name1', 'Name2', 'Similarity'])

similar_names = find_similar_names(names_df['name'])
print("Potential fuzzy duplicates:")
print(similar_names)

This technique has been invaluable when dealing with customer databases where names might be entered inconsistently.

I hope these methods help you effectively identify and handle duplicates in your pandas DataFrames. Finding duplicates is a crucial step in data cleaning and preprocessing, especially when working with customer data or any dataset where duplicate records can skew your analysis.

The right approach depends on your data: duplicated() for quick flagging, subset checks for key columns, value_counts() or groupby() for frequency counts, and fuzzy matching when entries vary in spelling.
