Finding and handling duplicate values is a common task when working with data in Python. Recently, I was analyzing a customer dataset for a US e-commerce company and needed to identify duplicate customer records that were skewing our analytics.
The pandas library makes this process easy with several built-in methods.
In this guide, I will show you how to find duplicates in pandas using various approaches I’ve refined over my years of data analysis work.
Let's get started.
Method 1: Use duplicated() to Identify Duplicate Rows
The most direct way to find duplicates in pandas is the duplicated() method, which returns a Boolean Series indicating which rows are duplicates.
Here’s how I use it when analyzing customer data:
import pandas as pd
# Sample US customer data
data = {
    'customer_id': [101, 102, 103, 101, 104, 105, 105],
    'name': ['John Smith', 'Sarah Jones', 'Michael Brown', 'John Smith',
             'Lisa Garcia', 'Robert Taylor', 'Robert Taylor'],
    'state': ['California', 'New York', 'Texas', 'California',
              'Florida', 'Illinois', 'Illinois']
}
# Create DataFrame
df = pd.DataFrame(data)
# Find duplicate rows
duplicates = df.duplicated()
print("Duplicate rows:")
print(df[duplicates])

Output:

Duplicate rows:
   customer_id           name       state
3          101     John Smith  California
6          105  Robert Taylor    Illinois

By default, duplicated() marks the second and subsequent occurrences of rows as duplicates (returns True).
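This behavior is controlled by the keep parameter. A minimal sketch on a tiny frame (the customer_id values here are made up for illustration) shows how keep='last' flips which occurrence gets flagged:

```python
import pandas as pd

df = pd.DataFrame({'customer_id': [101, 102, 101]})

# keep='first' (default): the later occurrence of 101 is flagged
print(df.duplicated().tolist())             # [False, False, True]

# keep='last': the earlier occurrence is flagged instead
print(df.duplicated(keep='last').tolist())  # [True, False, False]
```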
If you want to see all duplicate rows (including the first occurrence), you can use:
# Find all duplicate rows (including first occurrences)
all_duplicates = df[df.duplicated(keep=False)]
print("\nAll duplicate rows:")
print(all_duplicates)

Method 2: Find Duplicates in Specific Columns
Sometimes, you only need to check specific columns for duplicates. For example, I often need to find customers with identical names but different IDs.
Here’s how I handle that:
# Find duplicates based on name column only
name_duplicates = df[df.duplicated(subset=['name'], keep=False)]
print("Rows with duplicate names:")
print(name_duplicates)
# Find duplicates based on multiple columns
location_duplicates = df[df.duplicated(subset=['name', 'state'], keep=False)]
print("\nRows with duplicate name and state combinations:")
print(location_duplicates)

Output:

Rows with duplicate names:
   customer_id           name       state
0          101     John Smith  California
3          101     John Smith  California
5          105  Robert Taylor    Illinois
6          105  Robert Taylor    Illinois

Rows with duplicate name and state combinations:
   customer_id           name       state
0          101     John Smith  California
3          101     John Smith  California
5          105  Robert Taylor    Illinois
6          105  Robert Taylor    Illinois

This approach is particularly useful when working with customer databases where certain fields, like email or phone number, should be unique.
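For such fields, a quick uniqueness check is often all you need before digging into the duplicates themselves. A small sketch (the sample addresses below are hypothetical):

```python
import pandas as pd

emails = pd.DataFrame({'email': ['a@example.com', 'b@example.com', 'a@example.com']})

# is_unique gives a fast yes/no answer
print(emails['email'].is_unique)                  # False

# duplicated().sum() counts the redundant copies
print(emails.duplicated(subset=['email']).sum())  # 1
```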
Method 3: Count Duplicates with value_counts()
The value_counts() method is a quick and effective way to count how many times each value appears in a column, making it ideal for identifying duplicates.
# Count occurrences of each customer name
name_counts = df['name'].value_counts()
print("Count of each name:")
print(name_counts)
# Filter to show only duplicated names
duplicate_names = name_counts[name_counts > 1]
print("\nDuplicated names and their counts:")
print(duplicate_names)

Output:

Count of each name:
name
John Smith       2
Robert Taylor    2
Sarah Jones      1
Michael Brown    1
Lisa Garcia      1
Name: count, dtype: int64

Duplicated names and their counts:
name
John Smith       2
Robert Taylor    2
Name: count, dtype: int64

This approach is especially useful for highlighting repeated entries and understanding the frequency distribution of values in your dataset.
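If you only need the overall number of duplicate rows rather than per-value counts, duplicated() combines neatly with sum(), since each True counts as 1. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3, 3, 3]})

# One extra copy of 2 and two extra copies of 3
print(df.duplicated().sum())  # 3
```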
Method 4: Use drop_duplicates() to View Unique Rows
You can use pandas' drop_duplicates() method to view unique rows or unique combinations of specific columns in your DataFrame.
# Get unique rows
unique_customers = df.drop_duplicates()
print("Unique customer records:")
print(unique_customers)
# Get unique combinations of name and state
unique_name_state = df.drop_duplicates(subset=['name', 'state'])
print("\nUnique name and state combinations:")
print(unique_name_state)

This method helps remove duplicate entries and analyze distinct records or attribute combinations efficiently.
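drop_duplicates() accepts the same keep parameter as duplicated(), which controls which copy survives. A minimal sketch with made-up ids:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 3]})

# Default keep='first': the first copy of each value survives
print(df.drop_duplicates()['id'].tolist())             # [1, 2, 3]

# keep='last': the last copy survives instead
print(df.drop_duplicates(keep='last')['id'].tolist())  # [2, 1, 3]

# keep=False: every row that has a duplicate is removed entirely
print(df.drop_duplicates(keep=False)['id'].tolist())   # [2, 3]
```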
Method 5: Find Duplicates with groupby()
The groupby() method in Pandas allows you to identify duplicate groups based on specific column combinations with greater control and flexibility.
# Group by columns and find groups with more than one record
duplicate_groups = df.groupby(['name', 'state']).filter(lambda x: len(x) > 1)
print("Groups with duplicates:")
print(duplicate_groups)

This approach is useful for detecting repeated patterns or records across groups, making it ideal for deeper data analysis.
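A related groupby() pattern is size(), which yields a count per group instead of the rows themselves. A condensed sketch using a trimmed-down version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Sarah Jones', 'John Smith'],
    'state': ['California', 'New York', 'California'],
})

# size() counts rows per (name, state) group
counts = df.groupby(['name', 'state']).size().reset_index(name='count')

# Groups appearing more than once are the duplicates
print(counts[counts['count'] > 1])
```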
Method 6: Visualize Duplicates
Visualizing duplicates using a bar chart can help quickly identify which values or groups appear most frequently in your dataset.
import matplotlib.pyplot as plt
# Create a DataFrame with duplicate counts
duplicate_analysis = df['state'].value_counts().reset_index()
duplicate_analysis.columns = ['State', 'Count']
# Sort by count
duplicate_analysis = duplicate_analysis.sort_values('Count', ascending=False)
# Plot
plt.figure(figsize=(10, 6))
plt.bar(duplicate_analysis['State'], duplicate_analysis['Count'])
plt.title('Records by State - Potential Duplicates')
plt.xlabel('State')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This method makes it easier to detect patterns or anomalies in large datasets by providing a clear visual representation of duplicate distributions.
Practical Example: Clean Customer Email Data
Here’s a real-world scenario I encountered when cleaning a customer database:
# Sample customer email data
email_data = {
    'customer_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'email': ['john.doe@example.com', 'jane.smith@example.com',
              'john.doe@example.com', 'JANE.SMITH@EXAMPLE.COM',
              'mike.brown@example.com', 'john.doe@example.net']
}
email_df = pd.DataFrame(email_data)
# Standardize emails (lowercase)
email_df['email_standardized'] = email_df['email'].str.lower()
# Find duplicates after standardization
email_duplicates = email_df[email_df.duplicated(subset=['email_standardized'], keep=False)]
print("Email duplicates after standardization:")
print(email_duplicates)
# Get counts of duplicate emails
duplicate_email_counts = email_df['email_standardized'].value_counts()
print("\nEmail counts:")
print(duplicate_email_counts[duplicate_email_counts > 1])

This approach helped me identify duplicate customers who had registered multiple times with slight variations in their email addresses.
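Once the duplicates are confirmed, dropping them on the standardized column keeps one record per customer. A sketch following the same standardization step (the sample ids and addresses are made up):

```python
import pandas as pd

email_df = pd.DataFrame({
    'customer_id': [1001, 1002, 1003],
    'email': ['john.doe@example.com', 'JOHN.DOE@EXAMPLE.COM', 'mike@example.com'],
})

# Standardize, then keep the first registration per address
email_df['email_standardized'] = email_df['email'].str.lower()
deduped = email_df.drop_duplicates(subset=['email_standardized'], keep='first')
print(deduped['customer_id'].tolist())  # [1001, 1003]
```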
Advanced Duplicate Detection with Fuzzy Matching
Sometimes, duplicates aren’t exact matches. For customer names with typos or variations, I use fuzzy matching:
# This requires the fuzzywuzzy package (now maintained as thefuzz)
# pip install fuzzywuzzy[speedup]
from fuzzywuzzy import process
# Sample data with name variations
names_data = {
    'customer_id': [201, 202, 203, 204, 205],
    'name': ['Michael Johnson', 'Mike Johnson', 'Michelle Williams',
             'Michelle Wiliams', 'John Kennedy']
}
names_df = pd.DataFrame(names_data)
# Function to find similar names
def find_similar_names(name_series, threshold=80):
    unique_names = name_series.unique()
    results = []
    for name in unique_names:
        matches = process.extract(name, unique_names, limit=5)
        for match_name, score in matches:
            if (match_name != name) and (score >= threshold):
                results.append((name, match_name, score))
    return pd.DataFrame(results, columns=['Name1', 'Name2', 'Similarity'])
similar_names = find_similar_names(names_df['name'])
print("Potential fuzzy duplicates:")
print(similar_names)

This technique has been invaluable when dealing with customer databases where names might be entered inconsistently.
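If installing fuzzywuzzy isn't an option, the standard library's difflib can approximate the same idea. This is a rough sketch, and the 0.8 cutoff is an assumption you would tune for your own data:

```python
import difflib

names = ['Michael Johnson', 'Mike Johnson', 'Michelle Williams', 'Michelle Wiliams']

pairs = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        # ratio() returns a similarity score between 0.0 and 1.0
        score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= 0.8:
            pairs.append((a, b, round(score, 2)))

print(pairs)
```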
I hope these methods help you effectively identify and handle duplicates in your pandas DataFrames. Finding duplicates is a crucial step in data cleaning and preprocessing, especially when working with customer data or any dataset where duplicate records can skew your analysis.
The right approach depends on your requirements: duplicated() and value_counts() are quick ways to flag and count exact duplicates, drop_duplicates() cleans them out, and fuzzy matching catches near-duplicates that exact comparison misses.
Related tutorials you may read:
- Convert a Pandas DataFrame to a List in Python
- Convert a Pandas DataFrame to a Dict Without Index in Python
- Convert a DataFrame to a Nested Dictionary in Python

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, Scikit-Learn, etc., for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and beyond. Check out my profile.