How to Drop Duplicates using drop_duplicates() function in Python Pandas

In this Python tutorial, we will learn how to drop duplicates using the drop_duplicates() function in Python pandas. The datasets used in this blog are either self-created or downloaded from Kaggle. We will cover these topics:

  • Python pandas drop duplicates
  • Pandas drop duplicates based on column
  • Python pandas drop duplicates keep last
  • Pandas drop duplicates multiple columns
  • Python Pandas drop duplicates subset
  • Python pandas drop duplicates index
  • Python pandas drop duplicates not working
  • Python pandas drop duplicates from list
  • Python pandas drop duplicates case sensitive

If you are new to Python pandas, check out our article on Pandas in Python.

Python pandas drop duplicates

  • In this section, we will learn how to drop duplicates using the drop_duplicates() function in Python pandas.
  • When working with a dataset, we sometimes need only the unique entries, so the duplicate values have to be removed.
  • Removing duplicates is a part of data cleaning. The drop_duplicates() function removes duplicate rows, comparing either all of the columns or only specific column(s).
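As a minimal sketch of the simplest case, the example below drops rows that are repeated in full. The DataFrame, its column names and its values are made up for illustration and are not from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 3 repeat row 0 exactly.
df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ben", "Asha"],
    "city": ["Pune", "Pune", "Oslo", "Pune"],
})

# With no arguments, a row is a duplicate only if every column matches
# an earlier row. The original df is left unchanged.
unique_rows = df.drop_duplicates()
print(unique_rows)
```

The result keeps the original row labels of the surviving rows, so the index is no longer consecutive.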

Syntax:

Here is the syntax of drop_duplicates(). The syntax is divided into a few parts to show what the function can do.

Remove duplicate rows from the entire dataset:

df.drop_duplicates()

subset is used to remove rows that have duplicate values in a specific column:

df.drop_duplicates(subset='column_name')

A list of column names is passed to subset to remove rows that are duplicated across multiple columns:

df.drop_duplicates(subset=['column1', 'column2', 'column3'])

The keep option is set to ‘last’ to remove duplicates and keep only the last occurrence:

df.drop_duplicates(subset='column_name', keep='last')

The keep option is set to ‘first’ to remove duplicates and keep only the first occurrence:

df.drop_duplicates(subset='column_name', keep='first')

The keep option is set to False to remove every row that has a duplicate in the given column(s):

df.drop_duplicates(subset='column_name', keep=False)

Here is a detailed description of the options available for the drop_duplicates() function.

Options        Explanation
subset         column label or sequence of labels, optional
               Only consider certain columns for identifying duplicates; by default, all of the columns are used.
keep           {‘first’, ‘last’, False}, default ‘first’
               Determines which duplicates (if any) to keep.
               – first: Drop duplicates except for the first occurrence.
               – last: Drop duplicates except for the last occurrence.
               – False: Drop all duplicates.
inplace        bool, default False
               Whether to drop duplicates in place or to return a copy.
ignore_index   bool, default False
               If True, the resulting axis will be labeled 0, 1, 2, …, n-1.
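The inplace and ignore_index options are easy to mix up, so here is a small sketch of how they behave. The data is made up, and ignore_index assumes a reasonably recent pandas (it was added in version 1.0).

```python
import pandas as pd

# Hypothetical sample data with one repeated value.
df = pd.DataFrame({"fruit": ["apple", "kiwi", "apple", "plum"]})

# Default behaviour: a new DataFrame is returned and the surviving
# rows keep their original labels.
kept_labels = df.drop_duplicates()

# ignore_index=True relabels the result 0, 1, ..., n-1.
relabeled = df.drop_duplicates(ignore_index=True)

# inplace=True modifies df itself; the call then returns None.
result = df.drop_duplicates(inplace=True)

print(kept_labels.index.tolist())
print(relabeled.index.tolist())
print(result, len(df))
```

Because the inplace call returns None, assigning its result back to df would discard the data; use either the returned copy or inplace=True, not both.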

Also you may like, Python Pandas CSV Tutorial.

Python Pandas drop duplicates based on column

  • In this section, we will learn how to drop duplicates based on columns in Python Pandas.
  • To drop duplicates column-wise, we have to provide the column names in subset.

Syntax:

In this syntax, we are dropping rows that have duplicate values in a single column named ‘column_name’:

df.drop_duplicates(subset='column_name')

Here is the implementation of dropping duplicates based on a column in a Jupyter notebook.
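A short sketch of the column-based case, using an invented orders table (the column names customer and order_id are assumptions for illustration). No row is repeated in full, so only the subset argument finds duplicates here.

```python
import pandas as pd

# Hypothetical sample data: "Asha" appears twice, but with different orders.
df = pd.DataFrame({
    "customer": ["Asha", "Ben", "Asha", "Cara"],
    "order_id": [101, 102, 103, 104],
})

# Comparing whole rows finds nothing to drop.
whole_rows = df.drop_duplicates()

# Comparing only the 'customer' column drops the second "Asha" row.
by_customer = df.drop_duplicates(subset="customer")

print(len(whole_rows), len(by_customer))
print(by_customer)
```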

Check out, Missing Data in Pandas in Python.

Python pandas drop duplicates keep last

  • In this section, we will learn about the keep option in pandas drop_duplicates(). We will learn how to keep the first or last occurrence of the data.
  • The drop_duplicates() function compares the rows of the provided column(s). It keeps track of the first occurrence of each value; if the same value occurs again, that row is removed.
  • By default, the drop_duplicates() function uses keep='first'.

Syntax:

  • In this syntax, subset holds the name of the column from which duplicate values will be removed, and keep can be ‘first’, ‘last’ or False.
  • If keep is set to ‘first’, the first occurrence of the data is kept and the remaining duplicates are removed.
  • If keep is set to ‘last’, the last occurrence of the data is kept and the remaining duplicates are removed.
  • If keep is set to False (the boolean, not the string ‘False’), all rows with duplicate values are removed, including the first and last occurrences.

df.drop_duplicates(subset='column_name', keep='last')

Here is the implementation in a Jupyter notebook; do read the comments for a step-by-step explanation.
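To compare the three keep settings side by side, here is a minimal sketch on an invented login log (the column names user and login are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample data: user "u1" appears three times.
df = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3", "u1"],
    "login": [1, 2, 3, 4, 5],
})

# keep='first' (the default): the earliest row for each user survives.
first = df.drop_duplicates(subset="user", keep="first")

# keep='last': the latest row for each user survives.
last = df.drop_duplicates(subset="user", keep="last")

# keep=False: every user that appears more than once is removed entirely.
no_repeats = df.drop_duplicates(subset="user", keep=False)

print(first["login"].tolist())       # [1, 2, 4]
print(last["login"].tolist())        # [2, 4, 5]
print(no_repeats["login"].tolist())  # [2, 4]
```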

Check out Crosstab in Python Pandas.

Pandas drop duplicates multiple columns

  • In this section, we will learn how to drop duplicates from multiple columns in Python Pandas.
  • A list of columns is passed to subset; the keep option can be provided as needed.
  • If the output is confusing, please refer to our implementation in the Jupyter notebook below.

Syntax:

In this syntax, we are dropping rows that are duplicated across multiple columns named ‘column1’, ‘column2’ and ‘column3’.

df.drop_duplicates(subset=['column1', 'column2', 'column3'])

Here is the implementation in a Jupyter notebook; to understand it better, please read the step-by-step comments inside the notebook.
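When several columns are passed, a row is dropped only if the combination of values in all of those columns has been seen before. The sketch below uses made-up names and scores to show how that differs from a single-column subset.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 share both first and last name.
df = pd.DataFrame({
    "first_name": ["Asha", "Asha", "Asha", "Ben"],
    "last_name": ["Rao", "Rao", "Iyer", "Rao"],
    "score": [90, 85, 70, 60],
})

# Duplicates are judged on the (first_name, last_name) pair.
by_full_name = df.drop_duplicates(subset=["first_name", "last_name"])

# For comparison: judging on first_name alone drops more rows.
by_first_name = df.drop_duplicates(subset="first_name")

print(by_full_name)
print(by_first_name)
```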

Also, you may like to read, Groupby in Python Pandas.

Python pandas drop duplicates subset

The subset option in pandas drop_duplicates() accepts a column name or a list of column names on which the drop_duplicates() function will be applied.

Syntax:

In this syntax, the first line shows the use of subset for a single column, whereas the second line shows subset for multiple columns.

df.drop_duplicates(subset='column_name')

df.drop_duplicates(subset=['column1', 'column2', 'column3'])

Python pandas drop duplicates index

  • In this section, we will learn to drop duplicates based on the index.
  • Users can define their own indexes, and these indexes may contain duplicate values.
  • In pandas, the index has a special method for flagging duplicate labels: df.index.duplicated().
  • Here is the implementation in a Jupyter notebook.
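drop_duplicates() compares column values, not index labels, so one common way to remove rows with repeated index labels is to build a boolean mask from the index. The following is a sketch under that assumption, with made-up labels and prices.

```python
import pandas as pd

# Hypothetical sample data: the label "a" is used twice in the index.
df = pd.DataFrame(
    {"price": [10, 20, 30, 40]},
    index=["a", "b", "a", "c"],
)

# Index.duplicated() marks every repeat of an earlier label as True.
dup_mask = df.index.duplicated(keep="first")

# Invert the mask to keep only the first row for each label.
deduped = df[~dup_mask]

print(dup_mask.tolist())
print(deduped)
```

Passing keep="last" or keep=False to duplicated() changes which rows are flagged, in the same way as the keep option of drop_duplicates().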

You may like to read, How to use Pandas drop() function in Python.

Pandas drop duplicates from list

  • In this section, we will learn how to drop duplicates from a list in Python pandas. Here, list simply means the list of columns passed in subset.
  • This section is similar to the section above on ‘Pandas drop duplicates multiple columns’.

Syntax:

In this syntax, we are dropping rows that are duplicated across multiple columns named ‘column1’, ‘column2’ and ‘column3’.

df.drop_duplicates(subset=['column1', 'column2', 'column3'])

Here is the implementation in a Jupyter notebook; to understand it better, please read the step-by-step comments inside the notebook.
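Since subset takes an ordinary Python list, the list can be stored in a variable or built in code rather than typed inline. The sketch below assumes a made-up sales table and treats every column except sales as the key.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 share the same city and year.
df = pd.DataFrame({
    "city": ["Pune", "Pune", "Oslo", "Oslo"],
    "year": [2020, 2020, 2020, 2021],
    "sales": [5, 7, 3, 4],
})

# Build the list of key columns programmatically: everything except 'sales'.
key_columns = [col for col in df.columns if col != "sales"]

# Pass the list variable to subset; keep the last row for each key.
deduped = df.drop_duplicates(subset=key_columns, keep="last")

print(key_columns)
print(deduped)
```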

Python pandas drop duplicates case sensitive

  • In this section, we will learn how to drop duplicates that differ only in letter case in Python Pandas. Case sensitive means that some of the values are in uppercase whereas the rest are in lowercase, or vice versa.
  • You may have noticed that while filling in a simple survey form, people enter details in uppercase, lowercase, or a mix of both, for example PythonGuides.
  • When these surveys are processed by data scientists, it becomes important to make the data homogeneous, so they convert all of it to either uppercase or lowercase.
  • In our case, we will first convert the dataset to lowercase and then perform the drop_duplicates() operation.

Here is the implementation in a Jupyter notebook; the steps are explained with comments and markdown cells in the notebook.
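The lowercase-then-drop approach described above can be sketched as follows. The column name site and its values are made up for illustration. The last step is a variation, not part of the original approach, for cases where the original spelling should be preserved.

```python
import pandas as pd

# Hypothetical sample data: the same word typed with different casing.
df = pd.DataFrame(
    {"site": ["PythonGuides", "pythonguides", "PYTHONGUIDES", "Pandas"]}
)

# Without normalizing, all four strings are distinct, so nothing is dropped.
case_sensitive = df.drop_duplicates()

# Convert to lowercase first, then drop duplicates.
lowered = df.copy()
lowered["site"] = lowered["site"].str.lower()
case_insensitive = lowered.drop_duplicates()

# Variation: flag duplicates on the lowercased values, but keep the
# original spelling of the first occurrence.
mask = df["site"].str.lower().duplicated()
original_spelling = df[~mask]

print(case_insensitive["site"].tolist())
print(original_spelling["site"].tolist())
```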

In this tutorial, we have learned how to drop duplicates using the drop_duplicates() function in Python pandas. We have also covered these topics:

  • How to drop duplicates based on a column in Python pandas
  • How to drop duplicates keep last in Pandas
  • How to drop duplicates multiple columns in Python pandas
  • Python pandas drop duplicates subset
  • How to drop duplicates index in Python pandas
  • How to drop duplicates from list in Python pandas
  • Python pandas drop duplicates case sensitive