In this Python Pandas tutorial, we will learn what NaN values are and how to drop rows with NaN values from a Pandas DataFrame. We will also look at why you might want to drop NaN values from a DataFrame in Python.
We will cover the following topics:
- A short introduction to null, NaN, or missing values in a Pandas DataFrame
- Methods to drop rows with NaN or missing values in a Pandas DataFrame
- Drop all rows that have a NaN or missing value
- Drop rows that have NaN or missing values in a specific column
- Drop rows that have NaN or missing values based on multiple conditions
- Drop rows that have NaN or missing values based on a threshold
- Why drop NaN or missing values in a Pandas DataFrame
Null or NaN or missing values in Pandas DataFrame
In Pandas, NaN is the default marker for a missing value. Missing values go by several names, such as NaN, None, or null, and Pandas treats both NaN and None as missing.
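As a quick check of this behaviour, the sketch below builds a small Series (the values are made up for illustration) and asks Pandas which entries it considers missing using isna().

```python
import numpy as np
import pandas as pd

# A small illustrative Series (hypothetical values) mixing np.nan and None
values = pd.Series([1.5, np.nan, None, 4.0])

# isna() flags both np.nan and None as missing
missing_mask = values.isna()
print(missing_mask.tolist())  # [False, True, True, False]

# In a numeric Series, None is stored as NaN, so the dtype stays float64
print(values.dtype)  # float64
```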
Many real-world datasets have missing values in some cells. This can happen because a user left some fields blank while filling out a form, or for many other reasons.
If only a small share of the rows, say about 5 percent, have missing values, we can simply drop those rows. To explore this, let us create a DataFrame for further analysis in Python.
Create a Pandas DataFrame with NaN or missing values in it
Let us create our own Pandas DataFrame with multiple rows and some NaN values in it.
Here we have created a dictionary of patients’ data that holds the patients’ names, ages, genders, and the diseases they are suffering from. The dictionary is then passed to “pandas.DataFrame” to convert it into a DataFrame, that is, a table of rows and columns.
# Import necessary libraries
import numpy as np
import pandas as pd

# Create a dictionary with the patients' names, ages, genders, and diseases
data_dict = {
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams", "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan, np.nan, np.nan, np.nan, np.nan, "Male", "Male", np.nan, np.nan, np.nan],
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack", "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
}

# Create a DataFrame using Pandas
Patients_data = pd.DataFrame(data_dict)
Patients_data
In the output, we can see that almost all of the values in the Gender column are NaN, and there are some missing values in the disease and age columns too.
This gives us a Pandas DataFrame that we can use for further analysis in Python.
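Rather than reading the missing values off the printed table, we can count them per column. The sketch below rebuilds the same patient data (the column spelling “Diesease” is kept as in the code above) and combines isna() with sum().

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's patient data (column spelling kept as in the original)
Patients_data = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 5 + ["Male", "Male"] + [np.nan] * 3,
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack",
                 "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = Patients_data.isna().sum()
print(missing_per_column)
# Patient has 0 missing values, Age 2, Gender 8, and Diesease 3
```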
Methods to drop rows with NaN or missing values in Pandas DataFrame
Pandas offers several ways to drop rows that have NaN or missing values from a DataFrame. Here we will go through each of them with examples. The methods are:
- Drop all rows that have a NaN or missing value
- Drop rows that have NaN or missing values in a specific column
- Drop rows that have NaN or missing values based on multiple conditions
- Drop rows that have NaN or missing values based on a threshold
Drop all rows that have NaN or missing values in a Pandas DataFrame
We can drop rows that contain missing or NaN values from a Pandas DataFrame using the method “dropna()”.
This widely used method removes rows with missing values according to the arguments we pass to it.
- In the code below, we call ‘dropna()’ with no arguments, which drops every row of the DataFrame “Patients_data” that contains at least one NaN or missing value.
- It returns only the rows of the DataFrame that have no NaN or missing values.
# Drop the rows that have a NaN or missing value using the method dropna()
Patients_data.dropna()
In the output, only two rows remain (indexes 5 and 6), because they are the only rows in our DataFrame that have no NaN or missing values.
This is how we can drop all the rows that have NaN or missing values in the DataFrame in Python.
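One detail worth keeping in mind is that dropna() returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned to a variable to be kept. The sketch below illustrates this with the same patient data. The name cleaned and the reset_index() step are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's patient data (column spelling kept as in the original)
Patients_data = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 5 + ["Male", "Male"] + [np.nan] * 3,
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack",
                 "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
})

# dropna() returns a new DataFrame; Patients_data itself is not modified
cleaned = Patients_data.dropna()
print(len(Patients_data), len(cleaned))  # 10 2

# The surviving rows keep their original labels (5 and 6);
# reset_index(drop=True) renumbers them from 0 if that is preferred
cleaned = cleaned.reset_index(drop=True)
print(cleaned.index.tolist())  # [0, 1]
```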
Drop rows that have NaN or missing values in a specific column in a Pandas DataFrame
In this section, we will learn how to drop rows that have NaN or missing values in a specific column of a Pandas DataFrame.
- In the code below, we pass the column name ‘Diesease’ to the subset parameter of dropna().
- This checks only the Diesease column. If a row has a NaN or missing value in that column, the entire row is dropped from the Pandas DataFrame.
# Drop the rows that have a NaN or missing value in a specific column
Patients_data.dropna(subset=['Diesease'])
In the original DataFrame, the Diesease column has missing values at index positions 0, 3, and 7, so these rows are dropped in the output.
This is how we can drop rows that have NaN or missing values in a specific column of a Pandas DataFrame in Python.
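The same result can be reproduced with a boolean mask, which makes it easier to see what subset does. The sketch below is an illustration on the same patient data, not a replacement for dropna(). It compares the two approaches using notna().

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's patient data (column spelling kept as in the original)
Patients_data = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 5 + ["Male", "Male"] + [np.nan] * 3,
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack",
                 "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
})

# Approach 1: dropna() restricted to one column
by_dropna = Patients_data.dropna(subset=["Diesease"])

# Approach 2: keep the rows where the column is not missing
by_mask = Patients_data[Patients_data["Diesease"].notna()]

print(by_dropna.equals(by_mask))  # True
print(by_dropna.index.tolist())   # [1, 2, 4, 5, 6, 8, 9]
```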
Drop rows that have NaN or missing values based on multiple conditions in a Pandas DataFrame
Here we drop rows based on multiple conditions. Rather than dropping every row that has a null or missing value, we specify which columns to consider and how many of them must be missing for a row to be dropped.
- The code below calls the “dropna()” method. For the subset parameter, we pass a list of two column names, Gender and Diesease.
- We also set the parameter ‘how’ to ‘all’ in the dropna() call.
- With these settings, a row of the ‘Patients_data’ DataFrame is dropped only when both Gender and Diesease are NaN. Rows that have a value in at least one of the two columns are kept.
# Drop the rows where all of the specified columns are NaN or missing
Patients_data.dropna(subset=['Gender','Diesease'],how='all')
In the output, the rows with indexes 0, 3, and 7 are dropped because, in each of these rows, both the Diesease and Gender values are missing.
This is how we can drop rows that have NaN or missing values based on multiple conditions in Python.
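To make the effect of the how parameter concrete, the sketch below runs the same call with how='any' and how='all' on the same patient data and compares which rows survive. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's patient data (column spelling kept as in the original)
Patients_data = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 5 + ["Male", "Male"] + [np.nan] * 3,
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack",
                 "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
})

cols = ["Gender", "Diesease"]

# how='any': drop a row if at least one of the two columns is missing
kept_any = Patients_data.dropna(subset=cols, how="any")
print(kept_any.index.tolist())  # [5, 6]

# how='all': drop a row only if both columns are missing
kept_all = Patients_data.dropna(subset=cols, how="all")
print(kept_all.index.tolist())  # [1, 2, 4, 5, 6, 8, 9]
```

With how='any', only the rows that have values in both columns survive. With how='all', a row survives as long as at least one of the two columns has a value.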
Drop rows that have NaN or missing values based on a threshold in a Pandas DataFrame
We can also drop the rows that have fewer than ‘n’ non-missing values in the DataFrame.
- In the code below, we pass the thresh parameter to the built-in Pandas method dropna().
- Here, we set thresh to 4, so the code drops every row that does not reach the threshold. In other words, it returns only the rows that have at least 4 non-null values.
# Drop the rows that have fewer non-missing values than the threshold
Patients_data.dropna(thresh=4)
In the output, only 2 rows of the entire DataFrame (indexes 5 and 6) have at least 4 non-missing values.
This is how we can drop the rows with NaN or missing values based on the threshold in Python Pandas.
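Since thresh counts non-missing values per row, it can help to compute those counts explicitly. The sketch below does this on the same patient data with notna().sum(axis=1), then tries a lower threshold of 3. That value is chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's patient data (column spelling kept as in the original)
Patients_data = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 5 + ["Male", "Male"] + [np.nan] * 3,
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack",
                 "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
})

# Number of non-missing values in each row
non_null_counts = Patients_data.notna().sum(axis=1)
print(non_null_counts.tolist())  # [1, 3, 3, 2, 3, 4, 4, 2, 2, 3]

# thresh=3 keeps the rows with at least 3 non-missing values
kept = Patients_data.dropna(thresh=3)
print(kept.index.tolist())  # [1, 2, 4, 5, 6, 9]

# The same selection written as a boolean mask
same_rows = Patients_data[non_null_counts >= 3]
print(kept.equals(same_rows))  # True
```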
Why drop NaN or missing values in the DataFrame in Python
Any machine learning model has to be trained on data. Many algorithms cannot handle missing values directly, so training on data with missing values can raise errors or lead to less accurate results. Dropping rows that have missing values is part of data cleaning.
For this reason, we have to either drop the rows that have missing values or fill in the missing values. If less than 5 percent of the training data is missing, we can simply drop those rows from the DataFrame or dataset in Python.
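To apply this rule of thumb, we first need to know what share of the rows has missing values. The sketch below computes it for the same patient data and, as one possible alternative to dropping, fills the missing ages with the column mean. Mean imputation is an assumption made here for illustration and is only one of several ways to fill values.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's patient data (column spelling kept as in the original)
Patients_data = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [np.nan, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 5 + ["Male", "Male"] + [np.nan] * 3,
    "Diesease": [np.nan, "Heart Attack", "Cancer", np.nan, "Heart Attack",
                 "Brain Stroke", "Acidity", np.nan, "Brain Stroke", "Skin Cancer"],
})

# True for every row that has at least one missing value
rows_with_missing = Patients_data.isna().any(axis=1)

# The mean of a boolean Series is the fraction of True values
fraction_missing = rows_with_missing.mean()
print(fraction_missing)  # 0.8, far above the 5 percent rule of thumb

# Illustrative alternative to dropping: fill missing ages with the mean age
filled_age = Patients_data["Age"].fillna(Patients_data["Age"].mean())
print(filled_age.isna().sum())  # 0
```

In this small example 8 of the 10 rows contain a missing value, so dropping every incomplete row would discard most of the data.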
Conclusion
In this Python Pandas tutorial, we have covered the following topics:
- What NaN values are
- Why drop rows that have NaN or missing values
- Different methods to drop rows with missing values from a DataFrame or dataset
We have covered all these topics with examples, which should make it easier to get started with Pandas in Python.