In this Python Pandas tutorial, I will explain how to drop columns with NaN (missing) values from Pandas DataFrames, and when dropping them is appropriate.
You may already know how to drop columns with NaN values, but in this tutorial we will also look at when to drop them, so that we can build a stronger understanding of Pandas.
In this tutorial, we will cover the following methods and considerations for dropping columns with NaN values from a Pandas DataFrame:
- How to drop columns with NaN from Pandas DataFrames in Python
- Drop Columns with missing values or NaNs in the Pandas DataFrame in Python
- Drop Columns where all cell values within a column are NaN or missing in the DataFrame in Python
- Drop columns with fewer than a minimum number of non-null values in a DataFrame in Python
- When to drop columns with NaN values in Pandas DataFrame in Python
- Check the Importance of the column before dropping it from a DataFrame
- Percentage of non-missing or non-null values in the columns of Pandas DataFrame
How to drop columns with NaN from Pandas DataFrames in Python
We can drop columns with NaN from Pandas DataFrames in several ways using built-in Pandas methods. NaN stands for Not a Number, and Pandas uses it to mark a missing value.
To reduce the complexity of the dataset, we drop columns with NaN from the Pandas DataFrame based on certain conditions. To do that, let us first create a DataFrame.
Create a Pandas DataFrame
Let us create a Pandas DataFrame with multiple rows and some NaN values, so that we can practice dropping columns with NaN.
- Here we create a dictionary of patients’ data that holds the patients’ names, ages, genders, and the diseases they are suffering from.
- The dictionary is then passed to the “pandas.DataFrame” constructor to convert it into a DataFrame, i.e. a table of rows and columns.
# Import the libraries used in this tutorial
import numpy as np
import pandas as pd

# Create a dictionary with the patients' names, ages, genders, and diseases
data_dict = {
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams", "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [13, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    "Disease": ["Acidity", "Heart Attack", "Cancer", "Cancer", np.nan, np.nan, "Acidity", "Heart Attack", "Brain Stroke", "Skin Cancer"],
}
# Create a DataFrame using Pandas
Patients_data = pd.DataFrame(data_dict)
Patients_data
In the output, a new DataFrame “Patients_data” is created. Jonas’s age, the gender of all patients, and the diseases of Williams and Nick appear as NaN, which stands for Not a Number and marks a missing value in Pandas.
Create a DataFrame with NaNs in it using Pandas in Python
So far, we have created a DataFrame using Pandas in Python.
Total number of missing values or NaNs in each column of a Pandas DataFrame
The code below returns the total number of missing values in each column of the DataFrame we created, Patients_data.
- “DataFrame.isna()” checks every cell and returns True if the value is NaN and False otherwise. The method “sum()” then counts the True values in each column.
# Total number of missing values or NaN's in the Pandas DataFrame in Python
Patients_data.isna().sum(axis=0)
In the output, we can see that the Patient column has no NaN values, the Age column has 1 NaN value, the Gender column has only NaN values (10), and the disease column has 2 NaN values.
This is how we can get the total number of missing values in each column of a Pandas DataFrame. Now let us look at the dropna() method in Pandas.
dropna() method in Python Pandas
The “DataFrame.dropna()” method in Pandas drops the rows or columns that contain null values, i.e. NaN values.
Syntax of the dropna() method in Pandas:
DataFrame.dropna(axis, how, thresh, subset, inplace)
The parameters that we can pass to the dropna() method are:
- axis: the axis to drop along, either 0 (or 'index') or 1 (or 'columns'); the default is 0
  - axis=0 drops the rows that have missing values
  - axis=1 drops the columns that have missing values
- how: either 'any' (the default behavior) or 'all'
  - how='any' drops a row/column if any of its values is missing
  - how='all' drops a row/column only if all of its values are missing
- thresh: the minimum number of non-null values a row/column must have to be kept; recent Pandas versions do not allow combining it with how
- subset: labels along the other axis to consider, e.g. when dropping rows, the list of columns to check
- inplace: a boolean, either True or False; the default is False
  - inplace=True modifies the original DataFrame and returns None
  - inplace=False leaves the original DataFrame unchanged and returns a new one
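The parameters above can be tried out on a small example. The following is a minimal sketch on a hypothetical three-column DataFrame (the names `df`, `a`, `b`, and `c` are made up for illustration and are not part of the patients data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# axis=0, how="any": drop rows with any NaN (every row has NaN in "b")
rows_any = df.dropna(axis=0, how="any")

# axis=1, how="all": drop only the columns that are entirely NaN
cols_all = df.dropna(axis=1, how="all")

# subset with axis=0: only column "a" decides which rows are dropped
rows_subset = df.dropna(axis=0, subset=["a"])

# thresh=2 with axis=1: keep columns with at least 2 non-null values
cols_thresh = df.dropna(axis=1, thresh=2)

print(rows_any.shape)             # (0, 3)
print(list(cols_all.columns))     # ['a', 'c']
print(list(rows_subset.index))    # [0, 2]
print(list(cols_thresh.columns))  # ['a', 'c']
print(df.shape)                   # (3, 3), the original is unchanged
```

Because inplace is left at its default of False, each call returns a new DataFrame and `df` itself keeps all three columns.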
Drop Columns with missing values or NaN in the DataFrame
Here, we are dropping all the columns that have NaN or missing values in them.
- The method name “dropna()” stands for dropping NA (not available) values from a DataFrame.
- The code below, DataFrame.dropna(axis='columns'), checks every column for missing values such as NaN. If a column contains at least one missing value, the entire column is dropped.
# Drop all the columns that have NaN or missing values
Patients_data.dropna(axis='columns')
In the output, we can observe that every column of the DataFrame “Patients_data” except the “Patient” column has been dropped, because each of them contains NaN or missing values.
This is how we can drop the columns with NaN or missing values in a Pandas DataFrame in Python.
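Before dropping, it can help to see which columns would be removed. The sketch below is one possible way to do that with a boolean mask; it is an alternative to dropna() rather than the method used in this tutorial, and the shortened four-row `patients` DataFrame is an assumption made to keep the example small:

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the patients data
patients = pd.DataFrame({
    "Patient": ["Kelvin", "John", "Jonas", "Nick"],
    "Age": [13, 71, np.nan, 12],
    "Gender": [np.nan] * 4,
    "Disease": ["Acidity", "Heart Attack", "Brain Stroke", np.nan],
})

# Names of the columns that contain at least one NaN
cols_with_nan = patients.columns[patients.isna().any()].tolist()

# Keep only the columns in which every value is present
complete = patients.loc[:, patients.notna().all()]

# The mask-based selection matches dropna(axis="columns")
same = complete.equals(patients.dropna(axis="columns"))

print(cols_with_nan)           # ['Age', 'Gender', 'Disease']
print(list(complete.columns))  # ['Patient']
print(same)                    # True
```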
Drop Columns where all cell values within a column are NaN or missing in the DataFrame
Here we are dropping only the columns in which every cell value is NaN or missing.
- In the code below, the argument how='all' passed to dropna() checks whether a column consists entirely of missing values.
- If such a column exists, the entire column is dropped from the Pandas DataFrame.
# Drop all the columns where all the cell values are NaN
Patients_data.dropna(axis='columns',how='all')
In the output, we can observe that the whole Gender column was dropped from the DataFrame, since it contains only missing or NaN values.
This way we can drop the columns of a Pandas DataFrame that contain only null values.
Drop columns with a minimum number of non-null values in Pandas DataFrame
Here we are keeping the columns that have at least 9 non-null values. The remaining columns, which do not satisfy this condition, are dropped from the Pandas DataFrame.
- The thresh parameter in the code below takes the minimum number of non-null values a column must have to be kept.
- In the code below, thresh is set to 9, which means every column in the DataFrame is checked for at least 9 non-null cell values.
# Drop the columns that don't satisfy the minimum threshold condition
Patients_data.dropna(axis='columns',thresh=9)
In the output, we can observe that the gender and disease columns were dropped, because neither of them has at least 9 non-null values (they have 0 and 8, respectively).
This way we can drop the columns of a Pandas DataFrame that do not have the minimum number of non-null values.
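To see why a given thresh value keeps or drops a column, we can compare it with the non-null count of each column. The following is a small sketch that rebuilds the patients data under the assumed name `patients` (with the column spelled "Disease") and tries two threshold values:

```python
import numpy as np
import pandas as pd

# The tutorial's patients data, rebuilt so this snippet runs on its own
patients = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [13, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 10,
    "Disease": ["Acidity", "Heart Attack", "Cancer", "Cancer", np.nan,
                np.nan, "Acidity", "Heart Attack", "Brain Stroke", "Skin Cancer"],
})

# count() returns the number of non-null values in each column
non_null = patients.count()
print(non_null.to_dict())
# {'Patient': 10, 'Age': 9, 'Gender': 0, 'Disease': 8}

# thresh=9 keeps only the columns with at least 9 non-null values
kept_9 = patients.dropna(axis="columns", thresh=9).columns.tolist()
print(kept_9)  # ['Patient', 'Age']

# Lowering thresh to 8 also keeps the Disease column
kept_8 = patients.dropna(axis="columns", thresh=8).columns.tolist()
print(kept_8)  # ['Patient', 'Age', 'Disease']
```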
When to drop columns with NaN values in Pandas DataFrame
There are certain factors that we have to consider before dropping columns that have null values.
Suppose we have a dataset with 1000 rows and 9 columns, in which 600 rows and 6 columns contain missing values. If we drop every row and column that has a missing value, we might not have enough data left to train the model.
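The effect is easy to reproduce at a small scale. The sketch below uses a hypothetical 6-row, 3-column DataFrame (the names `df`, `x`, `y`, and `z` are invented for illustration) in which only 4 of the 18 cells are missing, and checks how much data survives each kind of drop:

```python
import numpy as np
import pandas as pd

# Hypothetical data: only 4 of the 18 cells are missing
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "y": [np.nan, 2.0, 3.0, np.nan, 5.0, 6.0],
    "z": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# Total number of missing cells in the whole DataFrame
missing_cells = int(df.isna().sum().sum())

# Dropping every row that has a NaN leaves only the last two rows
rows_dropped = df.dropna(axis=0)

# Dropping every column that has a NaN leaves no columns at all
cols_dropped = df.dropna(axis=1)

print(missing_cells)       # 4
print(rows_dropped.shape)  # (2, 3)
print(cols_dropped.shape)  # (6, 0)
```

Even though fewer than a quarter of the cells are missing, dropping rows discards two thirds of the data, and dropping columns discards all of it.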
Check the Importance of the column before dropping it from a DataFrame
- Before dropping a column, we have to check how important it is to the DataFrame, for example how strongly it is correlated with the output column.
- If the correlation is strong, dropping the column is not the best option. Instead, we can fill in the null values with the mean, median, or mode, depending on the data type of the column.
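One possible way to put this check into code is sketched below. The DataFrame, the column names `feature`, `city`, and `target`, and the 0.5 correlation cutoff are all assumptions chosen for illustration; a suitable cutoff depends on the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "target" plays the role of the output column
df = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
    "city": ["A", "B", None, "B", "B", "A"],
    "target": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],
})

# Pearson correlation; rows where either value is missing are ignored
corr = df["feature"].corr(df["target"])

# Arbitrary cutoff, chosen only for this illustration
if abs(corr) >= 0.5:
    # Strongly correlated numeric column: fill the gaps with the median
    df["feature"] = df["feature"].fillna(df["feature"].median())
else:
    # Weakly correlated column: dropping it loses little information
    df = df.drop(columns="feature")

# Categorical column: fill the gaps with the most frequent value (mode)
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df["feature"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 3.0]
print(df["city"].tolist())     # ['A', 'B', 'B', 'B', 'B', 'A']
```

The median is used here for the numeric column and the mode for the text column; the mean would be another reasonable choice for numeric data without strong outliers.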
Percentage of non-missing or non-NaN values in the columns of Pandas DataFrame
- We calculate the percentage of non-missing (non-null) values within each column. Then we specify a threshold: the minimum percentage of non-missing values a column must have to be kept.
- If a column does not reach the threshold, we can drop it. This way, instead of dropping every column with NaN or missing values, we drop only the columns that fall below the chosen threshold.
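The percentage-based approach can be sketched as follows. The snippet rebuilds the patients data under the assumed name `patients`, and the 85% threshold (`min_pct`) is a hypothetical value picked for illustration:

```python
import math

import numpy as np
import pandas as pd

# The tutorial's patients data, rebuilt so this snippet runs on its own
patients = pd.DataFrame({
    "Patient": ["Kelvin", "John", "smith", "Robin", "Williams",
                "Nick", "Anyy", "Messi", "Jonas", "Xavier"],
    "Age": [13, 71, 67, 8, 56, 12, 31, 3, np.nan, 17],
    "Gender": [np.nan] * 10,
    "Disease": ["Acidity", "Heart Attack", "Cancer", "Cancer", np.nan,
                np.nan, "Acidity", "Heart Attack", "Brain Stroke", "Skin Cancer"],
})

# Percentage of non-missing values in each column
pct_non_null = patients.notna().mean() * 100
print(pct_non_null.round(1).to_dict())

# Hypothetical threshold: keep columns that are at least 85% populated
min_pct = 85
kept = patients.loc[:, pct_non_null >= min_pct]

# Same idea with dropna(): convert the percentage into a count for thresh
min_count = math.ceil(min_pct / 100 * len(patients))
kept_dropna = patients.dropna(axis="columns", thresh=min_count)

print(list(kept.columns))  # ['Patient', 'Age']
print(min_count)           # 9
```

With 10 rows, an 85% threshold translates to at least 9 non-null values, so both approaches keep the Patient and Age columns here.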
Conclusion
In this Python Pandas tutorial, we covered how to drop columns that have NaN or missing values from Pandas DataFrames based on certain conditions:
- Drop columns with missing values or NaNs in the Pandas DataFrame in Python
- Drop columns where all cell values within a column are NaN or missing in the DataFrame in Python
- Drop columns with fewer than a minimum number of non-null values in a DataFrame in Python
We also looked at when to drop columns with NaN values from a Pandas DataFrame.
Also, we can go through the following Python Pandas tutorials for a better understanding of Pandas.
- How to drop header row of Pandas DataFrame
- Get column index from column name of Pandas DataFrame
- How to get Index values of Pandas DataFrames