Missing Data in Pandas in Python

In this machine learning tutorial, we will learn about missing data in pandas in Python. We will cover these topics:

  • Missing Data in Pandas
  • Missing Data in Pandas DataFrame
  • Missing Data in Time Series in Pandas
  • Count Missing Data in Pandas
  • Remove Missing Data in Pandas
  • Interpolate Missing Data in Pandas
  • Impute Missing Data in Pandas

Missing Data Pandas

  • Missing data refers to values that are absent from a dataset. A dataset is a collection of information, often large, that has been recorded over time.
  • This information could be related to anything, such as customer surveys, species of plants, animals, insects, microbes, natural calamities, internet activities, etc.
  • Various websites offer datasets for download. A few examples are Data.gov, Google Public, and Buzzfeed News; Kaggle.com is one of the most popular.
  • Missing data appears as an empty cell in Excel or as an empty field in a CSV file, but when the file is read using pandas, NaN is shown in its place.

When the same dataset is read using pandas, NaN appears in place of the empty fields.
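
As a minimal sketch of this behavior, the snippet below reads a small CSV with empty fields. The column names and values are made up for illustration, and the text is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the empty fields stand for missing values.
csv_text = """name,age,city
Alice,30,Boston
Bob,,Chicago
Carol,25,
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
df = pd.read_csv(io.StringIO(csv_text))

# Bob's age and Carol's city are displayed as NaN.
print(df)
```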

pandas provides several built-in functions to identify and handle missing data.

  • isnull(), notnull(): identify whether each value is missing or present. They return Boolean values.
  • dropna(): by default, removes every row that has one or more missing values.
  • fillna(): fills the missing values with the provided value.
  • replace(): replaces a given value, such as NaN, with the provided value.
  • interpolate(): fills the missing values with values estimated from the neighboring data points. This is often preferable to hard-coding a fill value.

Missing Data Pandas DataFrame

  • In this section, we will learn how to create and handle missing data in a DataFrame.
  • pandas treats None as a missing value; in numeric columns it is converted to NaN.
  • In a DataFrame, we can identify missing data by using the isnull() and notnull() functions.
  • isnull() returns True for every missing value and False for every value that is present.
  • notnull() returns True for every value that is present and False for every missing value.
  • To remove all the rows that have missing data, we use the dropna() function.
  • The replace() function replaces one value with another. It takes two commonly used arguments.
    • to_replace: the value you want to change
    • value: the new value you want to provide
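
The functions described above can be sketched as follows. The DataFrame contents are hypothetical, and this is only one way to exercise these calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data; None marks a value that was never recorded.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [90.0, None, 75.0],
})

# None in the numeric column is stored as NaN.
print(df)

missing_mask = df.isnull()    # True where a value is missing
present_mask = df.notnull()   # True where a value is present

# Keep only the rows that have no missing values.
cleaned = df.dropna()

# Replace NaN in the score column with 0 using to_replace and value.
replaced = df["score"].replace(to_replace=np.nan, value=0)
```

Note that dropna() and replace() return new objects here; the original df still contains the NaN.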

Time Series Missing Data Pandas

  • Time series data is data indexed by time, such as daily or hourly readings. Missing data in a time series refers to the timestamps for which no value was recorded.
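
One possible way to make such gaps visible is sketched below. It assumes daily data in which some days have no row at all; the dates and readings are made up, and asfreq() is just one of several pandas tools that could be used for this.

```python
import pandas as pd

# Hypothetical daily readings; there are no rows for January 3 and 4.
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
readings = pd.Series([10.0, 12.0, 18.0], index=dates)

# Convert to a regular daily frequency; days without a reading become NaN.
daily = readings.asfreq("D")

print(daily)
print(daily.isnull().sum())  # number of days without a reading
```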

Count Missing Data Pandas

  • In this section, we will learn how to count the missing values present in the data.
  • To do so, we will use two functions.
    • isnull() – returns True for missing values
    • sum() – adds up the True values, each counted as 1
  • Combining both functions gives the count of missing values in each column: df.isnull().sum()
  • Calling sum() once more gives the total count for the whole dataset: df.isnull().sum().sum()
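
A short sketch of the counting step, using a hypothetical two-column DataFrame:

```python
import pandas as pd

# Hypothetical data with three missing values in total.
df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [None, None, "x"],
})

# Missing values per column: a has 1, b has 2.
per_column = df.isnull().sum()
print(per_column)

# Missing values in the whole DataFrame: 3.
total = df.isnull().sum().sum()
print(total)
```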

Remove Missing Data Pandas

  • Removing missing data is part of data cleaning.
  • Missing data can be handled either by filling in the missing value or by deleting the entire row that contains it.
  • A value can be filled in by hard-coding it or by computing it with an algorithm.
  • fillna() is a built-in function that replaces all NaN values with a given value.
  • A row with missing data can be removed by using the dropna() function.
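
The two approaches can be compared side by side in a small sketch. The data and the fill value of 0.0 are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical price list with one missing price.
df = pd.DataFrame({
    "item": ["pen", "book", "bag"],
    "price": [1.5, None, 20.0],
})

# Option 1: fill the gap with a hard-coded value for the price column.
filled = df.fillna({"price": 0.0})

# Option 2: delete every row that contains a missing value.
dropped = df.dropna()

print(filled)
print(dropped)
```

Both calls return a new DataFrame and leave df unchanged, so the results can be inspected before deciding which one to keep.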

Interpolate Missing Data Pandas

  • interpolate() is a function that fills missing data with estimated values.
  • Instead of hard-coding a value for the missing data, we can use the interpolate() function.
  • By default, interpolate() uses the linear method: it treats the values as equally spaced and fills each gap along a straight line between the neighboring known values.
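
A minimal sketch of the default linear behavior, using a made-up Series with two consecutive gaps:

```python
import pandas as pd

# Hypothetical readings with two missing values in the middle.
s = pd.Series([10.0, None, None, 40.0])

# The default method is linear, so the gaps are filled evenly
# between 10.0 and 40.0, giving 20.0 and 30.0.
filled = s.interpolate()

print(filled)
```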

Impute Missing Data Pandas

  • Imputing missing data simply means using a model to replace missing values.
  • There is more than one option to consider when replacing missing values. A few of them are:
    • A constant value that has meaning within the domain, such as 0, distinct from all other values.
    • A value from another randomly selected record.
    • A mean, median, or mode value for the column.
    • A value estimated by another predictive model.
  • Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.
  • For example, if you choose to impute with median column values, these median column values will need to be stored to file for later use on new data that has missing values.
  • pandas provides the fillna() function for replacing missing values with a specific value.
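
The median example above could be sketched as follows. The column names, the values, the JSON format, and the file name medians.json are all assumptions made for illustration; any storage format would do.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical training data with one gap in each column.
train = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "income": [100.0, 300.0, None, 200.0],
})

# Compute the column medians on the training data only.
medians = train.median().to_dict()
train_imputed = train.fillna(medians)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "medians.json")  # hypothetical file name

    # Store the medians so the same values can be reused later.
    with open(path, "w") as f:
        json.dump(medians, f)

    # Later, load the stored medians and apply them to new data.
    with open(path) as f:
        stored = json.load(f)

new_data = pd.DataFrame({"age": [None, 50.0], "income": [150.0, None]})
new_imputed = new_data.fillna(stored)
print(new_imputed)
```

Reusing the stored training medians, rather than recomputing them on the new data, keeps the imputation consistent with what the model saw during training.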

In this tutorial, we have learned about missing data in pandas. Also, we have covered these topics.

  • Missing Data in Pandas
  • Missing Data in Pandas DataFrame
  • Missing Data in Time Series in Pandas
  • Count Missing Data in Pandas
  • Remove Missing Data in Pandas
  • Interpolate Missing Data in Pandas
  • Impute Missing Data in Pandas