Percentage Normalization using Crosstab() in Pandas

Percentage normalization using crosstab() in pandas is a commonly used technique in data analysis and machine learning workflows in Python. It converts the counts in a crosstable into proportions of the row totals, the column totals, or the grand total, based on the value we pass to the “normalize” parameter. The results are fractions between 0 and 1, which can be multiplied by 100 to read them as percentages.

In this Python tutorial, we will cover percentage normalization using pandas crosstab(), including the following topics:

  • What is Percentage Normalization in the pandas crosstab() function?
  • Loading the dataset to perform percentage normalization using crosstab()
  • Types of Percentage Normalization in pandas crosstab()
  • Index Normalization using pandas crosstab()
  • Column Normalization using pandas crosstab()
  • All or Total Normalization using pandas crosstab()
  • Advantages of Normalization in pandas crosstab()
  • Limitations of Normalization in pandas crosstab()

If you are new to crosstab(), check out the article on Crosstab() in Pandas in Python.

Percentage Normalization in Pandas crosstab()

Normalization is the process of adjusting the values in the table to enable better comparison between different categories. In crosstab(), this means dividing each count by a row total, a column total, or the grand total, so that groups of different sizes can be compared on the same scale.

The “normalize” parameter in the crosstab() function can be set to True, “all”, “index”, or “columns”. It defaults to False, which leaves the raw counts unchanged, and True behaves the same as “all”. It is one of the more commonly used parameters of the pandas crosstab() function.

There are three types of percentage normalization available in the pandas crosstab() function:

  1. Index Normalization
  2. Column Normalization
  3. All or Total Normalization
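
As a quick orientation before working with the real dataset, here is a minimal sketch of the three options on a tiny, made-up table. The DataFrame `toy` and its six rows are hypothetical and are not part of the Titanic data used later.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic dataset): six passengers
toy = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "alive": ["yes", "yes", "no", "no", "no", "yes"],
})

# Raw counts, with no normalization (normalize defaults to False)
counts = pd.crosstab(toy["class"], toy["alive"])

# Index normalization: every row sums to 1
by_row = pd.crosstab(toy["class"], toy["alive"], normalize="index")

# Column normalization: every column sums to 1
by_col = pd.crosstab(toy["class"], toy["alive"], normalize="columns")

# All (total) normalization: the whole table sums to 1
overall = pd.crosstab(toy["class"], toy["alive"], normalize="all")

print(counts)
print(by_row)
print(by_col)
print(overall)
```

The same two columns produce four different tables, and the only thing that changes is the denominator used for each cell.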

Before jumping to normalization, let us first load a dataset that we will use to perform normalization with crosstab() throughout the tutorial.

Let’s consider a real-life example where crosstab percentage normalization can be used to analyze data in Python. We can download the dataset from

“https://www.kaggle.com/code/fourbic/visualizing-the-titanic-data-with-seaborn”

or we can load the same dataset directly from the seaborn library to understand how the crosstab() function works in pandas.

# Import the necessary libraries
import pandas as pd
import seaborn as sns

# Option 1: load the dataset after downloading it manually from Kaggle
# data = pd.read_csv("titanic.csv")

# Option 2: load the dataset using the seaborn library without downloading it manually.
# The examples below use the "class" and "alive" columns of the seaborn version.
data = sns.load_dataset("titanic")
data.head()

Below is the Titanic dataset, which has information about 891 passengers who were on board during the incident. It includes data such as each passenger’s gender, the class they traveled in, whether they survived, the fare paid for the ticket, etc.

Titanic dataset describing the passengers who traveled during the incident

Index Normalization using Pandas Crosstab() in Python

  • Index Normalization in pandas crosstab() converts each cell value to its share of the row total.
  • When we set the “normalize” parameter to “index” in the pandas crosstab() function, it divides the value of each cell by its row total and returns the resulting fraction, so every row sums to 1. Multiplying by 100 gives the percentage.

Index Normalization in pandas crosstab() makes it easier to compare the distribution of categories within each row.

# Normalizing the data in the crosstable by "index" (row totals)
pd.crosstab(index=data['class'], columns=data['alive'], normalize="index")
  • In this example, out of all the first-class passengers, about 37 percent did not survive and about 63 percent survived the accident.
  • We can observe that the survival rate depends on the class in which the passenger traveled. The probability of survival is highest in first class.
Index Normalization using Pandas crosstab()
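
To make the arithmetic behind index normalization concrete, the sketch below rebuilds the class and alive columns from per-cell counts and compares normalize="index" with a manual division by row totals. The counts in `cells` are an assumption on my part (they are the values I understand the 891-row Titanic data to contain, and they agree with the percentages quoted above); verify them with pd.crosstab(data['class'], data['alive']) on the real dataset. The names `cells` and `rebuilt` are illustrative only.

```python
import pandas as pd

# Assumed (class, alive, count) cells for the 891 passengers.
# Check these against the raw crosstab of the real dataset.
cells = [
    ("First", "no", 80), ("First", "yes", 136),
    ("Second", "no", 97), ("Second", "yes", 87),
    ("Third", "no", 372), ("Third", "yes", 119),
]

# Expand the cell counts back into one row per passenger
rebuilt = pd.DataFrame(
    [(cls, alive) for cls, alive, n in cells for _ in range(n)],
    columns=["class", "alive"],
)

counts = pd.crosstab(rebuilt["class"], rebuilt["alive"])
by_row = pd.crosstab(rebuilt["class"], rebuilt["alive"], normalize="index")

# The same result by hand: divide every row by its own total
manual = counts.div(counts.sum(axis=1), axis=0)

# normalize returns fractions, so scale by 100 to get percentages
percent = (by_row * 100).round(1)
print(percent)
```

Under these assumed counts, the first-class row works out to 80/216 and 136/216, which is where the 37 and 63 percent figures come from.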

Column Normalization using Pandas Crosstab() in Python

  • Column Normalization in pandas crosstab() converts each cell value to its share of the column total.
  • When we set the “normalize” parameter to “columns” in the pandas crosstab() function, it divides the value of each cell by its column total and returns the resulting fraction, so every column sums to 1. Multiplying by 100 gives the percentage.

Column Normalization in pandas crosstab() makes it easier to compare the distribution of categories within each column.

# Normalizing the data in the crosstable by "columns"
pd.crosstab(index=data['class'], columns=data['alive'], normalize="columns")
  • In this example, out of all the surviving passengers, about 40 percent are from first class, 25 percent are from second class, and 35 percent are from third class.
  • These values describe where the survivors came from, not the chance of surviving within a class. First class contributes the largest share of survivors, but the survival rate for each class has to be read from the index-normalized table.
Column Normalization using Pandas crosstab()
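
The difference between column and index normalization is easy to mix up, so here is a small sketch that puts them side by side. The ten-row DataFrame `toy` is hypothetical and only serves to keep the numbers easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: 4 first-class and 6 third-class passengers
toy = pd.DataFrame({
    "class": ["First"] * 4 + ["Third"] * 6,
    "alive": ["yes", "yes", "yes", "no"] + ["yes", "no", "no", "no", "no", "no"],
})

counts = pd.crosstab(toy["class"], toy["alive"])

# Column normalization: share of each class within "no" and within "yes"
by_col = pd.crosstab(toy["class"], toy["alive"], normalize="columns")

# The same result by hand: divide every column by its own total
manual = counts / counts.sum(axis=0)

# Index normalization on the same data, for comparison
by_row = pd.crosstab(toy["class"], toy["alive"], normalize="index")

print(by_col)  # First/yes is 3 of 4 survivors
print(by_row)  # Third/yes is 1 of 6 third-class passengers
```

In this toy table, the "yes" column of `by_col` answers "which class did the survivors travel in?", while the "yes" column of `by_row` answers "what fraction of each class survived?".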

All or Total Normalization using Pandas Crosstab() in Python

  • All or Total Normalization in pandas crosstab() converts each cell value to its share of the grand total of all cells.
  • When we set the “normalize” parameter to “all” in the pandas crosstab() function, it divides the value of each cell by the grand total and returns the resulting fraction, so the whole table sums to 1.
  • All or Total Percentage Normalization in pandas crosstab() makes it easier to compare each combination of categories against the dataset as a whole.
# Normalizing the data in the crosstable by "all" (the grand total)
pd.crosstab(index=data['class'], columns=data['alive'], normalize="all")
  • There are 891 passengers in the Titanic dataset. Out of all 891 passengers, about 15 percent are first-class passengers who survived.
  • Likewise, third-class passengers who survived make up about 13 percent of all passengers.
Total Normalization using Pandas crosstab()
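
Total normalization is often easier to read together with the row and column shares. One possible way to get them, sketched below on a hypothetical ten-row DataFrame `toy`, is to combine normalize="all" with the margins=True parameter of crosstab(), which adds an "All" row and an "All" column.

```python
import pandas as pd

# Hypothetical toy data: 4 first-class and 6 third-class passengers
toy = pd.DataFrame({
    "class": ["First"] * 4 + ["Third"] * 6,
    "alive": ["yes", "yes", "yes", "no"] + ["yes", "no", "no", "no", "no", "no"],
})

# Every inner cell is divided by the grand total (10 rows here).
# The "All" margins hold each class's and each outcome's overall share.
overall = pd.crosstab(
    toy["class"], toy["alive"], normalize="all", margins=True
)

print(overall)
```

Here the First/yes cell is 3 out of 10 passengers, the "All" column shows that first class makes up 4 out of 10 passengers, and the bottom-right cell is 1.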

Advantages of Normalization in pandas crosstab()

  • Faster conversion: The “normalize” parameter in pandas crosstab() converts counts to proportions in a single call, with no manual division.
  • Better analysis: Because the normalized values share a common scale, groups of different sizes can be compared directly, which makes it easier to analyze the data accurately.
  • Better visualization: Based on the proportions generated by the “normalize” parameter in crosstab(), we can plot pie charts and area charts and get a better understanding of the data.

Limitations of Normalization in pandas crosstab()

  • Unable to detect errors: Normalized values hide the underlying counts, so problems in the data, such as mislabeled categories or very small groups, are difficult to spot once a large table has been converted to percentages.
  • Displays less information: After aggregation and conversion to percentages, the table shows only the proportions for the columns passed to the function. The raw counts and all other columns are left out.
  • Limited to a few categorical columns: Passing many features at once produces a wide table that is hard to read, so it works best with a limited number of features.
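
One way to soften the first two limitations is to keep the group sizes next to the percentages. The sketch below is only an illustration on hypothetical data: `toy`, `summary`, and the "n" column are names introduced here, not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data: a single first-class passenger and eight in third class
toy = pd.DataFrame({
    "class": ["First"] + ["Third"] * 8,
    "alive": ["yes"] + ["yes"] * 2 + ["no"] * 6,
})

counts = pd.crosstab(toy["class"], toy["alive"])
rates = pd.crosstab(toy["class"], toy["alive"], normalize="index")

# Row percentages, with the number of passengers behind each row
summary = rates.mul(100).round(1).add_suffix(" %")
summary["n"] = counts.sum(axis=1)

print(summary)
```

In this toy table, first class shows a 100 percent survival rate, but the "n" column reveals that the figure rests on a single passenger.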

Conclusion

In this tutorial on percentage normalization using pandas crosstab() in Python, we have covered what percentage normalization is and its types: index, column, and all (or total) normalization.