In this Python tutorial, we will look at what the crosstab() function in Pandas is and how to use it, with a few examples.
We will cover the following topics on the Pandas crosstab() function in Python:
- Crosstab() in Pandas and its syntax
- Introduction to the contingency table and what exactly it is
- Basic Example of Crosstab in Pandas
- How crosstab() in Pandas works with Datasets
- Functionalities of Crosstab in Pandas
- Cons or Limitations of Crosstab in Pandas
Crosstab() in Pandas
crosstab() in Pandas computes a simple cross-tabulation of two or more factors in Python.
- crosstab() is mostly used to summarize and describe a dataset by counting (or aggregating) how often the values of the row factors and column factors that we pass to the function occur together.
- Pandas has several related functions such as groupby(), pivot_table(), and agg(), but many people still prefer crosstab() for frequency counts and simple aggregations because it needs only a single call.
- crosstab() accepts array-like inputs such as pandas Series (for example, the columns of a data frame), NumPy arrays, and lists of these. Together with the normalize parameter, it is often used to view proportions, for example when exploring data for machine learning.
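To make the comparison with groupby() concrete, here is a minimal sketch on a made-up data frame (the column names dept and shift are assumptions chosen only for illustration). It suggests that a single crosstab() call gives the same counts as grouping, counting, and unstacking.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only
df = pd.DataFrame({
    "dept": ["HR", "IT", "IT", "HR", "IT"],
    "shift": ["day", "day", "night", "day", "night"],
})

# One call with crosstab(): counts of each dept/shift pair
ct = pd.crosstab(df["dept"], df["shift"])

# The same counts with groupby(): count rows per pair, then pivot to wide form
gb = df.groupby(["dept", "shift"]).size().unstack(fill_value=0)

print(ct)
print(gb)
```

Both tables have dept as the rows and shift as the columns, with a 0 where a combination never occurs.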
Crosstab() Function Syntax
Below is the crosstab() syntax.
# Syntax of crosstab() function in pandas python
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
- “index” parameter in pandas crosstab() takes the values to group by in the rows
- “columns” parameter in pandas crosstab() takes the values to group by in the columns
- “values” parameter takes the values to aggregate for each row and column combination; it must be passed together with “aggfunc”
- “rownames” parameter sets the names of the row levels that are displayed
- “colnames” parameter sets the names of the column levels that are displayed
- “aggfunc” parameter is the aggregation function (such as count, sum, or mean) applied to “values”
- “margins” parameter adds the row and column totals to the cross table
- “dropna” parameter excludes columns whose entries are all NaN from the cross table
- “normalize” parameter divides the values by a total, which gives proportions between 0 and 1 instead of raw counts
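As a small sketch of the rownames and colnames parameters described above, the example below builds the same table twice from unnamed NumPy arrays. The arrays fruit and color are made up for illustration. Without names, pandas labels the axes row_0 and col_0.

```python
import numpy as np
import pandas as pd

# Hypothetical arrays; the values are made up for illustration
fruit = np.array(["apple", "apple", "pear", "pear", "pear"])
color = np.array(["red", "green", "green", "green", "red"])

# Unnamed arrays get the default axis labels row_0 and col_0
default = pd.crosstab(fruit, color)

# rownames and colnames replace those defaults with readable labels
named = pd.crosstab(fruit, color, rownames=["fruit"], colnames=["color"])

print(default.index.name, default.columns.name)
print(named.index.name, named.columns.name)
print(named)
```

When the inputs are pandas Series, the Series names are used automatically, so these two parameters matter mostly for plain arrays and lists.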
Contingency table:
A contingency table is the output that we get after applying the crosstab() function to a dataset. It describes and summarizes the dataset according to the row and column factors that we passed to the function.
A contingency table is also called a cross-tabulation table or a two-way frequency table. It is used to summarize the relationship between two categorical variables.
In the rest of this crosstab() tutorial, we will look more closely at the parameters of the crosstab() function.
crosstab() in Pandas example
Let us create sample data that contains information about the proficiency levels of different genders in different languages.
Here is an example of how to use crosstab() in Python Pandas. First, we need to initialize NumPy arrays for Languages, Gender, and Proficiency_Level as shown below:
# Import the libraries used in this example
import numpy as np
import pandas as pd
# Initialize the arrays that hold the sample data for crosstab()
Languages=np.array(["English","French","English","English","English","German","French","German","German","French","German","German","German","French","German","French"])
Gender=np.array(["Male","Female","Male","Male","Male","Female","Male","Male","Male","Female","Male","Male","Female","Male","Female","Female"])
Proficiency_Level=np.array(["good","average", "below average","good","average","good","good","below average","average","average","average","good","below average","below average","average","average"])
Suppose we want to know how many females in the data have average proficiency in a language such as French. For this, we apply the crosstab() function to the arrays.
# Calling the crosstab() function
d=pd.crosstab(Languages,[Proficiency_Level,Gender])
print(d)
- The raw NumPy arrays are hard to read on their own, but once we apply crosstab() to them, the resulting table makes the patterns easy to pick out.
- In the output, we can observe that no females have good proficiency in English, whereas two males have good proficiency in English in the data we created.
- In this way, we can use pandas crosstab() in Python to count the values for each combination of factors.
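The question asked earlier (how many females have average proficiency in French) can be answered by reading a single cell from the table. The sketch below repeats the arrays from the example so that it runs on its own; the variable names french_avg_female and english_good_male are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same arrays as in the example above
Languages = np.array(["English","French","English","English","English","German","French","German","German","French","German","German","German","French","German","French"])
Gender = np.array(["Male","Female","Male","Male","Male","Female","Male","Male","Male","Female","Male","Male","Female","Male","Female","Female"])
Proficiency_Level = np.array(["good","average","below average","good","average","good","good","below average","average","average","average","good","below average","below average","average","average"])

d = pd.crosstab(Languages, [Proficiency_Level, Gender])

# The columns form a MultiIndex of (proficiency, gender) pairs,
# so one cell is selected with a row label and a column tuple
french_avg_female = d.loc["French", ("average", "Female")]
english_good_male = d.loc["English", ("good", "Male")]

print(french_avg_female)  # 3
print(english_good_male)  # 2
```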
Working with crosstab() in Pandas with Datasets
In order to perform analysis on datasets using functions like crosstab, we need to follow the below steps:
Step 1: Import the Dataset to create crosstable using pandas
Now that you have a basic idea of Pandas crosstab(), let's apply what we have learned to a real dataset and draw meaningful insights from it.
We can download the dataset from
“https://www.kaggle.com/code/fourbic/visualizing-the-titanic-data-with-seaborn”
or we can load the same dataset directly from seaborn to understand how the crosstab() function in pandas works.
#Import the necessary libraries
import pandas as pd
import seaborn as sns
#Load the dataset after downloading manually from kaggle
data=pd.read_csv("titanic.csv")
data.head()
(or)
#Load the dataset using seaborn library without downloading
data=sns.load_dataset("titanic")
data.head()
- Here we import the libraries that will be used in the analysis.
- Then we load the dataset, either from the CSV file downloaded from Kaggle or directly from the seaborn library. The examples that follow use the column names of the seaborn version (such as class, alive, sex, and fare); a CSV downloaded from Kaggle may name its columns differently.
- Then we display the first five rows of the dataset to see what it looks like.
- This is the Titanic dataset, which has records for 891 passengers. It includes each passenger's gender, the class they traveled in, whether they survived, the fare paid for the ticket, etc.
Step 2: Call the function crosstab() in Python using Pandas
Suppose we want to estimate the chance of surviving when traveling in first class on the Titanic. A count of survivors per class is the starting point, and this is where the crosstab() function in pandas comes in. Let us have a look at the example:
# Calling crosstab( ) function using pandas
data_aftergrouping=pd.crosstab(data["class"],data["alive"])
data_aftergrouping
Here we pass class as the rows and alive as the columns. The pandas crosstab() function counts the passengers for each combination of the two.
- In the output, we can observe that 136 people traveled in first class and survived the incident.
- Likewise, 80 people traveled in the same class and did not survive. This is how the crosstab() function in Python pandas works.
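The two counts above can be turned into a survival share for first class. The sketch below uses stand-in data that reproduces only the first-class counts quoted above (136 and 80); it is not the real Titanic dataset, and the variable name first_class_rate is an assumption made for illustration.

```python
import pandas as pd

# Stand-in data that reproduces only the first-class counts quoted above
# (136 survivors, 80 non-survivors); it is not the real Titanic dataset
data = pd.DataFrame({
    "class": ["First"] * 216,
    "alive": ["yes"] * 136 + ["no"] * 80,
})

counts = pd.crosstab(data["class"], data["alive"])

# Share of first-class passengers who survived: survivors / row total
first_class_rate = counts.loc["First", "yes"] / counts.loc["First"].sum()
print(round(first_class_rate, 3))  # 0.63
```

On the full dataset, the same expression applied to each row of the table would give the survival share per class.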
Functionalities or Parameters of crosstab() in Pandas
The parameters that we pass to the crosstab() function have different purposes. Here are some of the most commonly used features of crosstab():
- Aggregating values
- Adding margins
- Normalizing
- Controlling output format
1. Aggregating values using crosstab()
We pass the aggfunc parameter to the crosstab() function in Python pandas to aggregate a column of values for each group, which describes and summarizes the data. Whenever we pass aggfunc, we must also pass the values parameter, along with index and columns.
# Aggregating Values
pd.crosstab(index=data['class'], columns=data['alive'], values=data['fare'], aggfunc='count')
Here we have set aggfunc to “count”, which returns the number of non-null fare values for each combination of class and alive.
- In the output, we can observe that 136 people traveled in first class and survived the incident.
- It returns the count of passengers in each group because we set the aggfunc parameter to “count”.
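Any other aggregation can be used in the same way. As a minimal sketch, the example below computes the mean fare per group on a small made-up frame that borrows the Titanic column names; the fares are invented for illustration and are not real data.

```python
import pandas as pd

# Small made-up frame that borrows the Titanic column names;
# the values are invented for illustration
data = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "alive": ["yes", "no", "yes", "no", "no"],
    "fare": [80.0, 60.0, 10.0, 8.0, 6.0],
})

# Average fare for each class/alive combination
mean_fare = pd.crosstab(
    index=data["class"],
    columns=data["alive"],
    values=data["fare"],
    aggfunc="mean",
)
print(mean_fare)
```

Here the cell for third class and “no” holds 7.0, the mean of the two fares 8.0 and 6.0.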
2. Adding Margins using crosstab()
If we want to know the row and column totals of the cross table, we set the margins parameter of the crosstab() function to True. Let us look at an example to understand it better.
# Adding margins
pd.crosstab(index=data['class'], columns=data['alive'], values=data['fare'], aggfunc='count', margins=True)
We have added the margins parameter to the previous code. Because it is set to the boolean True, the output shows the row and column totals.
- In the output, we can observe that a total of 216 people traveled in first class, 184 in second class, and 491 in third class.
- Of the 891 people in the dataset, 342 survived the incident.
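The totals are stored in an extra row and column, named “All” by default, and can be read like any other cell. The sketch below uses a small made-up frame with the Titanic column names, not the real dataset.

```python
import pandas as pd

# Small made-up frame with the Titanic column names (not the real data)
data = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "alive": ["yes", "no", "yes", "no", "no", "yes"],
})

with_totals = pd.crosstab(data["class"], data["alive"], margins=True)

# The "All" row and column hold the totals
print(with_totals.loc["All", "All"])    # 6 rows in total
print(with_totals.loc["Third", "All"])  # 3 rows in third class
print(with_totals.loc["All", "yes"])    # 3 rows with alive == "yes"
```

The margins_name parameter from the syntax section can be used to rename “All” to something else, such as “Total”.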
3. Normalizing the data using crosstab()
We can normalize the data in the contingency table too. The “normalize” parameter converts the counts to proportions of a total, and the value we set decides which total is used. If we set it to “columns”, each value is divided by its column total; “index” divides by the row total, and “all” divides by the grand total.
We can set the “normalize” parameter of the crosstab() function to “index”, “columns”, or “all”. Let us understand it through an example.
# Normalizing the data in the crosstable
pd.crosstab(index=data['class'], columns=data['alive'], values=data['fare'], aggfunc='count', normalize="all")
- In the output, we can observe that about 42 percent of all passengers were third-class passengers who did not survive, and about 13 percent of all passengers were third-class passengers who survived.
- Here we have set the “normalize” parameter to “all”, so every count is divided by the grand total.
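To see how the three settings differ, here is a minimal sketch on a made-up frame with 4 first-class rows and 6 third-class rows. The data is invented so that the same cell gives a different value under each setting.

```python
import pandas as pd

# Made-up data: First has 3 "yes" and 1 "no"; Third has 3 "yes" and 3 "no"
data = pd.DataFrame({
    "class": ["First"] * 4 + ["Third"] * 6,
    "alive": ["yes", "yes", "yes", "no",
              "yes", "yes", "yes", "no", "no", "no"],
})

by_row = pd.crosstab(data["class"], data["alive"], normalize="index")
by_col = pd.crosstab(data["class"], data["alive"], normalize="columns")
overall = pd.crosstab(data["class"], data["alive"], normalize="all")

# The same cell (First, yes) under each setting
print(by_row.loc["First", "yes"])   # 3 of the 4 first-class rows
print(by_col.loc["First", "yes"])   # 3 of the 6 "yes" rows
print(overall.loc["First", "yes"])  # 3 of all 10 rows
```

With “index” every row sums to 1, with “columns” every column sums to 1, and with “all” the whole table sums to 1.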
4. Controlling output format using crosstab()
We can format the output of the cross table through the pd.options.display settings in Python pandas. For example, we can set the maximum number of rows or columns to be displayed in the output.
We can also control how many decimals are shown: setting float_format to '{:.2f}'.format displays the values with 2 decimal places.
Let us try to understand it through an example:
# Controlling the output format
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.float_format = '{:.2f}'.format
pd.crosstab(index=data['class'], columns=data['sex'], values=data['survived'], aggfunc='sum', margins=True, normalize='index')
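The settings above are global and stay in effect for the rest of the session. As an alternative sketch, pd.option_context() applies a setting only inside a with block. The frame below is made up and only borrows the Titanic column names.

```python
import pandas as pd

# Small made-up frame that borrows the Titanic column names
data = pd.DataFrame({
    "class": ["First"] * 5 + ["Third"] * 4,
    "sex": ["female", "female", "female", "male", "male",
            "female", "female", "male", "male"],
    "survived": [1, 1, 0, 0, 1, 1, 0, 0, 0],
})

# Survival rate per class and sex
table = pd.crosstab(data["class"], data["sex"],
                    values=data["survived"], aggfunc="mean")

# Two-decimal formatting applies only inside this block;
# the global display options are restored afterwards
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = table.to_string()

print(formatted)
```

The underlying values in table are unchanged; only their text representation is rounded to two decimals.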
These are the strengths of the crosstab() function. Just as every coin has a flip side, crosstab() has some drawbacks too. We will go through them now.
Cons of using pd.crosstab() in Pandas
- Limited to variables: The crosstab() function works best with categorical data. If we pass continuous numeric variables to the function, it creates a row or column for every distinct value, so instead of a meaningful summary we get a large, sparse table.
- Poor Customisation: Sometimes it is difficult to trace errors in the cross table that we have generated because of its limited customization options.
- Limited to size: It works better with data that has fewer features and rows, because aggregation with crosstab() in Python pandas can take a long time on large datasets.
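One common way around the first limitation is to group a continuous column into a few ranges before passing it to crosstab(). The sketch below uses pd.cut() for this; the data, the bin edges, and the labels are all arbitrary choices made for illustration.

```python
import pandas as pd

# Made-up fares and outcomes, for illustration only
data = pd.DataFrame({
    "fare": [7.25, 8.05, 13.0, 26.0, 71.28, 53.1, 8.46, 151.55],
    "alive": ["no", "no", "yes", "yes", "yes", "yes", "no", "yes"],
})

# Group the continuous fares into three labeled ranges
# (the bin edges 0, 10, 50, 200 are an arbitrary choice)
fare_band = pd.cut(data["fare"], bins=[0, 10, 50, 200],
                   labels=["low", "mid", "high"])

# Cross-tabulate the ranges instead of the raw fares
banded = pd.crosstab(fare_band, data["alive"])
print(banded)
```

The result has three rows (low, mid, high) instead of one row per distinct fare.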
Conclusion
In this tutorial, we have covered crosstab() in pandas, the contingency table, the functionalities of crosstab(), and the cons or limitations of crosstab() in Python using pandas. Here is the list of topics we covered.
- Crosstab in Pandas and its syntax
- What exactly a contingency table in pandas crosstab() is
- A basic example of crosstab in Pandas
- Working on crosstab in Pandas with Datasets
- Functionalities of Crosstab in Pandas
- Cons or Limitations of Crosstab in Pandas
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, Scikit-Learn, etc., for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc.