How to aggregate numeric values using crosstab() in Pandas Python

Through this python tutorial, we will be covering topics from the basic level mostly about how to aggregate numeric values using crosstab() in Pandas Python. The topics that are going to be covered in this python tutorial are :

  • Aggregate parameter in Python crosstab()
  • Aggregating values by count using pandas crosstab()
  • Aggregating values by the sum using pandas crosstab()
  • Aggregating values by the mean using pandas crosstab()
  • Aggregating values by the median using pandas crosstab()
  • Aggregating values by min or max using pandas crosstab()
  • Aggregate values by min and max using pandas crosstab()

What is an aggregate parameter in pandas crosstab()

In python, aggregation is all about grouping the values based on the condition that is passed inside the Pandas crosstab() function and is performed on a large set of values in order to compute the mean, median, count, minimum, and maximum among all the values in the dataset.

Performing aggregation gives a clear set of ideas about our dataset and it has the capability to describe our dataset in an efficient and effective manner.

Values to be passed to aggregate parameter in pandas crosstab()

The aggregate parameter in crosstab() can be set to “count”, “min”, “max”, “mean”, and “median”. By default, the aggregate parameter gives the count.

# Aggregate parameter in pandas python
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, 
aggfunc= "mean", margins=False, margins_name='All', dropna=True, normalize=False)

Here, we have passed “mean” as the value to the aggregate parameter. It will find the mean or average of all the values in the dataset. Likewise, we can pass parameters like median, count, etc…

Dataset Description

We can download the dataset from

“https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset?resource=download”

or else we can directly load the same dataset from the seaborn to understand the working of aggregation parameters in the crosstab() function in pandas python.

#Import the necessary libraries 
import pandas as pd
import seaborn as sns

#Load the dataset after downloading manually from kaggle
data=pd.read_csv("tips.csv")
(or)
#Load the dataset using seaborn library without downloading
data=sns.load_dataset("tips")

#Print random 10 rows in the dataset
print(data.sample(10))

Our “tips” dataset has columns like total_bill, tip, sex, smoker, day, time, and size.

Total_bill is the bill made by a particular customer and the tip is the amount they have offered to the waiter and gender column in the dataset describes whether the customer is male or female, and the smoker feature defines whether the customer smokes or not.

Day and time features in the dataset tell what day and at what time the customer visited the restaurant and the size along with the customer and how many people attended.

aggregation using pandas crosstab()
Printing 10 random rows of tips dataset

There are columns in our dataset that may not be useful for further analysis. So, here we are taking a copy of useful columns and storing it in the “data” variable.

# Creating a dataframe with useful columns from existing dataframe
data=data[["total_bill","sex","time"]].copy()

#Printing random 5 rows of a tips dataset
print(data.sample(5))

We have recreated the dataset with useful features or columns and deleted all other features from the original dataset that won’t be useful for further analysis.

crosstab aggregate function
Dataset with useful columns

Aggregating numeric values by count using pandas crosstab() in python

We can aggregate the values by count which means it will count the number of rows in the dataset based on the condition we have passed inside the function.

When we set “count” to the “aggfunc” parameter in crosstab pandas python, it will return the number of rows or count. Let us understand it through a sample code:

# Aggregating the values by count
agg_count=pd.crosstab(data["sex"],data["time"],values=data["total_bill"],margins="True",aggfunc="count")
print(agg_count)

We can observe that there are 124 male customers who joined a restaurant for dinner and 52 females joined for dinner. It is clear that more number of customers are getting during dinner time from the above crosstable.

Aggregating values by count using pandas crosstab
Aggregating values by count using crosstab in pandas

Aggregating numeric values by sum using pandas crosstab() in python

We can aggregate the values by the sum which means it will sum the values in the dataset based on the condition we have passed inside the function.

When we set “sum” to the “aggfunc” parameter in crosstab pandas python. It will return the sum of the values. Let us understand it through a sample code:

# Aggregating the values by sum
agg_sum=pd.crosstab(data["sex"],data["time"],values=data["total_bill"],margins="True",aggfunc="sum")
print(agg_sum)

We can observe that male customers made a large amount of bill during both the lunch and dinner times i.e (3256.82) which is almost double to what female customers made i.e (1570.95). And another thing we can observe from the above table is most of the profit is coming from dinner time.

crosstab aggregate function pandas
Aggregating values by sum using crosstab in pandas

Aggregating numeric values by mean using pandas crosstab() in python

We can aggregate the values by the mean which means it will store the average of the values in the dataset based on the condition we have passed inside the function. Aggregating values by mean is the most used technique in order to bring insights from the dataset.

When we set “mean” to the “aggfunc” parameter in crosstab pandas python. It will return the mean or average of the values. Let us understand it through a sample code:

# Aggregating the values by mean
agg_mean=pd.crosstab(data["sex"],data["time"],values=data["total_bill"],margins="True",aggfunc="mean")
print(agg_mean)

We can see that male customers who went for lunch made an average bill of 18.048 during lunch time whereas female customers made an average bill of about 16.339 at the same time.

Aggregating values by mean using crosstab pandas
Aggregating values by means using crosstab in pandas

Aggregating numeric values by median using pandas crosstab() in python

We can aggregate the values by the median which means it will store the median of the values in the dataset based on the condition we have passed inside the function. If outliers exist in our dataset, aggregating the values by median is the best option.

# Aggregating the values by median
agg_median=pd.crosstab(data["sex"],data["time"],values=data["total_bill"],margins="True",aggfunc="median")
agg_median

Male customers made a median amount of bill that is 16.58 during lunchtime whereas female customers made 13.42 bill which is the median of all the bills made by females during lunch.

Aggregating values by median using crosstab pandas
Aggregating values by median using crosstab in pandas

When we set “median” to the “aggfunc” parameter in crosstab pandas python, it will return the middle value among all the values in the dataset.

Aggregating numeric values by minimum or maximum using pandas crosstab() in python

We can aggregate the values by the minimum or maximum which means it will store the min or max of the values in the dataset based on the condition we have passed inside the function.

Aggregating values by minimum or maximum is one of the most widely used techniques for example if a stakeholder wants to know who is making more sales in that particular domain. we can do it by aggregating the values using the pandas crosstab() function in python.

We have to set “max” to the parameter “aggfunc” in pandas crosstab().

# Aggregating the values by maximum value
agg_max=pd.crosstab(data["sex"],data["time"],values=data["total_bill"],margins="True",aggfunc="max")
print(agg_max)

Among males, the highest bill generated during lunchtime is 41.19 among all the bills generated. Likewise, among males, the highest bill generated during lunchtime is 43.11 among all the bills generated.

Aggregating values by min or max using crosstab in pandas
Aggregating values by maximum using crosstab in pandas

In the same way, we can aggregate the values by setting “min” to the parameter “aggfunc” in the crosstab() function in pandas.

Aggregating numeric values by minimum and maximum using pandas crosstab() in python

We can aggregate the values by the minimum or maximum which means it will store the min or max of the values in the dataset based on the condition we have passed inside the function.

Rather than setting one value to the parameter “aggfunc”, we will pass both “min” and “max” as a list to the pandas crosstab() function and it will give us both the results i.e maximum and minimum among the values.

# Aggregating the values by maximum and minimum values
agg_max_min=pd.crosstab(data["sex"],data["time"],values=data["total_bill"],margins="True",aggfunc=["max","min"])
print(agg_max_min)

From the above example, we can observe that 41.19 is the maximum amount paid by the male customer during lunchtime and 7.51 is the minimum amount that is paid by the male customer at the same time.

Aggregating values by both maximum and minimum values using crosstab in pandas
Aggregating values by minimum and maximum using pandas crosstab()

In this way, using pandas crosstab() in python we can create a crosstable that will represent both the maximum and minimum values among the features at the same time.

Conclusion

Through this “aggregating values using pandas crosstab in python” tutorial, we have covered the topics like the importance of aggregate function, and aggregating values by mean, median, count, minimum, and maximum values.

There are functions like pivot() and aggregate() in python pandas that can be used to perform aggregations on a dataset but the majority of people prefer crosstab to aggregate numeric values because of its ease.

You may also like the following pandas tutorials: