The Pd.crosstab Function In Python Pandas

When I was working on a data analysis project for a U.S. retail chain where I needed to examine the relationship between customer demographics and purchasing patterns. I needed to create a cross-tabulation of these variables to identify trends. That’s when the pandas crosstab function became my go-to solution.

In this article, I’ll explain what the pandas crosstab function is, how it works, and show you practical examples of how to use it effectively in your data analysis projects.

So, let’s get in..

Table of Contents

Pandas Crosstab

The pandas crosstab function (pd.crosstab) is a useful tool that computes a cross-tabulation of two or more factors. Think of it as creating a frequency table that shows the relationship between different categorical variables.

This function is particularly useful when you want to understand the distribution of data across multiple dimensions or categories.

Basic Syntax of pd.crosstab

Let’s first look at the basic syntax of the pd.crosstab function in Python:

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, 
            aggfunc=None, margins=False, margins_name='All', 
            dropna=True, normalize=False)

The two main parameters you’ll always use are:

index: The categorical variable(s) to use for the table rows
columns: The categorical variable(s) to use for the table columns

1 – Create a Simple Crosstab

Let’s start with a simple example using a dataset of customer information from a U.S. retail chain.

import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
data = {
    'Gender': np.random.choice(['Male', 'Female'], size=100),
    'State': np.random.choice(['California', 'Texas', 'New York', 'Florida'], size=100),
    'Purchase': np.random.choice(['Yes', 'No'], size=100)
}

df = pd.DataFrame(data)

# Create a simple crosstab
result = pd.crosstab(df['Gender'], df['State'])
print(result)

Output:

State    California  Florida  New York  Texas
Gender                                       
Female          11       13        20     12
Male            15       11        10      8

You can refer to the screenshot below to see the output.

This gives us a quick overview of how many male and female customers we have from each state.

2 – Add Margins to Your Crosstab

Sometimes, you want to see the row and column totals. This is where the margins parameter comes in handy:

# Crosstab with margins
result_with_margins = pd.crosstab(df['Gender'], df['State'], margins=True, margins_name='Total')
print(result_with_margins)

Output:

State   California  Florida  New York  Texas  Total
Gender
Female          11       13        20     12     56
Male            15       11        10      8     44
Total           26       24        30     20    100

You can refer to the screenshot below to see the output.

This gives us the total counts for each gender and state, as well as the overall total.

Read Python Pandas Write to Excel

3 – Normalize Your Crosstab

To convert the raw counts into percentages, you can use the normalize parameter in Python:

# Normalize to get percentages
# Normalize by row
row_normalized = pd.crosstab(df['Gender'], df['State'], normalize='index')
print("Row Normalized:")
print(row_normalized)

# Normalize by column
col_normalized = pd.crosstab(df['Gender'], df['State'], normalize='columns')
print("\nColumn Normalized:")
print(col_normalized)

# Normalize by total
total_normalized = pd.crosstab(df['Gender'], df['State'], normalize=True)
print("\nTotal Normalized:")
print(total_normalized)

Output:

Row Normalized:
State   California   Florida  New York     Texas
Gender
Female    0.196429  0.232143  0.357143  0.214286
Male      0.340909  0.250000  0.227273  0.181818

Column Normalized:
State   California   Florida  New York  Texas
Gender
Female    0.423077  0.541667  0.666667    0.6
Male      0.576923  0.458333  0.333333    0.4

Total Normalized:
State   California  Florida  New York  Texas
Gender
Female        0.11     0.13       0.2   0.12
Male          0.15     0.11       0.1   0.08

You can refer to the screenshot below to see the output.

These normalized tables show proportions instead of raw counts, which can be more insightful for comparison.

4 – Use Crosstab with a Values Column

We can also use the crosstab function to aggregate values based on the categorical variables:

# Add a 'PurchaseAmount' column
df['PurchaseAmount'] = np.random.randint(10, 500, size=100)

# Use crosstab with values and an aggregation function
result_with_values = pd.crosstab(df['Gender'], df['State'], 
                                values=df['PurchaseAmount'], 
                                aggfunc='mean',
                                margins=True, 
                                margins_name='Average')
print(result_with_values)

Output:

State    California    Florida   New York      Texas    Average
Gender                                                         
Female    256.714286  242.444444  237.250000  254.307692  248.395833
Male      268.142857  246.454545  248.076923  247.571429  252.980769
Average   262.428571  244.650000  242.840000  250.814815  250.770000

This shows the average purchase amount for each gender across different states, providing more meaningful insights.

Check out np.where in Pandas Python

5 – Create a Multi-Level Crosstab

For more complex analyses, we can create a multi-level crosstab:

# Add an 'Age' column
df['AgeGroup'] = np.random.choice(['18-25', '26-35', '36-50', '50+'], size=100)

# Create a multi-level crosstab
multi_level = pd.crosstab([df['Gender'], df['AgeGroup']], df['State'])
print(multi_level)

Output:

State                California  Florida  New York  Texas
Gender AgeGroup                                          
Female 18-25                 4        2         2      3
       26-35                 2        2         5      5
       36-50                 5        3         3      2
       50+                   3        2         2      3
Male   18-25                 5        2         4      4
       26-35                 3        5         3      2
       36-50                 4        2         4      5
       50+                   2        2         2      3

This gives us a more detailed breakdown of our customer data, showing the distribution across gender, age groups, and states.

Read Pandas Find Duplicates in Python

6 – Visualize Crosstab Results

One of the most useful ways to use Crosstab is to visualize the results:

import matplotlib.pyplot as plt
import seaborn as sns

# Create a heatmap from a crosstab
plt.figure(figsize=(10, 6))
crosstab_result = pd.crosstab(df['Gender'], df['State'])
sns.heatmap(crosstab_result, annot=True, cmap='YlGnBu', fmt='d')
plt.title('Customer Distribution by Gender and State')
plt.tight_layout()
plt.show()

This creates a visual heatmap that makes it much easier to spot patterns in your data.

Practical Applications of pd.crosstab

The pandas crosstab function is incredibly versatile and can be used in various real-world scenarios:

Market Analysis: Analyzing customer demographics and purchasing behavior.
Healthcare: Examining patient outcomes across different treatments and demographics.
Finance: Analyzing transaction patterns across different customer segments.
Education: Studying student performance across different subjects and demographics.
Politics: Analyzing voting patterns across different demographics and regions.

In my experience, the crosstab function is particularly useful when you need to quickly understand the relationship between categorical variables in your dataset.

The methods that I explained in this guide are: creating a simple crosstab, adding margins to your crosstab, normalizing your crosstab, using crosstab with a values column, creating a multi-level crosstab, and visualizing crosstab results. I also discussed some practical applications of pd.crosstab.

I hope this article has helped you understand what the pandas crosstab function is and how you can use it in your data analysis projects.

Other Python articles you may also like:

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.

enjoysharepoint.com/

The pd.crosstab function in Python Pandas

Pandas Crosstab

Basic Syntax of pd.crosstab

1 – Create a Simple Crosstab

2 – Add Margins to Your Crosstab

3 – Normalize Your Crosstab

4 – Use Crosstab with a Values Column

5 – Create a Multi-Level Crosstab

6 – Visualize Crosstab Results

Practical Applications of pd.crosstab

51 PYTHON PROGRAMS PDF FREE

Aspiring to be a Python developer?

Let’s be friends