Python SciPy Chi-Square Test

Recently, I worked on a data analysis project where I needed to determine if there was a significant relationship between two categorical variables in my dataset. The chi-squared test was the ideal statistical method for this situation. Since Python’s SciPy library provides a straightforward implementation of this test, I decided to explore it in depth.

In this article, I’ll walk you through how to perform Chi-Square tests using SciPy in Python, with practical examples that make the concepts easy to understand.

So let’s dive in!

This Tutorial Covers:

What is the Chi-Square Test? An Easy Overview

The Chi-Square test is a statistical hypothesis test that helps determine if there’s a significant association between categorical variables. It works by comparing the observed frequencies in your data with what you would expect if there were no relationship between the variables.

Think of it as a “relationship detector” for categorical data. The lower the p-value from your test, the stronger the evidence that the variables are indeed related.

This test is beneficial when you’re working with:

Survey data
Market research
Medical studies
Election analysis

Read Python SciPy Exponential

Set Up Your Environment for Chi-Square Testing

Before diving into the code, let’s make sure you have everything installed:

# Import the necessary libraries
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Method 1: Chi-Square Test for Independence with SciPy

Let’s start with a practical example. Imagine we’re analyzing voting preferences in different age groups in the US to see if age and political preference are related:

# Create a contingency table: rows are age groups, columns are political preferences
observed = np.array([
    [120, 90, 40],   # 18-29 age group (Republican, Democrat, Independent)
    [210, 240, 90],  # 30-49 age group
    [280, 260, 120], # 50-65 age group
    [290, 210, 70]   # 65+ age group
])

# Perform the Chi-Square test
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)

You can refer to the screenshot below to see the output:

Here’s what each output means:

Chi-square statistic: The higher this value, the more likely the variables are related
P-value: If this is less than 0.05, we typically reject the null hypothesis
Degrees of freedom: Calculated as (rows-1) × (columns-1)
Expected frequencies: What we would expect if there was no relationship

If the p-value is small (typically less than 0.05), we can conclude that age and political preference are associated.

Check out Python SciPy Confidence Interval

Method 2: Chi-Square Goodness of Fit Test

Sometimes we want to check if a sample follows an expected distribution. For example, let’s test if the distribution of blood types in a sample matches the expected US population distribution:

# Observed frequencies of blood types in our sample
observed = np.array([42, 28, 21, 9])  # A, B, O, AB

# Expected proportions based on US population
expected_proportions = np.array([0.44, 0.10, 0.42, 0.04])  # A, B, O, AB

# Calculate expected frequencies (proportions * total sample size)
total_observations = observed.sum()
expected = expected_proportions * total_observations

# Perform the Chi-Square goodness of fit test
chi2, p = stats.chisquare(observed, expected)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")

You can refer to the screenshot below to see the output:

In this case, if the p-value is small, it suggests that the observed blood type distribution in our sample differs significantly from what we would expect in the US population.

Read Python SciPy Minimize

Method 3: Visualize Chi-Square Results with Heatmaps

Sometimes, visualizing the relationship between variables makes it easier to understand the results of a chi-squared test:

# Creating a pandas DataFrame from our original contingency table
political_df = pd.DataFrame(observed, 
                           index=['18-29', '30-49', '50-65', '65+'],
                           columns=['Republican', 'Democrat', 'Independent'])

# Calculate the percentages within each age group
percentages = political_df.div(political_df.sum(axis=1), axis=0) * 100

# Create a heatmap
plt.figure(figsize=(10, 6))
plt.pcolor(percentages, cmap='Blues')
plt.xticks(np.arange(0.5, len(percentages.columns)), percentages.columns)
plt.yticks(np.arange(0.5, len(percentages.index)), percentages.index)

# Add percentage labels
for i in range(len(percentages.index)):
    for j in range(len(percentages.columns)):
        plt.text(j+0.5, i+0.5, f'{percentages.iloc[i, j]:.1f}%', 
                ha='center', va='center')

plt.colorbar(label='Percentage (%)')
plt.title('Political Preference by Age Group')
plt.tight_layout()
plt.show()

# Add a text annotation about the Chi-Square result
if p < 0.05:
    print(f"Chi-Square test indicates a significant relationship (p={p:.4f})")
else:
    print(f"Chi-Square test does not indicate a significant relationship (p={p:.4f})")

You can refer to the screenshot below to see the output:

This visualization helps you see which combinations of age groups and political preferences are more or less common than expected.

Read Python SciPy Freqz

Common Issues When Using Chi-Square Tests

While the Chi-Square test is relatively straightforward, there are some common issues to watch out for:

Small expected frequencies: Chi-Square tests become unreliable when expected frequencies are too small (less than 5 in any cell). If you encounter this issue, consider combining categories or using Fisher’s exact test:

# For 2x2 tables with small expected frequencies
oddsratio, p = stats.fisher_exact([[a, b], [c, d]])
print(f"Fisher's Exact Test p-value: {p:.4f}")

Multiple comparisons: If you’re performing multiple Chi-Square tests, consider adjusting your significance level using the Bonferroni correction:

# If performing n tests, adjusted alpha is:
alpha_adjusted = 0.05 / n

Assumption of independence: Chi-Square tests assume that each observation is independent. Be careful with repeated measures designs.

Check out Python SciPy Stats Multivariate_Normal

Use SciPy’s Chi-Square Distribution Functions

Sometimes you might need more control over the Chi-Square calculations. SciPy provides functions to work directly with the Chi-Square distribution:

# Calculate the critical chi-square value for a given significance level
alpha = 0.05
degrees_of_freedom = 6
critical_value = stats.chi2.ppf(1-alpha, degrees_of_freedom)
print(f"Critical chi-square value: {critical_value:.4f}")

# Calculate the p-value from a chi-square statistic
chi2_statistic = 14.5
p_value = 1 - stats.chi2.cdf(chi2_statistic, degrees_of_freedom)
print(f"P-value: {p_value:.4f}")

These functions are useful when you’re implementing custom statistical procedures or when you need to perform more complex hypothesis testing.

I hope you found this guide to Chi-Square testing with SciPy helpful! The Chi-Square test is a powerful tool in your statistical toolkit, and SciPy makes it easy to implement in Python. Whether you’re analyzing survey data, conducting market research, or exploring relationships in categorical variables, the Chi-Square test can provide valuable insights.

If you have any questions or suggestions about using Chi-Square tests in Python, please let me know in the comments below. Remember, the key to getting reliable results is understanding your data and the assumptions behind the test.

Other Python articles you may also like:

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.

enjoysharepoint.com/

Python SciPy Chi-Square Test

What is the Chi-Square Test? An Easy Overview

Set Up Your Environment for Chi-Square Testing

Method 1: Chi-Square Test for Independence with SciPy

Method 2: Chi-Square Goodness of Fit Test

Method 3: Visualize Chi-Square Results with Heatmaps

Common Issues When Using Chi-Square Tests

Use SciPy’s Chi-Square Distribution Functions

51 PYTHON PROGRAMS PDF FREE

Aspiring to be a Python developer?

Let’s be friends