Python SciPy Chi-Square Test

Recently, I worked on a data analysis project where I needed to determine if there was a significant relationship between two categorical variables in my dataset. The chi-squared test was the ideal statistical method for this situation. Since Python’s SciPy library provides a straightforward implementation of this test, I decided to explore it in depth.

In this article, I’ll walk you through how to perform Chi-Square tests using SciPy in Python, with practical examples that make the concepts easy to understand.

So let’s dive in!

What is the Chi-Square Test? An Easy Overview

The Chi-Square test is a statistical hypothesis test that helps determine if there’s a significant association between categorical variables. It works by comparing the observed frequencies in your data with what you would expect if there were no relationship between the variables.

Think of it as a “relationship detector” for categorical data. The lower the p-value from your test, the stronger the evidence that the variables are indeed related.

This test is beneficial when you’re working with:

  • Survey data
  • Market research
  • Medical studies
  • Election analysis

Read Python SciPy Exponential

Set Up Your Environment for Chi-Square Testing

Before diving into the code, let’s make sure you have everything installed:

# Import the necessary libraries
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Method 1: Chi-Square Test for Independence with SciPy

Let’s start with a practical example. Imagine we’re analyzing voting preferences in different age groups in the US to see if age and political preference are related:

# Create a contingency table: rows are age groups, columns are political preferences
observed = np.array([
    [120, 90, 40],   # 18-29 age group (Republican, Democrat, Independent)
    [210, 240, 90],  # 30-49 age group
    [280, 260, 120], # 50-65 age group
    [290, 210, 70]   # 65+ age group
])

# Perform the Chi-Square test
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)

You can refer to the screenshot below to see the output:

chi square test python

Here’s what each output means:

  • Chi-square statistic: The higher this value, the more likely the variables are related
  • P-value: If this is less than 0.05, we typically reject the null hypothesis
  • Degrees of freedom: Calculated as (rows-1) × (columns-1)
  • Expected frequencies: What we would expect if there was no relationship

If the p-value is small (typically less than 0.05), we can conclude that age and political preference are associated.

Check out Python SciPy Confidence Interval

Method 2: Chi-Square Goodness of Fit Test

Sometimes we want to check if a sample follows an expected distribution. For example, let’s test if the distribution of blood types in a sample matches the expected US population distribution:

# Observed frequencies of blood types in our sample
observed = np.array([42, 28, 21, 9])  # A, B, O, AB

# Expected proportions based on US population
expected_proportions = np.array([0.44, 0.10, 0.42, 0.04])  # A, B, O, AB

# Calculate expected frequencies (proportions * total sample size)
total_observations = observed.sum()
expected = expected_proportions * total_observations

# Perform the Chi-Square goodness of fit test
chi2, p = stats.chisquare(observed, expected)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")

You can refer to the screenshot below to see the output:

chi square python

In this case, if the p-value is small, it suggests that the observed blood type distribution in our sample differs significantly from what we would expect in the US population.

Read Python SciPy Minimize

Method 3: Visualize Chi-Square Results with Heatmaps

Sometimes, visualizing the relationship between variables makes it easier to understand the results of a chi-squared test:

# Creating a pandas DataFrame from our original contingency table
political_df = pd.DataFrame(observed, 
                           index=['18-29', '30-49', '50-65', '65+'],
                           columns=['Republican', 'Democrat', 'Independent'])

# Calculate the percentages within each age group
percentages = political_df.div(political_df.sum(axis=1), axis=0) * 100

# Create a heatmap
plt.figure(figsize=(10, 6))
plt.pcolor(percentages, cmap='Blues')
plt.xticks(np.arange(0.5, len(percentages.columns)), percentages.columns)
plt.yticks(np.arange(0.5, len(percentages.index)), percentages.index)

# Add percentage labels
for i in range(len(percentages.index)):
    for j in range(len(percentages.columns)):
        plt.text(j+0.5, i+0.5, f'{percentages.iloc[i, j]:.1f}%', 
                ha='center', va='center')

plt.colorbar(label='Percentage (%)')
plt.title('Political Preference by Age Group')
plt.tight_layout()
plt.show()

# Add a text annotation about the Chi-Square result
if p < 0.05:
    print(f"Chi-Square test indicates a significant relationship (p={p:.4f})")
else:
    print(f"Chi-Square test does not indicate a significant relationship (p={p:.4f})")

You can refer to the screenshot below to see the output:

chi-square test of independence python

This visualization helps you see which combinations of age groups and political preferences are more or less common than expected.

Read Python SciPy Freqz

Common Issues When Using Chi-Square Tests

While the Chi-Square test is relatively straightforward, there are some common issues to watch out for:

  1. Small expected frequencies: Chi-Square tests become unreliable when expected frequencies are too small (less than 5 in any cell). If you encounter this issue, consider combining categories or using Fisher’s exact test:
# For 2x2 tables with small expected frequencies
oddsratio, p = stats.fisher_exact([[a, b], [c, d]])
print(f"Fisher's Exact Test p-value: {p:.4f}")
  1. Multiple comparisons: If you’re performing multiple Chi-Square tests, consider adjusting your significance level using the Bonferroni correction:
# If performing n tests, adjusted alpha is:
alpha_adjusted = 0.05 / n
  1. Assumption of independence: Chi-Square tests assume that each observation is independent. Be careful with repeated measures designs.

Check out Python SciPy Stats Multivariate_Normal

Use SciPy’s Chi-Square Distribution Functions

Sometimes you might need more control over the Chi-Square calculations. SciPy provides functions to work directly with the Chi-Square distribution:

# Calculate the critical chi-square value for a given significance level
alpha = 0.05
degrees_of_freedom = 6
critical_value = stats.chi2.ppf(1-alpha, degrees_of_freedom)
print(f"Critical chi-square value: {critical_value:.4f}")

# Calculate the p-value from a chi-square statistic
chi2_statistic = 14.5
p_value = 1 - stats.chi2.cdf(chi2_statistic, degrees_of_freedom)
print(f"P-value: {p_value:.4f}")

These functions are useful when you’re implementing custom statistical procedures or when you need to perform more complex hypothesis testing.

I hope you found this guide to Chi-Square testing with SciPy helpful! The Chi-Square test is a powerful tool in your statistical toolkit, and SciPy makes it easy to implement in Python. Whether you’re analyzing survey data, conducting market research, or exploring relationships in categorical variables, the Chi-Square test can provide valuable insights.

If you have any questions or suggestions about using Chi-Square tests in Python, please let me know in the comments below. Remember, the key to getting reliable results is understanding your data and the assumptions behind the test.

Other Python articles you may also like:

51 Python Programs

51 PYTHON PROGRAMS PDF FREE

Download a FREE PDF (112 Pages) Containing 51 Useful Python Programs.

pyython developer roadmap

Aspiring to be a Python developer?

Download a FREE PDF on how to become a Python developer.

Let’s be friends

Be the first to know about sales and special discounts.