In this Python tutorial, we will learn about the “Python Scipy Chi-Square Test”, which measures the association between categorical variables, and cover the following topics.
- Python Scipy Chi-Square Test
- Python Scipy Chi-Square Test of Independence
- Python Scipy Chi-Square Test Contingency Table
- Python Scipy Chi-Square Test Categorical Variables
- Python Scipy Chi-Square Test Normal Distribution
- Python Scipy Chi-Square Test Goodness of Fit
- Python Scipy Chi-Square Test P-Value
Python Scipy Chi-Square Test
One technique to test for a relationship between two categorical variables is the chi-square statistic. Python SciPy provides the method chisquare() in the module scipy.stats for this purpose. The method chisquare() tests the null hypothesis that the categorical data has the given frequencies.
- If the observed or expected frequencies in each group are too low, this test is invalid. The observed and predicted frequencies should all be at least 5, according to a common criterion.
The syntax is given below.
scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)
Where parameters are:
- f_obs(array_data): Each category’s observed frequencies.
- ddof(int): “Delta degrees of freedom”: an adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k – 1 – ddof degrees of freedom, where k is the number of observed frequencies. ddof has a default value of 0.
- f_exp(array_data): Expected frequencies in each category. By default, the categories are assumed to be equally likely.
- axis(int): The axis of the broadcast result of f_obs and f_exp along which to apply the test. If axis is None, all values in f_obs are treated as a single data set. The default is 0.
The method returns chisq (the chi-squared test statistic) and p (the p-value of the test), each of type ndarray or float.
Let’s take an example by following the below steps:
Import the required libraries using the below python code.
from scipy import stats
Now create an array of observed frequencies. Since f_exp is not specified, the expected frequencies are assumed equal across categories, i.e. each equals the mean of the observed frequencies. Use the below code to calculate the chi-square statistic for the array values.
arr = [9,8,12,15,18]
stats.chisquare(arr)
Looking at the above output, we have calculated the chi-square statistic and p-value of the array values using the method chisquare() of Python SciPy.
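As a sanity check, the statistic returned by chisquare() matches the textbook formula Σ(O − E)²/E with the expected counts taken as the mean of the observed counts. A minimal sketch, reusing the array from the example above:

```python
import numpy as np
from scipy import stats

arr = np.array([9, 8, 12, 15, 18])

# With f_exp omitted, every category is assumed equally likely, so each
# expected count is simply the mean of the observed counts.
expected = np.full(arr.size, arr.mean())            # all 12.4

# Textbook chi-square statistic: sum of (observed - expected)^2 / expected
manual_stat = np.sum((arr - expected) ** 2 / expected)

stat, p = stats.chisquare(arr)
print(stat, manual_stat)                            # both about 5.58
```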
Read: Scipy Signal – Helpful Tutorial
Python Scipy Chi-Square Test of Independence
The Chi-Square Test of Independence is used to test whether two categorical variables have a significant relationship. Python SciPy provides the method chi2_contingency() for this kind of test in the module scipy.stats.
The syntax is given below.
scipy.stats.chi2_contingency(observed, correction=True, lambda_=None)
Where parameters are:
- observed(array_data): The table of contingencies. Each category’s observed frequencies (number of occurrences) are listed in the table.
- lambda_(str or float): Pearson’s chi-squared statistic is the default statistic computed in this test. lambda_ allows you to employ a statistic from the Cressie-Read power divergence family instead.
- correction(boolean): If True (the default) and the degrees of freedom are 1, apply Yates’ correction for continuity. The correction adjusts each observed value by 0.5 toward the corresponding expected value.
The method chi2_contingency() returns chi2 (the test statistic), p (the p-value of the test), dof (the degrees of freedom), and expected (the expected frequencies).
Let’s take an example by following the below steps:
Import the required libraries using the below python code.
import numpy as np
from scipy import stats
Create an array containing observation values using the below code.
observation = np.array([[5, 5, 5], [8, 8, 8]])
Pass the observation to the method chi2_contingency() to test the independence of the variables using the below code.
stats.chi2_contingency(observation)
This is how to use the method chi2_contingency() of Python SciPy to test the independence of variables.
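Since chi2_contingency() returns four values, unpacking them by name keeps the code readable. A small sketch using the same table:

```python
import numpy as np
from scipy import stats

observation = np.array([[5, 5, 5], [8, 8, 8]])

# Unpack the four return values instead of indexing the raw tuple
chi, p, dof, expected = stats.chi2_contingency(observation)

# Each row here is proportional to the column totals, so the observed
# counts equal the expected counts exactly and the statistic is 0.
print(chi, p, dof)      # 0.0  1.0  dof = (2-1)*(3-1) = 2
print(expected)
```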
Read: Scipy Stats Zscore + Examples
Python Scipy Chi-Square Test Contingency Table
We have learned about chisquare() in the above sub-section “Python Scipy Chi-Square Test”. Here in this subsection, we will learn about another method, chi2_contingency(), that exists in the module scipy.stats. But first, what is a contingency table?
In statistics, a contingency table (sometimes known as a crosstab) summarises the connection between multiple categorical variables. We’ll look at a table that illustrates the number of men and women who have purchased various types of fruits.
|       | Apple | Mango | Banana | Strawberry |
|-------|-------|-------|--------|------------|
| women | 203   | 150   | 190    | 305        |
| men   | 195   | 170   | 250    | 400        |
| sum   | 398   | 320   | 440    | 705        |
The test’s goal is to determine whether the two variables gender and fruit preference are connected.
We begin by formulating the null hypothesis (H0), which claims that the variables have no relationship. An alternative hypothesis (H1) is that there is a significant link between the two.
If you don’t know about the hypothesis, then refer to another sub-section of this tutorial which is “Python Scipy Chi-Square Test P-Value”.
Now we will use the method chi2_contingency() to test the above hypothesis by following the below steps:
Import the required libraries using the below python code.
from scipy.stats import chi2_contingency
Create the contingency table (the counts from the table above, excluding the sum row) using the below code.
tab_data = [[203, 150, 190, 305], [195, 170, 250, 400]]
Perform the test using the method chi2_contingency() with the below code.
chi2_contingency(tab_data)
From the above output, we can see that the p-value (about 0.058) is greater than 0.05, so we fail to reject the null hypothesis: at the 5% level, there is no significant association between gender and fruit preference.
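The expected counts that chi2_contingency() reports can be reproduced from the marginal totals, which makes the mechanics of the test concrete. A sketch using the same table:

```python
import numpy as np
from scipy.stats import chi2_contingency

tab_data = np.array([[203, 150, 190, 305],
                     [195, 170, 250, 400]])

chi, p, dof, expected = chi2_contingency(tab_data)

# expected[i, j] = row_total[i] * col_total[j] / grand_total
row_totals = tab_data.sum(axis=1, keepdims=True)
col_totals = tab_data.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / tab_data.sum()

print(dof)              # (2 - 1) * (4 - 1) = 3
```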
Read: Python Scipy Matrix + Examples
Python Scipy Chi-Square Test Categorical Variables
We already know that the chi-square test checks whether a relationship between two categorical variables exists. To perform this test, we use the method chisquare() of the module scipy.stats.
Let’s take an example by following the below steps where our data is going to be categorical.
Import the required libraries using the below python code.
import numpy as np
from scipy.stats import chisquare
Create two arrays: the observed number of hours visitors spend on the website “pythonguides.com” and the number of hours we expected them to spend. Note that chisquare() requires the observed and expected frequencies to sum to the same total.

obs_data = [3,8,6,5,7,10]
exp_data = [4,6,7,8,6,8]
Pass these two arrays to the method chisquare() to test whether the observed frequencies differ from the expected frequencies using the below code.
chisquare(obs_data,exp_data)
Looking at the above output, we are interested in the p-value, which is about 0.72.
Since the p-value is greater than 0.05, the observed frequencies do not differ significantly from the expected frequencies.
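One practical detail: chisquare() requires the observed and expected frequencies to sum to the same total (within a small tolerance), otherwise it raises a ValueError. When the expectation is given as proportions or counts on a different scale, rescale it first. A sketch, with hypothetical expected weights:

```python
import numpy as np
from scipy.stats import chisquare

obs_data = np.array([3, 8, 6, 5, 7, 10])        # sums to 39

# Hypothetical expected *weights*; they sum to 28, not 39
exp_weights = np.array([4, 6, 7, 8, 1, 2], dtype=float)

# Rescale the weights so they sum to the observed total before testing
exp_counts = exp_weights / exp_weights.sum() * obs_data.sum()

stat, p = chisquare(obs_data, f_exp=exp_counts)
```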
Read: Python Scipy FFT [11 Helpful Examples]
Python Scipy Chi-Square Test Normal Distribution
Here in this section, we will use the method chisquare() of Python SciPy to test whether a sample comes from a normal distribution. Because chisquare() compares frequencies, we first bin the sample and compare the observed bin counts with the counts a fitted normal distribution predicts.
Let’s understand with an example by following the below steps:
Import the required libraries using the below python code.
from scipy import stats
import numpy as np
Generate data from a normal distribution using the below code.
norm_data = np.random.normal(10, 3, 200)
Bin the data, compute the expected bin counts from the fitted normal distribution, and run the test using the below code.
observed, edges = np.histogram(norm_data, bins=5)
cdf = stats.norm.cdf(edges, loc=norm_data.mean(), scale=norm_data.std())
expected = np.diff(cdf)
expected = expected / expected.sum() * observed.sum()  # totals must match
stats.chisquare(observed, f_exp=expected, ddof=2)  # 2 estimated parameters
From the output, if the p-value is greater than 0.05, we fail to reject the hypothesis that the sample comes from a normal distribution.
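SciPy also ships a dedicated normality test, stats.normaltest() (the D’Agostino-Pearson test), which avoids binning choices entirely. A quick sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # seeded generator for repeatability
sample = rng.normal(10, 3, 200)

# D'Agostino-Pearson normality test: combines skewness and kurtosis
stat, p = stats.normaltest(sample)

# If p > 0.05, we fail to reject the hypothesis that the sample is normal
print(stat, p)
```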
Read: Scipy Find Peaks – Useful Tutorial
Python Scipy Chi-Square Test Goodness of Fit
To test whether a categorical variable follows a hypothesized distribution, a Chi-Square Goodness of Fit Test is used. The method chisquare() in the module scipy.stats, which we covered in the first sub-section of this tutorial, performs this test.
Let’s take an example by following the below steps:
Import the required libraries using the below python code.
from scipy.stats import chisquare
Suppose we expect an equal number of visitors on our website pythonguides.com on each of five days. To verify this hypothesis, the number of visitors to the website is recorded for each day.
Create two arrays to store the expected and observed visitor counts. The observed counts sum to 305, so under the equal-traffic hypothesis each day expects 305 / 5 = 61 visitors.
exp = [61, 61, 61, 61, 61]
obs = [40, 70, 60, 50, 85]
The Chi-Square Goodness of Fit Test is something we can do with these data to verify the hypothesis. Use the method chisquare()
with the above-created arrays exp and obs.
chisquare(obs,exp)
The Chi-Square test statistic is 20.0, and the p-value is about 0.0005.
Remember that the null and alternative hypotheses for a Chi-Square Goodness of Fit Test are as follows:
- H0: The variable follows the hypothesized distribution.
- H1: The variable does not follow the hypothesized distribution.
We can reject the null hypothesis since the p-value (about 0.0005) is less than 0.05: visitor traffic is not equal across the days. If you don’t know about hypotheses, then read the below sub-section “Python Scipy Chi-Square Test P-Value”.
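The same decision can be made with a critical value instead of the p-value, by comparing the statistic with chi2.ppf(0.95, df=k-1). A sketch that derives the expected counts from the observed total:

```python
import numpy as np
from scipy.stats import chisquare, chi2

obs = np.array([40, 70, 60, 50, 85])

# Equal-traffic hypothesis: each of the 5 days expects 305 / 5 = 61 visits
exp = np.full(5, obs.sum() / 5)

stat, p = chisquare(obs, f_exp=exp)

# Critical value at alpha = 0.05 with k - 1 = 4 degrees of freedom
critical = chi2.ppf(0.95, df=4)
reject = stat > critical        # same decision as p < 0.05
print(stat, critical, reject)   # 20.0, about 9.49, True
```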
Read: Python Scipy Special
Python Scipy Chi-Square Test P-Value
A p-value is used to measure the significance of your results concerning the null hypothesis when doing chi-square tests.
To understand the use of the p-value, we first need to know about hypotheses. A hypothesis is an educated assumption about something in the world around us; it should be observable or testable.
There are two kinds of hypotheses:
- Null Hypothesis (H0): A test’s null hypothesis always predicts that no effect or association exists between variables.
- Alternate Hypothesis (H1): An alternate hypothesis is a declaration that two measurable variables have some statistical significance.
Now that we know about the null and alternate hypotheses, the p-value tells us which one the data support.
- In general, if the p-value is less than 0.05, it provides significant evidence against the null hypothesis, since there is less than a 5% chance of observing data this extreme if the null hypothesis were true. As a result, the null hypothesis is rejected in favor of the alternative hypothesis.
- A p-value greater than or equal to 0.05 is not statistically significant and means the data do not provide evidence against the null hypothesis. In that case, the null hypothesis is retained and the alternative hypothesis is not accepted.
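This decision rule can be written as a tiny helper (decide() is a hypothetical name, not part of SciPy):

```python
from scipy.stats import chisquare

ALPHA = 0.05

def decide(p_value, alpha=ALPHA):
    """Hypothetical helper: translate a p-value into a decision."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# Nearly uniform observed counts: expect a large p-value
stat, p = chisquare([18, 22, 20, 20])
print(decide(p))        # fail to reject H0
```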
Also, take a look at some more Python Scipy tutorials.
- Scipy Normal Distribution
- Python Scipy Stats Mode
- Python Scipy Exponential
- Scipy Misc + Examples
- Scipy Rotate Image
- Scipy Constants + Examples
So, in this tutorial, we have learned about the “Python Scipy Chi-Square Test” and covered the following topics.
- Python Scipy Chi-Square Test
- Python Scipy Chi-Square Test of Independence
- Python Scipy Chi-Square Test Contingency Table
- Python Scipy Chi-Square Test Categorical Variables
- Python Scipy Chi-Square Test Normal Distribution
- Python Scipy Chi-Square Test Goodness of Fit
- Python Scipy Chi-Square Test P-Value
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I gained expertise in various Python libraries like Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, Scikit-Learn, etc., for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc. Check out my profile.