Over my 10+ years of Python development, I’ve found that data analysis always comes back to one essential skill: understanding statistics. And when it comes to statistical analysis in Python, SciPy Stats has been my go-to library.
Whether I’m analyzing customer behavior for e-commerce platforms or processing election polling data, SciPy Stats has proven to be an invaluable tool in my Python arsenal.
In this guide, I’ll walk you through everything you need to know about SciPy Stats, from basic functions to advanced techniques. I’ll focus on practical, real-world examples rather than theoretical concepts.
What is SciPy Stats?
SciPy Stats is a module within the larger SciPy ecosystem, specifically designed for statistical functions and probability distributions. It’s built on top of NumPy, making it lightning-fast for numerical computations.
I’ve relied on SciPy Stats for years because it offers over 100 probability distributions, numerous statistical tests, and functions for descriptive statistics.
Install SciPy
Before diving in, make sure you have SciPy installed. It’s easy:

```bash
pip install scipy
```

Once installed, you can import the stats module like this:

```python
from scipy import stats
import numpy as np
```

Descriptive Statistics with SciPy Stats
Let’s start with the basics. Descriptive statistics help us understand the main features of our data.
Mean, Median, and Mode
Suppose we have customer age data from a US retail chain:

```python
# Sample ages of customers
ages = [24, 32, 45, 32, 56, 28, 37, 42, 32, 41]

# Calculate mean
mean_age = np.mean(ages)
mean_age_scipy = stats.tmean(ages)

# Calculate median
median_age = np.median(ages)
median_age_scipy = stats.mstats.mquantiles(ages, prob=0.5)[0]

# Calculate mode (with keepdims=True to return arrays)
mode_age = stats.mode(ages, keepdims=True)

# Output
print(f"Mean age: {mean_age}")
print(f"Median age: {median_age}")
print(f"Mode age: {mode_age.mode[0]} (occurs {mode_age.count[0]} times)")
```

Output:

```
Mean age: 36.9
Median age: 34.5
Mode age: 32 (occurs 3 times)
```
This gives us a quick snapshot of our customer demographic.
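If you want several of these measures at once, SciPy’s `stats.describe` bundles the count, range, mean, variance, skewness, and kurtosis into a single call. Here it is on the same ages list:

```python
from scipy import stats

ages = [24, 32, 45, 32, 56, 28, 37, 42, 32, 41]

# describe() returns count, min/max, mean, variance, skewness, kurtosis in one call
summary = stats.describe(ages)
print(f"n = {summary.nobs}, range = {summary.minmax}")
print(f"mean = {summary.mean:.1f}, variance = {summary.variance:.2f}")
```

This is handy for a first pass before deciding which individual statistics to dig into.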
Standard Deviation and Variance
Let’s continue with our customer data:
```python
# Calculate standard deviation
std_dev = np.std(ages, ddof=1)  # ddof=1 for sample standard deviation
# or with scipy
std_dev_scipy = stats.tstd(ages)

# Calculate variance
variance = np.var(ages, ddof=1)
# or with scipy
variance_scipy = stats.tvar(ages)

print(f"Standard deviation: {std_dev}")
print(f"Variance: {variance}")
```

Output:

```
Standard deviation: 9.37431478977412
Variance: 87.8777777777778
```
Standard deviation tells me how spread out the customer ages are from the mean, which is crucial for understanding whether we’re serving a narrow or broad age demographic.
Probability Distributions
SciPy Stats truly shines when working with probability distributions. I frequently use these for simulating data and making predictions.
Normal Distribution
The normal distribution is everywhere in statistics, from modeling IQ scores to stock market returns:
```python
# Create a normal distribution with mean 100, standard deviation 15 (similar to IQ scores)
iq_dist = stats.norm(loc=100, scale=15)

# Generate 1000 random IQ scores
random_iq_scores = iq_dist.rvs(size=1000)

# What's the probability of having an IQ above 130? (considered gifted)
probability_above_130 = 1 - iq_dist.cdf(130)
print(f"Probability of IQ above 130: {probability_above_130:.4f}")

# What IQ score represents the top 2%?
top_two_percent = iq_dist.ppf(0.98)
print(f"IQ score for top 2%: {top_two_percent:.2f}")
```

Output:

```
Probability of IQ above 130: 0.0228
IQ score for top 2%: 130.81
```
I’ve used this to model everything from test scores to customer spending patterns.
Binomial Distribution
Perfect for modeling binary outcomes, like election results or A/B testing:
```python
# Modeling election results in a precinct with 1000 voters
# Assuming a 52% chance for Candidate A (national average)
election = stats.binom(n=1000, p=0.52)

# Probability of Candidate A getting more than 525 votes
probability_winning_big = 1 - election.cdf(525)
print(f"Probability of getting more than 525 votes: {probability_winning_big:.4f}")

# Simulate 10 different precincts
precinct_results = election.rvs(size=10)
print(f"Simulated results from 10 precincts: {precinct_results}")
```

When running A/B tests for websites, I rely on binomial distributions to determine whether observed differences are statistically significant.
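As a small sketch of that A/B workflow, `stats.binomtest` runs an exact binomial test against a fixed baseline rate. The 8% baseline and the 150-of-1500 conversion count below are hypothetical numbers for illustration, not results from a real test:

```python
from scipy import stats

# Hypothetical numbers: 150 conversions out of 1500 visitors,
# tested against an assumed historical baseline rate of 8%
result = stats.binomtest(k=150, n=1500, p=0.08, alternative='greater')

print(f"Observed rate: {150 / 1500:.3f}")
print(f"P-value: {result.pvalue:.4f}")
```

A small p-value here suggests the observed 10% rate is unlikely under the 8% baseline.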
Hypothesis Testing
Hypothesis testing is where I’ve found SciPy Stats to be most valuable in my career. It helps answer the question: “Is what I’m seeing real, or just random chance?”
T-Tests
Let’s say we’re comparing the effectiveness of two different marketing campaigns in different US states:
```python
# Sales from Campaign A (California)
campaign_a = [12500, 11000, 14500, 13000, 15000, 14000, 13500]
# Sales from Campaign B (New York)
campaign_b = [9500, 11500, 12000, 10000, 9000, 11000, 10500]

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(campaign_a, campaign_b)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Campaign A performed significantly better than Campaign B")
else:
    print("No significant difference between campaigns")
```

This test has saved me countless hours of debate with marketing teams about which campaign truly performed better.
ANOVA (Analysis of Variance)
When comparing more than two groups, ANOVA is my go-to method:
```python
# Sales data from three regions: West Coast, Midwest, East Coast
west_coast = [45000, 42000, 44000, 48000, 46000]
midwest = [38000, 40000, 37000, 43000, 39000]
east_coast = [41000, 44000, 40000, 45000, 42000]

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(west_coast, midwest, east_coast)

print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("There are significant differences between regions")
else:
    print("No significant differences between regions")
```

I’ve used this to analyze everything from regional sales differences to comparing performance metrics across different server configurations.
Chi-Square Test
The chi-square test is perfect for categorical data analysis:
```python
# Observed data: product preference by age group
# Columns: Product A, Product B, Product C
# Rows: 18-34, 35-54, 55+ age groups
observed = np.array([
    [120, 90, 40],   # 18-34 age group
    [100, 110, 60],  # 35-54 age group
    [70, 100, 80],   # 55+ age group
])

# Perform chi-square test
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")

if p < 0.05:
    print("There's a significant relationship between age and product preference")
else:
    print("No significant relationship between age and product preference")
```

This test has been particularly useful when analyzing customer segment preferences for different products or features.
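A significant p-value alone doesn’t say how strong the association is. One common follow-up is Cramér’s V, an effect size derived from the same chi-square statistic. SciPy doesn’t expose this formula directly, so the sketch below computes it by hand:

```python
import numpy as np
from scipy import stats

observed = np.array([
    [120, 90, 40],   # 18-34 age group
    [100, 110, 60],  # 35-54 age group
    [70, 100, 80],   # 55+ age group
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Cramér's V ranges from 0 (no association) to 1 (perfect association)
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Cramér's V: {cramers_v:.4f}")
```

Values near 0.1 suggest a weak association, around 0.3 moderate, and 0.5+ strong, though these cutoffs are rules of thumb rather than hard thresholds.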
Correlation and Regression
Understanding relationships between variables is a fundamental part of data analysis.
Pearson Correlation
Let’s analyze the relationship between advertising spend and sales:
```python
# Ad spend (thousands of dollars)
ad_spend = [5, 7, 10, 12, 15, 18, 20, 22, 25, 30]
# Sales (thousands of dollars)
sales = [25, 29, 33, 40, 42, 51, 53, 55, 61, 70]

# Calculate Pearson correlation
correlation, p_value = stats.pearsonr(ad_spend, sales)

print(f"Correlation coefficient: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    if correlation > 0:
        print("There is a significant positive correlation")
    else:
        print("There is a significant negative correlation")
else:
    print("No significant correlation")
```

I’ve used this analysis countless times to justify marketing budgets by showing the relationship between spend and revenue.
Linear Regression
Building on correlation, let’s create a predictive model:
```python
# Perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(ad_spend, sales)

# Create a function for the line of best fit
def predict_sales(x):
    return slope * x + intercept

# Predict sales for a $27,000 ad spend
predicted_sales = predict_sales(27)

print(f"Regression equation: Sales = {slope:.2f} × Ad Spend + {intercept:.2f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"Predicted sales for $27,000 ad spend: ${predicted_sales:.2f}k")
```

This simple regression model has helped me forecast everything from sales projections to user growth based on marketing investments.
Non-Parametric Tests
When data doesn’t follow a normal distribution, I turn to non-parametric tests.
Mann-Whitney U Test
This is a great alternative to the t-test when normality assumptions aren’t met:
```python
# Customer satisfaction scores for two different store layouts
layout_a = [4, 5, 3, 4, 5, 5, 4, 3, 5, 4]  # Traditional layout
layout_b = [3, 4, 2, 3, 4, 5, 3, 2, 3, 4]  # New experimental layout

# Perform Mann-Whitney U test
stat, p_value = stats.mannwhitneyu(layout_a, layout_b)

print(f"U-statistic: {stat}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("There's a significant difference in satisfaction between layouts")
else:
    print("No significant difference in satisfaction between layouts")
```

This has been particularly useful when analyzing survey data where responses tend to be skewed.
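Before reaching for a non-parametric test, it’s worth confirming that the data actually look non-normal. A quick sketch using `stats.skew` and `stats.shapiro` on the layout B scores:

```python
from scipy import stats

# Satisfaction scores for the experimental layout
layout_b = [3, 4, 2, 3, 4, 5, 3, 2, 3, 4]

# Skewness far from 0, or a small Shapiro-Wilk p-value, argues for a
# non-parametric test like Mann-Whitney U
skewness = stats.skew(layout_b)
stat, p = stats.shapiro(layout_b)

print(f"Skewness: {skewness:.4f}")
print(f"Shapiro-Wilk p-value: {p:.4f}")
```

With small ordinal samples like this, the Shapiro-Wilk test has limited power, so I treat it as one signal among several rather than a definitive verdict.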
Confidence Intervals
Reporting point estimates without confidence intervals is like telling half the story:
```python
# Monthly revenue data ($thousands)
revenue = [105, 112, 98, 107, 115, 120, 103, 110, 108, 117, 102, 114]

# Calculate a 95% confidence interval for mean revenue
mean_revenue = np.mean(revenue)
ci = stats.t.interval(0.95, len(revenue) - 1, loc=mean_revenue,
                      scale=stats.sem(revenue))

print(f"Mean monthly revenue: ${mean_revenue:.2f}k")
print(f"95% confidence interval: ${ci[0]:.2f}k to ${ci[1]:.2f}k")
```

This analysis helps me communicate not just the average outcome, but also the range of likely values. Confidence intervals have been crucial when presenting forecasts to stakeholders who need to understand the uncertainty in our estimates.
Bootstrap Methods
When theoretical assumptions don’t hold, bootstrapping gives me reliable statistics:
```python
# Customer lifetime value data (may not be normally distributed)
customer_ltv = [230, 540, 185, 295, 410, 950, 270, 380, 320, 1200, 240, 450]

# Bootstrap to estimate the mean and its confidence interval
n_bootstraps = 10000
bootstrap_means = np.zeros(n_bootstraps)
for i in range(n_bootstraps):
    # Sample with replacement
    sample = np.random.choice(customer_ltv, size=len(customer_ltv), replace=True)
    bootstrap_means[i] = np.mean(sample)

# Calculate the 95% confidence interval
conf_interval = np.percentile(bootstrap_means, [2.5, 97.5])

print(f"Mean Customer LTV: ${np.mean(customer_ltv):.2f}")
print(f"95% Bootstrap CI: ${conf_interval[0]:.2f} to ${conf_interval[1]:.2f}")
```
I’ve used bootstrapping extensively when working with skewed financial data like customer lifetime values or irregular conversion rates.
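Recent SciPy versions (1.7+) also ship a built-in `stats.bootstrap` that handles the resampling loop for you. A minimal sketch on the same LTV data, using the percentile method to match the manual loop:

```python
import numpy as np
from scipy import stats

customer_ltv = [230, 540, 185, 295, 410, 950, 270, 380, 320, 1200, 240, 450]

# stats.bootstrap expects a sequence of samples, hence the (customer_ltv,) tuple
res = stats.bootstrap((customer_ltv,), np.mean, confidence_level=0.95,
                      n_resamples=2000, method='percentile')

ci = res.confidence_interval
print(f"95% Bootstrap CI: ${ci.low:.2f} to ${ci.high:.2f}")
```

The default `method='BCa'` applies bias and skew corrections and is usually the better choice for skewed data like LTV; I used `'percentile'` here only to mirror the hand-rolled version.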
Multivariate Analysis with Scipy Stats
When dealing with multiple variables, SciPy Stats continues to deliver powerful tools.
Multiple Linear Regression
While SciPy Stats doesn’t directly support multiple regression (I typically use statsmodels for this), we can use it for correlation analysis before building more complex models:
```python
# Data: house price vs. size and age
house_size = [1400, 1800, 1500, 2100, 1900, 2400, 1600, 2300, 2100, 1700]  # sq ft
house_age = [12, 2, 8, 1, 3, 5, 15, 4, 7, 10]  # years
house_price = [210000, 350000, 240000, 420000, 380000, 450000, 220000, 430000, 400000, 260000]

# Calculate pairwise correlations
size_price_corr, _ = stats.pearsonr(house_size, house_price)
age_price_corr, _ = stats.pearsonr(house_age, house_price)
size_age_corr, _ = stats.pearsonr(house_size, house_age)

print(f"Correlation between size and price: {size_price_corr:.4f}")
print(f"Correlation between age and price: {age_price_corr:.4f}")
print(f"Correlation between size and age: {size_age_corr:.4f}")
```

This preliminary analysis helps me identify key relationships before building more complex models to predict home values in different markets.
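When you do need the full multiple regression, statsmodels is the standard tool. As a rough sketch of what it computes under the hood, the same fit can be expressed as ordinary least squares with NumPy; the coefficients below are illustrative only, not a validated pricing model:

```python
import numpy as np

house_size = [1400, 1800, 1500, 2100, 1900, 2400, 1600, 2300, 2100, 1700]
house_age = [12, 2, 8, 1, 3, 5, 15, 4, 7, 10]
house_price = [210000, 350000, 240000, 420000, 380000, 450000, 220000, 430000, 400000, 260000]

# Design matrix: an intercept column plus the two predictors
X = np.column_stack([np.ones(len(house_size)), house_size, house_age])
y = np.array(house_price, dtype=float)

# Ordinary least squares fit
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, size_coef, age_coef = coeffs
print(f"Price ≈ {intercept:.0f} + {size_coef:.2f}·size + {age_coef:.2f}·age")
```

statsmodels adds what this sketch lacks: standard errors, p-values, and diagnostics for each coefficient.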
Practical Applications of SciPy Stats
Let’s look at some real-world applications I’ve implemented using SciPy Stats.
A/B Testing for Website Conversion
One of my most common use cases is analyzing A/B test results:
```python
# Conversion data for two webpage designs
# Design A: 120 conversions from 1500 visitors
# Design B: 150 conversions from 1500 visitors
conversions = np.array([120, 150])
visitors = np.array([1500, 1500])

# SciPy doesn't include a two-proportion z-test (statsmodels'
# proportions_ztest covers that), but it's straightforward to
# compute directly with stats.norm
p_a, p_b = conversions / visitors
p_pooled = conversions.sum() / visitors.sum()
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / visitors[0] + 1 / visitors[1]))
z_stat = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z_stat))  # two-sided

print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Calculate conversion rates
conv_rate_a = p_a * 100
conv_rate_b = p_b * 100
lift = (conv_rate_b - conv_rate_a) / conv_rate_a * 100
print(f"Conversion rate A: {conv_rate_a:.2f}%")
print(f"Conversion rate B: {conv_rate_b:.2f}%")
print(f"Relative improvement: {lift:.2f}%")

if p_value < 0.05:
    print("The difference is statistically significant!")
else:
    print("No significant difference detected")
```

This approach has helped me make data-driven decisions about website designs that increased conversion rates for numerous e-commerce clients.
Anomaly Detection in Time Series
Using SciPy Stats for anomaly detection has saved companies I’ve worked with thousands of dollars by identifying unusual patterns:
```python
# Daily website traffic data for a month
traffic = [1520, 1485, 1530, 1510, 1490, 950, 980, 1505, 1540, 1570,
           1490, 1520, 1505, 1490, 970, 930, 1550, 1510, 1480, 1520,
           1500, 1510, 1485, 1590, 1600, 1580, 1510, 1500, 950, 970]

# Calculate Z-scores
z_scores = stats.zscore(traffic)

# Identify anomalies (Z-score threshold of 2; the repeated dips inflate
# the standard deviation, so a stricter 2.5 cutoff would miss them)
anomalies = np.where(np.abs(z_scores) > 2)[0]

print("Anomalies detected on days:", anomalies + 1)  # +1 for 1-based indexing
print("Traffic values:", [traffic[i] for i in anomalies])
print("Z-scores:", [z_scores[i] for i in anomalies])
```

This simple approach flagged unusual drops in weekend traffic, helping us identify a website performance issue affecting mobile users.
Quality Control in Manufacturing
SciPy Stats has been invaluable for implementing statistical process control:
```python
# Widget dimensions from production line (target: 5.0mm, tolerance: ±0.1mm)
measurements = [5.02, 4.99, 5.01, 5.03, 4.98, 5.04, 5.01, 4.97, 5.02, 5.03,
                5.01, 5.00, 4.99, 5.02, 4.98, 5.03, 5.01, 5.02, 4.99, 5.01]

# Calculate process capability indices
mean = np.mean(measurements)
std = np.std(measurements, ddof=1)
usl = 5.1  # upper specification limit
lsl = 4.9  # lower specification limit

# Process capability (Cp)
cp = (usl - lsl) / (6 * std)

# Process capability index (Cpk)
cpu = (usl - mean) / (3 * std)
cpl = (mean - lsl) / (3 * std)
cpk = min(cpu, cpl)

print(f"Process mean: {mean:.4f}mm")
print(f"Process standard deviation: {std:.4f}mm")
print(f"Process capability (Cp): {cp:.4f}")
print(f"Process capability index (Cpk): {cpk:.4f}")

if cpk >= 1.33:
    print("Process is capable (high quality)")
elif cpk >= 1.0:
    print("Process is marginally capable")
else:
    print("Process is not capable (improvements needed)")
```

This analysis helped a manufacturing client improve their production process and reduce defect rates by identifying when and where variances were occurring.
Tips for Working with SciPy Stats
After years of working with SciPy Stats, I’ve learned a few tricks that can save you time and headaches:
- Always check your assumptions: Many statistical tests assume normality or equal variances. Use `stats.shapiro()` for normality tests and `stats.levene()` for equal variance tests.
- Visualize your data first: Though not part of SciPy Stats, always pair it with matplotlib or seaborn to visualize your data before analysis.
- Use the right test for your data type: Continuous, ordinal, and nominal data require different statistical approaches.
- Be cautious with small sample sizes: Most tests become less reliable with small samples. Consider non-parametric alternatives when n < 30.
- Report effect sizes, not just p-values: Statistical significance doesn’t always mean practical significance. Report effect sizes like Cohen’s d for a complete picture.
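As a sketch of the first and last tips combined, here’s how I might check assumptions and report an effect size for the campaign data from the t-test section. Cohen’s d is computed by hand because SciPy doesn’t provide it directly:

```python
import numpy as np
from scipy import stats

campaign_a = [12500, 11000, 14500, 13000, 15000, 14000, 13500]
campaign_b = [9500, 11500, 12000, 10000, 9000, 11000, 10500]

# 1. Normality check for each sample (H0: the data are normal)
_, p_norm_a = stats.shapiro(campaign_a)
_, p_norm_b = stats.shapiro(campaign_b)

# 2. Equal-variance check (H0: the variances are equal)
_, p_var = stats.levene(campaign_a, campaign_b)

# 3. Cohen's d: standardized mean difference using the pooled SD
pooled_std = np.sqrt((np.var(campaign_a, ddof=1) + np.var(campaign_b, ddof=1)) / 2)
cohens_d = (np.mean(campaign_a) - np.mean(campaign_b)) / pooled_std

print(f"Shapiro p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")
print(f"Levene p-value: {p_var:.3f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

If either Shapiro p-value is small, I consider the Mann-Whitney U test instead; if Levene’s test rejects equal variances, Welch’s t-test (`ttest_ind(..., equal_var=False)`) is the safer choice.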
SciPy Stats has been a cornerstone of my data analysis toolkit. From simple descriptive statistics to complex hypothesis testing, it provides a comprehensive suite of statistical tools that can handle most analytical needs.
Remember that statistical analysis is both a science and an art. The tools are powerful, but they require thoughtful application and interpretation. By understanding the strengths and limitations of each statistical method, you’ll be well-equipped to make data-driven decisions that drive real results.
So next time you’re faced with a dataset and need to uncover the story it tells, remember that Scipy Stats is there to help you translate the numbers into actionable insights. Happy analyzing!
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries, such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and Scikit-Learn, working with clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and elsewhere. Check out my profile.