Recently, I was working on a data analysis project where I needed to determine which statistical distribution best fit my dataset. This is a common challenge in data science, particularly when attempting to make predictions or uncover underlying patterns in your data.
The good news is that Python’s SciPy library makes this process simple with its powerful stats module.
In this article, I’ll walk you through how to use SciPy’s stats module to fit various statistical distributions to your data. I’ll cover everything from basic distribution fitting to more advanced techniques using real-world examples.
What is SciPy Stats Fit?
SciPy’s stats module offers functions that facilitate fitting your data to various statistical distributions. This is particularly useful when you want to:
- Understand the underlying distribution of your data
- Make probability-based predictions
- Perform hypothesis testing
- Generate simulated data that matches your observed data
The process essentially involves finding the distribution parameters that best describe your dataset.
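In other words, you estimate parameters from the data and then reuse the fitted distribution object. A minimal sketch (the sample data here is made up purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
observed = rng.normal(50, 5, size=1000)  # stand-in for your real data

# Estimate the parameters that best describe the data (MLE under the hood)
mu, sigma = stats.norm.fit(observed)

# Reuse the fitted distribution, e.g. to generate simulated data
# that matches the observed data
simulated = stats.norm.rvs(mu, sigma, size=1000, random_state=rng)
print(f"mu ≈ {mu:.1f}, sigma ≈ {sigma:.1f}")
```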
Method 1: Fitting Data to a Normal Distribution
The normal distribution (also known as the Gaussian distribution) is one of the most commonly used distributions in statistics. Here’s how to fit your data to a normal distribution:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Sample data: US adult heights in inches (simulated)
heights = np.random.normal(69.1, 2.9, 1000) # Mean height of 69.1 inches with std dev of 2.9
# Fit to normal distribution
mu, sigma = stats.norm.fit(heights)
# Create a plot to visualize the fit
x = np.linspace(min(heights), max(heights), 100)
pdf = stats.norm.pdf(x, mu, sigma)
plt.hist(heights, bins=30, density=True, alpha=0.6, color='g', label='Height Data')
plt.plot(x, pdf, 'k', linewidth=2, label=f'Normal Fit: μ={mu:.2f}, σ={sigma:.2f}')
plt.title("US Adult Heights with Normal Distribution Fit")
plt.xlabel("Height (inches)")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
print(f"Fitted parameters: Mean (μ) = {mu:.2f}, Standard Deviation (σ) = {sigma:.2f}")

This code fits a normal distribution to a dataset of simulated US adult heights. The stats.norm.fit() function returns the optimal parameters (mean and standard deviation) that best describe your data.
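Once you have μ and σ, the fitted distribution can answer probability questions directly. A small sketch (the 69.1/2.9 values mirror the simulated parameters above):

```python
from scipy import stats

mu, sigma = 69.1, 2.9  # fitted parameters from the example above

# P(height > 72 inches) under the fitted normal (survival function)
p_tall = stats.norm.sf(72, mu, sigma)

# Height below which 95% of the population falls (inverse CDF)
h95 = stats.norm.ppf(0.95, mu, sigma)

print(f"P(height > 72) = {p_tall:.3f}, 95th percentile = {h95:.1f} in")
```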
Method 2: Testing Multiple Distributions
Often, you won’t know which distribution best fits your data. In such cases, you can test multiple distributions and compare their goodness of fit:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
# Sample data: Annual rainfall in Seattle (in inches)
rainfall = np.array([37.1, 38.6, 35.9, 40.2, 33.7, 39.1, 41.5, 34.8, 36.7, 42.3,
                     39.8, 37.2, 36.5, 38.9, 42.1, 35.6, 40.7, 39.2, 38.1, 36.9])
# Distributions to try
distributions = [
    stats.norm,        # Normal
    stats.gamma,       # Gamma
    stats.lognorm,     # Log-normal
    stats.weibull_min  # Weibull
]
# Fit each distribution and calculate goodness of fit
results = []
for dist in distributions:
    # Fit distribution
    params = dist.fit(rainfall)
    # Run the Kolmogorov-Smirnov goodness-of-fit test
    ks_stat, p_value = stats.kstest(rainfall, dist.name, args=params)
    # Store results
    results.append({
        'distribution': dist.name,
        'parameters': params,
        'ks_statistic': ks_stat,
        'p_value': p_value
    })
# Convert to DataFrame for better visualization
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('ks_statistic')
print("Distribution Rankings (lower KS statistic = better fit):")
print(results_df[['distribution', 'ks_statistic', 'p_value']])
# Plot the best fitting distribution
best_dist_name = results_df.iloc[0]['distribution']
best_dist = getattr(stats, best_dist_name)
best_params = results_df.iloc[0]['parameters']
x = np.linspace(min(rainfall) - 5, max(rainfall) + 5, 100)
pdf = best_dist.pdf(x, *best_params)
plt.hist(rainfall, bins=10, density=True, alpha=0.6, color='b', label='Seattle Rainfall Data')
plt.plot(x, pdf, 'r', linewidth=2, label=f'Best Fit: {best_dist_name}')
plt.title("Annual Rainfall in Seattle with Best Distribution Fit")
plt.xlabel("Rainfall (inches)")
plt.ylabel("Probability Density")
plt.legend()
plt.show()

This code tests multiple distributions against a dataset of simulated annual rainfall in Seattle and uses the Kolmogorov-Smirnov test to determine which distribution provides the best fit.
Method 3: Fitting Custom Distributions with curve_fit
Sometimes, you may need to fit your data to a custom distribution or function. You can use the curve_fit function from scipy.optimize for this purpose:
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
# Sample data: US stock market daily returns (simulated)
returns = np.random.normal(0.0005, 0.01, 1000) # Mean daily return of 0.05% with volatility of 1%
# Define a custom distribution function (skew-normal PDF)
def skewed_normal(x, mu, sigma, alpha):
    t = (x - mu) / sigma
    return 2 / sigma * stats.norm.pdf(t) * stats.norm.cdf(alpha * t)
# Prepare x and y values for fitting: bin the data and fit the PDF to the
# observed densities (a PDF should not be fitted to cumulative frequencies)
counts, bin_edges = np.histogram(returns, bins=30, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
# Fit the custom distribution
params, covariance = curve_fit(skewed_normal, bin_centers, counts, p0=[0, 0.01, 0])
mu_fit, sigma_fit, alpha_fit = params
# Plot the results
x = np.linspace(min(returns), max(returns), 100)
fitted_pdf = skewed_normal(x, mu_fit, sigma_fit, alpha_fit)
plt.hist(returns, bins=30, density=True, alpha=0.6, color='skyblue', label='Daily Returns')
plt.plot(x, fitted_pdf, 'r', linewidth=2,
         label=f'Skewed Normal: μ={mu_fit:.4f}, σ={sigma_fit:.4f}, α={alpha_fit:.2f}')
plt.title("US Stock Market Daily Returns with Skewed Normal Fit")
plt.xlabel("Daily Return")
plt.ylabel("Probability Density")
plt.legend()
plt.tight_layout()
plt.show()
# Print the fitted parameters
print(f"Fitted parameters: μ={mu_fit:.6f}, σ={sigma_fit:.6f}, α={alpha_fit:.4f}")

This method allows you to fit more complex distributions that might better represent real-world data, such as financial returns that often exhibit skewness.
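Note that for this particular shape you don't have to roll your own: SciPy ships a skew-normal distribution (stats.skewnorm), so a simpler alternative is to fit it directly via MLE. A quick sketch on a deliberately skewed simulated sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated skewed returns; a is the skewness parameter
returns = stats.skewnorm.rvs(a=4, loc=-0.005, scale=0.012, size=2000,
                             random_state=rng)

# MLE fit of the built-in skew-normal distribution
a, loc, scale = stats.skewnorm.fit(returns)
print(f"a = {a:.2f}, loc = {loc:.4f}, scale = {scale:.4f}")
```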
Method 4: Using Maximum Likelihood Estimation (MLE)
SciPy's fit methods use Maximum Likelihood Estimation under the hood. This example makes the MLE fit explicit for an exponential distribution and pairs it with a Q-Q plot to check the quality of the fit:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Sample data: Customer wait times at a US fast food restaurant (in minutes)
wait_times = np.random.exponential(scale=3.5, size=200) # Mean wait time of 3.5 minutes
# Fit using MLE for exponential distribution
loc, scale = stats.expon.fit(wait_times)
# Create a probability plot to check the fit
plt.figure(figsize=(10, 6))
# Plot 1: Histogram and fitted PDF
plt.subplot(1, 2, 1)
x = np.linspace(0, max(wait_times), 100)
pdf = stats.expon.pdf(x, loc=loc, scale=scale)
plt.hist(wait_times, bins=20, density=True, alpha=0.6, color='orange', label='Wait Times')
plt.plot(x, pdf, 'b', linewidth=2, label=f'Exponential: λ={1/scale:.3f}')
plt.title("Wait Times with Exponential Fit")
plt.xlabel("Wait Time (minutes)")
plt.ylabel("Probability Density")
plt.legend()
# Plot 2: Q-Q plot
plt.subplot(1, 2, 2)
stats.probplot(wait_times, dist=stats.expon(loc=loc, scale=scale), plot=plt)
plt.title("Q-Q Plot for Exponential Fit")
plt.tight_layout()
plt.show()
print(f"Fitted parameters: Location = {loc:.4f}, Scale (mean) = {scale:.4f}")
The Q-Q plot is particularly useful as it shows how well your data matches the theoretical distribution. Points following a straight line indicate a good fit.
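One practical refinement: for wait times the location is physically zero, and the fit methods accept keyword constraints such as floc to pin a parameter instead of estimating it. A small sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
wait_times = rng.exponential(scale=3.5, size=200)  # simulated wait times

# Pin loc to 0 so only the scale (mean wait) is estimated
loc, scale = stats.expon.fit(wait_times, floc=0)
print(f"loc = {loc}, mean wait ≈ {scale:.2f} min, rate λ ≈ {1/scale:.3f}")
```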
Method 5: Bootstrapping for Confidence Intervals
When fitting distributions, it’s often important to understand the uncertainty in your parameter estimates. Bootstrapping is a powerful technique for this:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Sample data: US household incomes (in $1000s, simulated)
incomes = np.random.lognormal(mean=3.8, sigma=0.6, size=500)
# Fit lognormal distribution
shape, loc, scale = stats.lognorm.fit(incomes)
# Bootstrapping to get confidence intervals
n_bootstraps = 1000
bootstrap_params = []
for i in range(n_bootstraps):
    # Create bootstrap sample by resampling with replacement
    bootstrap_sample = np.random.choice(incomes, size=len(incomes), replace=True)
    # Fit distribution to the resampled data
    params = stats.lognorm.fit(bootstrap_sample)
    bootstrap_params.append(params)
# Convert to numpy array for easier calculations
bootstrap_params = np.array(bootstrap_params)
# Calculate 95% confidence intervals
confidence_intervals = {}
for i, param_name in enumerate(['shape', 'loc', 'scale']):
    lower = np.percentile(bootstrap_params[:, i], 2.5)
    upper = np.percentile(bootstrap_params[:, i], 97.5)
    confidence_intervals[param_name] = (lower, upper)
# Plot the data and fitted distribution
x = np.linspace(min(incomes), max(incomes), 1000)
pdf = stats.lognorm.pdf(x, shape, loc, scale)
plt.figure(figsize=(10, 6))
plt.hist(incomes, bins=30, density=True, alpha=0.6, color='purple', label='Income Data')
plt.plot(x, pdf, 'g', linewidth=2, label=f'Lognormal Fit: σ={shape:.2f}, μ={np.log(scale):.2f}')
# Plot some bootstrap samples to show uncertainty
for i in range(10):
    random_idx = np.random.randint(0, n_bootstraps)
    bs_shape, bs_loc, bs_scale = bootstrap_params[random_idx]
    bs_pdf = stats.lognorm.pdf(x, bs_shape, bs_loc, bs_scale)
    plt.plot(x, bs_pdf, 'r', alpha=0.1)
plt.title("US Household Incomes with Lognormal Fit")
plt.xlabel("Annual Income ($1000s)")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
# Print fitted parameters with confidence intervals
print(f"Shape (σ): {shape:.4f} (95% CI: {confidence_intervals['shape'][0]:.4f} - {confidence_intervals['shape'][1]:.4f})")
print(f"Location: {loc:.4f} (95% CI: {confidence_intervals['loc'][0]:.4f} - {confidence_intervals['loc'][1]:.4f})")
print(f"Scale: {scale:.4f} (95% CI: {confidence_intervals['scale'][0]:.4f} - {confidence_intervals['scale'][1]:.4f})")
print(f"Lognormal μ (mean of log): {np.log(scale):.4f}")
This method provides confidence intervals for your distribution parameters, giving you a better understanding of the uncertainty in your estimates.
Method 6: Fitting Multivariate Distributions
Sometimes your data has multiple dimensions that are correlated. In these cases, you’ll need to fit multivariate distributions:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Generate sample data: Height (inches) and Weight (pounds) for US adults
n_samples = 500
mean = [69, 170] # Mean height and weight
cov = [[10, 70], [70, 500]] # Covariance matrix (height and weight are correlated)
# Generate multivariate normal data
data = np.random.multivariate_normal(mean, cov, n_samples)
heights = data[:, 0]
weights = data[:, 1]
# Fit multivariate normal distribution
mu_fit = np.mean(data, axis=0)
cov_fit = np.cov(data, rowvar=False)
# Create a 3D plot to visualize the data and the fitted distribution
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
# Plot the original points
ax.scatter(heights, weights, np.zeros_like(heights), color='blue', alpha=0.5, label='Data Points')
# Create a grid for the PDF
h = np.linspace(min(heights), max(heights), 30)
w = np.linspace(min(weights), max(weights), 30)
H, W = np.meshgrid(h, w)
# Stack the grid points
pos = np.empty(H.shape + (2,))
pos[:, :, 0] = H
pos[:, :, 1] = W
# Calculate the PDF values
rv = stats.multivariate_normal(mu_fit, cov_fit)
pdf = rv.pdf(pos)
# Plot the PDF surface
ax.plot_surface(H, W, pdf, cmap='viridis', alpha=0.7)
ax.set_xlabel('Height (inches)')
ax.set_ylabel('Weight (pounds)')
ax.set_zlabel('Probability Density')
ax.set_title('Multivariate Normal Fit for US Adult Height and Weight')
plt.tight_layout()
plt.show()
# Print fitted parameters
print(f"Fitted means: Height = {mu_fit[0]:.2f} inches, Weight = {mu_fit[1]:.2f} pounds")
print(f"Fitted covariance matrix:")
print(cov_fit)
print(f"Correlation coefficient: {cov_fit[0,1]/np.sqrt(cov_fit[0,0]*cov_fit[1,1]):.4f}")
This example demonstrates fitting a multivariate normal distribution to height and weight data, which are naturally correlated variables.
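One payoff of the multivariate fit: the frozen multivariate_normal object can generate new correlated samples, e.g. for simulation. A minimal sketch reusing a fitted mean and covariance (the numbers mirror the example above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu_fit = np.array([69.0, 170.0])      # fitted means (height, weight)
cov_fit = np.array([[10.0, 70.0],
                    [70.0, 500.0]])   # fitted covariance matrix

rv = stats.multivariate_normal(mu_fit, cov_fit)
synthetic = rv.rvs(size=1000, random_state=rng)

# Synthetic data preserves the height-weight correlation
corr = np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1]
print(f"simulated correlation ≈ {corr:.2f}")
```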
Practical Application: Weather Data Analysis
Let’s apply these methods to a real-world example: analyzing US temperature variations:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Sample data: Daily temperature variations from average in New York (in °F)
temp_variations = np.random.normal(0, 7.5, 365) # Standard deviation of 7.5°F
# Apply various distributions
distributions = [
    {'name': 'Normal', 'dist': stats.norm},
    {'name': 'Cauchy', 'dist': stats.cauchy},
    {'name': 'Laplace', 'dist': stats.laplace}
]
plt.figure(figsize=(12, 6))
# Plot histogram of data
plt.subplot(1, 2, 1)
plt.hist(temp_variations, bins=25, density=True, alpha=0.6, color='lightblue',
         label='Temperature Variations')
# Fit and plot each distribution
x = np.linspace(min(temp_variations), max(temp_variations), 100)
for dist_info in distributions:
    dist = dist_info['dist']
    params = dist.fit(temp_variations)
    pdf = dist.pdf(x, *params)
    plt.plot(x, pdf, linewidth=2, label=f"{dist_info['name']}")
plt.title("New York Temperature Variations")
plt.xlabel("Variation from Average (°F)")
plt.ylabel("Probability Density")
plt.legend()
# Create AIC comparison
plt.subplot(1, 2, 2)
results = []
for dist_info in distributions:
    dist = dist_info['dist']
    params = dist.fit(temp_variations)
    log_likelihood = np.sum(dist.logpdf(temp_variations, *params))
    k = len(params)
    aic = 2 * k - 2 * log_likelihood
    results.append({
        'Distribution': dist_info['name'],
        'AIC': aic
    })
# Plot AIC comparison
distributions_names = [r['Distribution'] for r in results]
aic_values = [r['AIC'] for r in results]
plt.bar(distributions_names, aic_values, color='lightgreen')
plt.title("AIC Comparison (Lower is Better)")
plt.ylabel("AIC Value")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
# Print best distribution based on AIC
best_dist = min(results, key=lambda x: x['AIC'])
print(f"Best distribution: {best_dist['Distribution']} (AIC: {best_dist['AIC']:.2f})")
Finding the right statistical distribution to fit your data is a crucial step in many data science and statistical analysis tasks. SciPy’s stats module provides a comprehensive set of tools to make this process simple and effective.
Throughout this article, I’ve demonstrated various methods for fitting distributions to data, from simple normal distributions to complex multivariate models. I’ve also shown how to evaluate the goodness of fit and compare different distributions.
Remember that the best distribution for your data depends on your specific use case and the nature of your data. Always validate your fitted distribution with goodness-of-fit tests and visual inspection before concluding.

I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, Scikit-Learn, etc., for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and other countries. Check out my profile.