Python SciPy Stats Skew

When working with data analysis in Python, understanding the shape and distribution of your data is crucial. One key metric that helps us describe data distribution is skewness, which measures the asymmetry of a probability distribution.

As a Python developer who has spent years analyzing data, I’ve found SciPy’s skew functions to be invaluable tools for understanding data distributions and making informed decisions.

In this article, I’ll walk you through everything you need to know about using Python’s SciPy stats skew functions with practical examples that you can apply to your projects.

What is Skewness?

Skewness measures the asymmetry of a probability distribution. In simple terms, it tells us if the data is tilted to one side.

A distribution with zero skewness is perfectly symmetrical, like a normal distribution (bell curve). When data is skewed positively, it has a longer tail on the right side. Conversely, negative skewness indicates a longer tail on the left.

Here’s what different skewness values tell us:

  • Skewness = 0: Perfectly symmetrical distribution
  • Skewness > 0: Positively skewed (right-tailed)
  • Skewness < 0: Negatively skewed (left-tailed)
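To make the definition concrete, here is a small sketch (using an arbitrary made-up dataset) of how the Fisher-Pearson coefficient of skewness, which is what scipy.stats.skew computes by default, can be calculated by hand from the central moments:

```python
import numpy as np
from scipy import stats

def manual_skew(x):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5, where m_k is the
    k-th central moment (sum of (x - mean)**k divided by n)."""
    x = np.asarray(x, dtype=float)
    m2 = np.mean((x - x.mean()) ** 2)
    m3 = np.mean((x - x.mean()) ** 3)
    return m3 / m2 ** 1.5

data = np.array([2.0, 8.0, 0.0, 4.0, 1.0, 9.0, 9.0, 0.0])
print(f"Manual: {manual_skew(data):.6f}")
print(f"SciPy:  {stats.skew(data):.6f}")
```

Both lines print the same value, because stats.skew uses this formula when its bias parameter is left at its default of True.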


Calculate Skewness with SciPy

SciPy offers several methods for calculating and working with skewness. Let’s explore the most common methods with examples.

Method 1: Use scipy.stats.skew()

The simplest way to calculate skewness is with the skew() function from the scipy.stats module.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Create a sample dataset - US household income (simplified example in $1000s)
household_income = np.array([42, 50, 55, 47, 60, 62, 58, 120, 150, 200, 45, 52, 48, 55, 59])

# Calculate skewness
skewness = stats.skew(household_income)
print(f"Skewness: {skewness:.4f}")

# Visualize the distribution
plt.figure(figsize=(10, 6))
plt.hist(household_income, bins=10, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(np.mean(household_income), color='red', linestyle='dashed', linewidth=1, label=f'Mean: {np.mean(household_income):.2f}')
plt.axvline(np.median(household_income), color='green', linestyle='dashed', linewidth=1, label=f'Median: {np.median(household_income):.2f}')
plt.title(f'US Household Income Distribution (Skewness: {skewness:.4f})')
plt.xlabel('Income ($1000s)')
plt.ylabel('Frequency')
plt.legend()
plt.show()


This example outputs a positive skewness value, indicating that the household income data is right-skewed. This is typical of income distributions: a small number of high-income households pulls the mean above the median.
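For small samples like this one (15 observations), it is worth knowing about the bias parameter of stats.skew. Passing bias=False applies the adjusted Fisher-Pearson correction, which compensates for the bias of the plain moment-based estimator. A short sketch using the same income data:

```python
import numpy as np
from scipy import stats

household_income = np.array([42, 50, 55, 47, 60, 62, 58, 120, 150, 200, 45, 52, 48, 55, 59])

g1 = stats.skew(household_income)              # default: biased moment estimator
G1 = stats.skew(household_income, bias=False)  # adjusted Fisher-Pearson estimator

# The two estimates are related by a factor of sqrt(n*(n-1)) / (n-2)
n = len(household_income)
print(f"Biased estimate:   {g1:.4f}")
print(f"Adjusted estimate: {G1:.4f}")
print(f"Correction factor: {np.sqrt(n * (n - 1)) / (n - 2):.4f}")
```

The correction matters most for small n; for large samples the two estimates are nearly identical.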

Method 2: Test for Normality with Skewness

We can use skewness to test if our data follows a normal distribution:

import numpy as np
from scipy import stats

# Generate random data - simulating student test scores (0-100)
np.random.seed(42)  # For reproducibility
normal_scores = np.random.normal(75, 8, 1000)  # Mean 75, SD 8
uniform_scores = np.random.uniform(50, 100, 1000)  # Uniform distribution between 50-100
bimodal_scores = np.concatenate([np.random.normal(60, 5, 500), np.random.normal(90, 5, 500)])

# Calculate skewness for each distribution
normal_skew = stats.skew(normal_scores)
uniform_skew = stats.skew(uniform_scores)
bimodal_skew = stats.skew(bimodal_scores)

print(f"Normal distribution skewness: {normal_skew:.4f}")
print(f"Uniform distribution skewness: {uniform_skew:.4f}")
print(f"Bimodal distribution skewness: {bimodal_skew:.4f}")

# Test for normality using skewness and kurtosis (D'Agostino's K-squared test)
normal_test = stats.normaltest(normal_scores)
uniform_test = stats.normaltest(uniform_scores)
bimodal_test = stats.normaltest(bimodal_scores)

print(f"\nNormal distribution normality test p-value: {normal_test.pvalue:.4f}")
print(f"Uniform distribution normality test p-value: {uniform_test.pvalue:.4f}")
print(f"Bimodal distribution normality test p-value: {bimodal_test.pvalue:.4f}")


This example demonstrates how different distributions have different skewness values, and how we can use D’Agostino’s K-squared test (which incorporates skewness) to assess normality.
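If you want a test that focuses on skewness alone rather than the combined statistic, SciPy also provides stats.skewtest, which tests the null hypothesis that the sample comes from a population with the skewness of a normal distribution (the test requires at least 8 observations). A short sketch:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
normal_scores = np.random.normal(75, 8, 1000)               # roughly symmetric
skewed_scores = np.random.exponential(scale=10, size=1000)  # strongly right-skewed

# H0: population skewness equals that of a normal distribution (zero)
res_normal = stats.skewtest(normal_scores)
res_skewed = stats.skewtest(skewed_scores)

print(f"Normal data: statistic={res_normal.statistic:.3f}, p={res_normal.pvalue:.4f}")
print(f"Skewed data: statistic={res_skewed.statistic:.3f}, p={res_skewed.pvalue:.4f}")
```

A large positive statistic with a tiny p-value, as for the exponential sample here, is strong evidence of right skew.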


Method 3: Use SciPy’s Skewed Distributions

SciPy also provides built-in skewed distributions that you can use to model data:

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# Generate data from a skewed normal distribution
# Simulating average daily temperatures in New York throughout the year
alpha = -3  # Negative for left skew (colder days are more extreme)
data = stats.skewnorm.rvs(alpha, loc=60, scale=15, size=365)  # loc/scale are location and scale parameters (not the mean/std when alpha != 0)

# Calculate the skewness
skewness = stats.skew(data)
print(f"Skewness of temperature data: {skewness:.4f}")

# Plot the distribution
plt.figure(figsize=(10, 6))
plt.hist(data, bins=20, density=True, alpha=0.7, color='skyblue', edgecolor='black')

# Add the PDF curve
x = np.linspace(min(data), max(data), 1000)
plt.plot(x, stats.skewnorm.pdf(x, alpha, loc=60, scale=15), 'r-', lw=2, 
         label=f'Skew-normal PDF (α={alpha})')

plt.title(f'NYC Daily Temperatures (°F) - Skewness: {skewness:.4f}')
plt.xlabel('Temperature (°F)')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(alpha=0.3)
plt.show()


This example demonstrates how to generate and visualize data from a skew-normal distribution, which can be useful for modeling asymmetric real-world phenomena, such as temperature variations.
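The reverse direction also works: given observed data, skewnorm.fit estimates the distribution's parameters by maximum likelihood. A sketch with simulated data, so the fit can be checked against known true values:

```python
import numpy as np
from scipy import stats

# Simulate data from a skew-normal with known parameters, then recover them
np.random.seed(0)
data = stats.skewnorm.rvs(-3, loc=60, scale=15, size=5000)

# Maximum-likelihood fit returns (shape, loc, scale)
a_hat, loc_hat, scale_hat = stats.skewnorm.fit(data)
print(f"Fitted parameters: a={a_hat:.2f}, loc={loc_hat:.2f}, scale={scale_hat:.2f}")

# Note: when a != 0, loc and scale are NOT the mean and standard deviation;
# ask the distribution object for its actual moments instead
mean, var = stats.skewnorm.stats(a_hat, loc=loc_hat, scale=scale_hat, moments='mv')
print(f"Fitted mean={float(mean):.2f}, std={float(np.sqrt(var)):.2f}")
```

With 5000 observations the fitted shape parameter should land close to the true value of -3, and the fitted mean close to the sample mean.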

Method 4: Handle Skewness in Data Preparation

When working with machine learning models, highly skewed data can cause problems. Here’s how to detect and transform skewed features:

import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt

# Simulating housing data (US real estate prices and features)
np.random.seed(42)
n = 1000
house_prices = np.random.exponential(scale=250000, size=n)  # Highly skewed house prices
square_footage = np.random.normal(loc=2000, scale=500, size=n)  # More normally distributed

# Create a DataFrame
df = pd.DataFrame({
    'Price': house_prices,
    'SquareFeet': square_footage
})

# Calculate skewness before transformation
price_skew_before = stats.skew(df['Price'])
sqft_skew_before = stats.skew(df['SquareFeet'])

print(f"Price skewness before transformation: {price_skew_before:.4f}")
print(f"Square footage skewness before transformation: {sqft_skew_before:.4f}")

# Apply log transformation to the skewed price variable
df['LogPrice'] = np.log1p(df['Price'])  # log1p to handle zero values

# Calculate skewness after transformation
price_skew_after = stats.skew(df['LogPrice'])
print(f"Price skewness after log transformation: {price_skew_after:.4f}")

# Visualize the transformation
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Original price distribution
axs[0].hist(df['Price'], bins=30, color='skyblue', edgecolor='black')
axs[0].set_title(f'House Prices (Skewness: {price_skew_before:.4f})')
axs[0].set_xlabel('Price ($)')
axs[0].set_ylabel('Frequency')

# Log-transformed price distribution
axs[1].hist(df['LogPrice'], bins=30, color='lightgreen', edgecolor='black')
axs[1].set_title(f'Log-transformed House Prices (Skewness: {price_skew_after:.4f})')
axs[1].set_xlabel('Log(Price)')
axs[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

This example demonstrates how to identify and transform skewed features in a dataset, which is a common preprocessing step in data science workflows.
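One practical detail when you model on the log scale: predictions come out in log units and must be mapped back before reporting. np.expm1 is the exact inverse of np.log1p:

```python
import numpy as np

# Round-trip: log1p for modeling, expm1 to return to dollar units
prices = np.array([150000.0, 320000.0, 95000.0])
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

print(recovered)  # matches the original prices up to floating-point precision
```

Forgetting this inverse step is a common source of wildly wrong "predictions" in dollar terms.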


Method 5: Compare Skewness Across Groups

When analyzing data across different groups, comparing their skewness can provide valuable insights:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate income distributions for different US regions
np.random.seed(42)
northeast = np.random.lognormal(mean=11.2, sigma=0.4, size=1000)  # Higher mean, lower spread
south = np.random.lognormal(mean=10.9, sigma=0.5, size=1000)
midwest = np.random.lognormal(mean=11.0, sigma=0.4, size=1000)
west = np.random.lognormal(mean=11.3, sigma=0.6, size=1000)  # Higher mean, higher spread

# Calculate skewness for each region
regions = {
    'Northeast': northeast,
    'South': south,
    'Midwest': midwest,
    'West': west
}

# Calculate and display skewness for each region
for region, data in regions.items():
    skewness = stats.skew(data)
    print(f"{region} income skewness: {skewness:.4f}")

# Visualize with box plots to show distribution shapes
plt.figure(figsize=(12, 6))
plt.boxplot([regions[r] for r in regions.keys()], labels=regions.keys())
plt.title('Income Distribution by US Region')
plt.ylabel('Annual Income ($)')
plt.grid(axis='y', alpha=0.3)
plt.show()

# Create violin plots for more detailed distribution visualization
plt.figure(figsize=(12, 6))
plt.violinplot([regions[r] for r in regions.keys()], showmeans=True, showmedians=True)
plt.xticks(range(1, len(regions)+1), regions.keys())
plt.title('Income Distribution Shape by US Region')
plt.ylabel('Annual Income ($)')
plt.grid(axis='y', alpha=0.3)
plt.show()

This example shows how to compare skewness across different groups (in this case, income distributions across US regions) and visualize the differences.
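If the groups already live in a pandas DataFrame, you can get per-group skewness in one line with groupby. Note that pandas' skew() uses the bias-corrected estimator, matching scipy.stats.skew(..., bias=False) rather than the SciPy default. A sketch with two of the regions above:

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)
df = pd.DataFrame({
    'Region': np.repeat(['Northeast', 'South'], 1000),
    'Income': np.concatenate([
        np.random.lognormal(mean=11.2, sigma=0.4, size=1000),
        np.random.lognormal(mean=10.9, sigma=0.5, size=1000),
    ]),
})

# Per-group skewness; pandas applies the bias-corrected (adjusted) estimator
by_region = df.groupby('Region')['Income'].skew()
print(by_region)
```

This is convenient for exploratory analysis, but be aware of the estimator difference if you compare the numbers against raw stats.skew output.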


Practical Applications of Skewness

Understanding skewness has several practical applications:

  1. Financial Analysis: Asset returns often show skewness, which is crucial for risk management.
  2. Social Science Research: Income and wealth distributions typically have positive skewness.
  3. Machine Learning: Feature engineering often requires transforming skewed variables.
  4. Quality Control: Detecting abnormal distribution shapes in manufacturing processes.
  5. Healthcare: Patient outcome measures may have skewed distributions.

Tips for Working with Skewed Data

Based on my experience, here are some practical tips for handling skewed data:

  1. Always visualize your data first – Histograms and density plots can reveal skewness that summary statistics might miss.
  2. Consider transformations for highly skewed data – Common transformations include:
    • Log transformation: np.log1p(x) (use log1p to handle zeros)
    • Square root: np.sqrt(x) (for moderately skewed positive data)
    • Box-Cox: stats.boxcox(x) (automatically finds optimal transformation parameter)
  3. Be careful with outliers – Sometimes skewness is caused by legitimate outliers that shouldn’t be removed.
  4. Use appropriate statistical tests – For skewed data, non-parametric tests such as the Mann-Whitney U test often work better than t-tests.
  5. Report median and IQR – For skewed distributions, median and interquartile range (IQR) are more robust measures than mean and standard deviation.

Let’s see how to implement the Box-Cox transformation:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate skewed data - simulating US home sales prices in thousands
np.random.seed(42)
home_prices = np.random.lognormal(mean=5.5, sigma=0.6, size=1000)

# Calculate skewness before transformation
skewness_before = stats.skew(home_prices)
print(f"Skewness before transformation: {skewness_before:.4f}")

# Apply Box-Cox transformation
transformed_data, lambda_value = stats.boxcox(home_prices)

# Calculate skewness after transformation
skewness_after = stats.skew(transformed_data)
print(f"Skewness after Box-Cox transformation: {skewness_after:.4f}")
print(f"Optimal lambda value: {lambda_value:.4f}")

# Visualize before and after
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Original data
axs[0].hist(home_prices, bins=30, color='skyblue', edgecolor='black')
axs[0].set_title(f'Home Prices in $1000s (Skewness: {skewness_before:.4f})')
axs[0].set_xlabel('Price ($1000s)')
axs[0].set_ylabel('Frequency')

# Transformed data
axs[1].hist(transformed_data, bins=30, color='lightgreen', edgecolor='black')
axs[1].set_title(f'Box-Cox Transformed Prices (Skewness: {skewness_after:.4f})')
axs[1].set_xlabel('Transformed Price')
axs[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
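As with the log transform, you usually need to map model outputs back to the original scale. SciPy provides scipy.special.inv_boxcox for exactly this:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

np.random.seed(42)
home_prices = np.random.lognormal(mean=5.5, sigma=0.6, size=1000)

# Forward transform returns the transformed data and the fitted lambda
transformed, lmbda = stats.boxcox(home_prices)

# inv_boxcox applies the inverse transform for the same lambda
recovered = inv_boxcox(transformed, lmbda)
print(np.allclose(recovered, home_prices))  # True
```

Keep the fitted lambda value with your model artifacts; without it, transformed predictions cannot be converted back.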


Common Mistakes When Working with Skewness

Through my years of experience with data analysis, I’ve observed several common mistakes when working with skewness:

  1. Assuming all data should be normally distributed – Many real-world variables are naturally skewed, like income or house prices.
  2. Blindly transforming data – Sometimes, the skewness is an important feature of the data that shouldn’t be removed.
  3. Not considering the domain context – Different fields have different expectations for skewness. What’s “too skewed” in one context may be normal in another.
  4. Ignoring the effect of sample size – Small samples can show apparent skewness by chance.
  5. Using parametric tests with highly skewed data – This can lead to incorrect conclusions.
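Point 4 is easy to demonstrate: for samples drawn from a perfectly symmetric normal distribution, the spread of the skewness estimate shrinks roughly like sqrt(6/n). A small simulation (sample sizes chosen arbitrarily for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
results = {}

for n in [10, 100, 10000]:
    # Skewness of 500 independent normal samples of size n (true skewness is 0)
    skews = [stats.skew(rng.normal(size=n)) for _ in range(500)]
    results[n] = np.std(skews)
    print(f"n={n:>6}: spread of skewness estimates = {results[n]:.3f} "
          f"(asymptotic approximation sqrt(6/n) = {np.sqrt(6 / n):.3f})")
```

With n=10, individual samples routinely show skewness of 0.5 or more purely by chance, so a nonzero estimate from a small sample is weak evidence of real asymmetry.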

Advanced Example: Skewness in Time Series Data

Let’s examine how skewness can change over time in financial data:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates

# Simulate daily S&P 500 returns over 5 years
np.random.seed(42)
n_days = 252 * 5  # 5 years of trading days
dates = pd.date_range(start='2018-01-01', periods=n_days, freq='B')

# Simulate returns with changing characteristics
# First part: normal market conditions
returns_1 = np.random.normal(loc=0.0005, scale=0.01, size=n_days//2)

# Second part: crisis period (more negative returns, higher volatility)
returns_2 = np.random.normal(loc=-0.001, scale=0.025, size=n_days//2)
returns_2 = returns_2 - np.abs(np.random.pareto(3, size=n_days//2)) * 0.01  # Add negative skew

# Combine returns
returns = np.concatenate([returns_1, returns_2])

# Create DataFrame
df = pd.DataFrame({
    'Date': dates,
    'Return': returns
})

# Calculate rolling skewness (60-day window)
df['Rolling_Skew'] = df['Return'].rolling(window=60).apply(stats.skew)

# Plot returns and rolling skewness
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

# Plot returns
ax1.plot(df['Date'], df['Return'], color='blue', alpha=0.7)
ax1.set_title('Simulated S&P 500 Daily Returns')
ax1.set_ylabel('Return')
ax1.axvline(x=dates[n_days//2], color='red', linestyle='--', 
           label='Market Regime Change')
ax1.legend()

# Plot rolling skewness
ax2.plot(df['Date'], df['Rolling_Skew'], color='green')
ax2.set_title('60-Day Rolling Skewness')
ax2.set_ylabel('Skewness')
ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax2.axvline(x=dates[n_days//2], color='red', linestyle='--')
ax2.set_xlabel('Date')

# Format x-axis to show years
ax2.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
ax2.xaxis.set_major_locator(mdates.YearLocator())

plt.tight_layout()
plt.show()

# Show skewness statistics for different periods
before_crisis = df.iloc[:n_days//2]
during_crisis = df.iloc[n_days//2:]

print(f"Skewness before crisis: {stats.skew(before_crisis['Return']):.4f}")
print(f"Skewness during crisis: {stats.skew(during_crisis['Return']):.4f}")

This example demonstrates how skewness can be tracked over time to detect changes in market regimes, a valuable technique for risk management in finance.

In my experience, skewness is one of those statistical concepts that might seem academic at first but proves incredibly useful in practical data analysis. Whether you’re analyzing financial returns, income distributions, or performance metrics, understanding and properly handling skewness is essential.

By mastering SciPy’s skew functions and learning when and how to address skewed distributions, you’ll be better equipped to draw accurate conclusions from your data and build more robust models.
