Scipy Stats Zscore: Calculate And Use Z-Score

Recently, I was working on a data analysis project where I needed to identify outliers in a large dataset of customer transactions. The challenge was finding a standardized way to determine which values were significantly different from the rest. This is where the z-score calculation in SciPy came to my rescue.

In this article, I’ll share how to use SciPy’s stats module to calculate z-scores, which helps normalize your data and identify values that deviate from the mean. I’ll cover different methods with practical examples that you can immediately apply to your projects.

What is a Z-Score?

A z-score (or standard score) measures how many standard deviations a data point is away from the mean of the dataset. It helps us understand whether an observation is unusual compared to the rest of the data.

The formula for calculating a z-score is:

z = (x - μ) / σ

Where:

x is the value we’re examining
μ is the mean of the dataset
σ is the standard deviation

A z-score of 0 means the data point equals the mean. A positive z-score indicates the data point is above the mean, while a negative z-score means it’s below the mean.
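To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy. The array is a small made-up example (an assumption for illustration, not real data), chosen so that the mean and standard deviation come out as round numbers.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate the formula
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()     # mean of the dataset
sigma = values.std()   # population standard deviation (NumPy default, ddof=0)

# z = (x - mu) / sigma, applied element-wise
z = (values - mu) / sigma

print("mean:", mu)
print("std:", sigma)
print("z-scores:", z)
```

Here the mean is 5 and the standard deviation is 2, so the value 9 sits two standard deviations above the mean and gets a z-score of 2, while the values equal to 5 get a z-score of 0.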

Calculate Z-Scores in Python Using scipy.stats.zscore

SciPy makes z-score calculation incredibly simple with its stats.zscore() function. Let’s see how to use it with some examples.

Method 1: Basic Z-Score Calculation

Here’s how to calculate z-scores for a simple dataset:

import numpy as np
from scipy import stats

# Sample data: Monthly sales figures ($) for a small business
sales = np.array([5200, 5500, 5100, 6200, 5800, 5700, 5600, 5900, 6100, 7500, 5400, 5300])

# Calculate z-scores
z_scores = stats.zscore(sales)

print("Original sales data:", sales)
print("Z-scores:", np.round(z_scores, 2))

When I ran this, I got:

Original sales data: [5200 5500 5100 6200 5800 5700 5600 5900 6100 7500 5400 5300]
Z-scores: [-0.93 -0.45 -1.1   0.69  0.04 -0.12 -0.28  0.2   0.53  2.8  -0.61 -0.77]

The z-score of 2.80 for the $7,500 value immediately stands out: it is more than 2 standard deviations above the mean, which suggests it might be an outlier.
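One detail worth knowing: by default, stats.zscore() divides by the population standard deviation (ddof=0). If you would rather treat the data as a sample, you can pass ddof=1. The sketch below compares the two on the same sales figures; the variable names are my own, and the manual calculation is only there as a sanity check.

```python
import numpy as np
from scipy import stats

sales = np.array([5200, 5500, 5100, 6200, 5800, 5700, 5600, 5900, 6100, 7500, 5400, 5300])

z_population = stats.zscore(sales)        # default: ddof=0 (population std)
z_sample = stats.zscore(sales, ddof=1)    # sample std (divides by n - 1)

# Manual check of the default behaviour
manual = (sales - sales.mean()) / sales.std(ddof=0)

print("ddof=0:", np.round(z_population, 2))
print("ddof=1:", np.round(z_sample, 2))
print("Matches manual formula:", np.allclose(z_population, manual))
```

Because the sample standard deviation is slightly larger, the ddof=1 z-scores are slightly smaller in magnitude. The difference shrinks as the dataset grows.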

Method 2: Z-Scores with Pandas DataFrames

When working with real datasets, you’ll often use pandas. Here’s how to apply z-scores to a DataFrame:

import pandas as pd
from scipy import stats

# Create a DataFrame with some demographic data
data = pd.DataFrame({
    'Age': [28, 32, 45, 30, 25, 65, 27, 35, 44, 33],
    'Income': [55000, 67000, 85000, 62000, 48000, 95000, 52000, 72000, 81000, 68000],
    'YearsExperience': [2, 5, 15, 4, 1, 30, 3, 7, 16, 6]
})

# Calculate z-scores for each column
z_scores = pd.DataFrame({
    'Age_zscore': stats.zscore(data['Age']),
    'Income_zscore': stats.zscore(data['Income']),
    'YearsExperience_zscore': stats.zscore(data['YearsExperience'])
})

# Join the original data with z-scores
result = pd.concat([data, z_scores], axis=1)
print(result.round(2))

This approach is super helpful when analyzing multiple variables at once.
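If you have many numeric columns, writing one line per column gets tedious. Here is one possible sketch that standardizes every column in a single call by passing the underlying NumPy array to stats.zscore() with axis=0. It assumes every column is numeric, and the '_zscore' suffix simply mirrors the naming used above.

```python
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'Age': [28, 32, 45, 30, 25, 65, 27, 35, 44, 33],
    'Income': [55000, 67000, 85000, 62000, 48000, 95000, 52000, 72000, 81000, 68000],
    'YearsExperience': [2, 5, 15, 4, 1, 30, 3, 7, 16, 6]
})

# Standardize all columns at once: axis=0 computes z-scores down each column
z_values = stats.zscore(data.to_numpy(dtype=float), axis=0)

# Wrap the result back into a DataFrame with matching index and suffixed names
z_scores = pd.DataFrame(
    z_values,
    columns=[f"{col}_zscore" for col in data.columns],
    index=data.index,
)

result = data.join(z_scores)
print(result.round(2))
```

Each z-score column ends up with a mean of 0 and a population standard deviation of 1, which is a quick way to check that the standardization ran along the axis you intended.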

Method 3: Handle NaN Values in Z-Score Calculation

Sometimes your data might contain missing values. SciPy can handle this with the nan_policy parameter:

import numpy as np
from scipy import stats

# Dataset with missing values (American city temperatures in Fahrenheit)
temperatures = np.array([72.5, 68.3, np.nan, 75.2, 71.8, np.nan, 69.7, 73.1])

# Calculate z-scores while ignoring NaN values
z_scores = stats.zscore(temperatures, nan_policy='omit')

print("Temperatures:", temperatures)
print("Z-scores:", z_scores)

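The nan_policy='omit' parameter tells SciPy to ignore NaN values when computing the mean and standard deviation. The NaN entries themselves stay NaN in the output, so the result keeps the same shape as the input.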
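To see why the parameter matters, this short sketch compares the default behaviour (nan_policy='propagate') with 'omit' on the same temperatures. The extra variable names are mine, added for the comparison.

```python
import numpy as np
from scipy import stats

temperatures = np.array([72.5, 68.3, np.nan, 75.2, 71.8, np.nan, 69.7, 73.1])

# Default policy: a single NaN makes the mean NaN, so every z-score is NaN
z_default = stats.zscore(temperatures)

# 'omit': NaNs are skipped when computing the mean and standard deviation
z_omit = stats.zscore(temperatures, nan_policy='omit')

# Equivalent manual approach: z-score only the valid readings
valid_mask = ~np.isnan(temperatures)
z_valid_only = stats.zscore(temperatures[valid_mask])

print("Default policy, all NaN:", np.isnan(z_default).all())
print("NaN positions preserved:", np.array_equal(np.isnan(z_omit), ~valid_mask))
print("Same values as dropping NaNs first:", np.allclose(z_omit[valid_mask], z_valid_only))
```

The 'omit' policy is handy when you need the output to stay aligned with the original rows, for example when writing the z-scores back into a DataFrame column.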

Method 4: Specify Axis for Multi-dimensional Arrays

When working with multi-dimensional data, you can specify which axis to calculate the z-score along:

import numpy as np
from scipy import stats

# 2D array: student scores in different subjects
# Rows represent students, columns represent subjects (Math, English, Science)
scores = np.array([
    [85, 90, 88],  # Student 1
    [92, 85, 95],  # Student 2
    [78, 82, 84],  # Student 3
    [96, 91, 93]   # Student 4
])

# Calculate z-scores across subjects (axis=1)
z_by_student = stats.zscore(scores, axis=1)
print("Z-scores by student (across subjects):")
print(np.round(z_by_student, 2))

# Calculate z-scores across students (axis=0)
z_by_subject = stats.zscore(scores, axis=0)
print("\nZ-scores by subject (across students):")
print(np.round(z_by_subject, 2))

Setting axis=1 normalizes each student’s scores across their subjects, while axis=0 compares how students performed in each subject.
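A quick way to confirm you picked the right axis is to check where the means become 0 and the standard deviations become 1. The sketch below does that for the same scores array, and also shows axis=None, which standardizes every value against the whole array.

```python
import numpy as np
from scipy import stats

scores = np.array([
    [85, 90, 88],  # Student 1
    [92, 85, 95],  # Student 2
    [78, 82, 84],  # Student 3
    [96, 91, 93]   # Student 4
])

z_by_subject = stats.zscore(scores, axis=0)   # each column standardized
z_by_student = stats.zscore(scores, axis=1)   # each row standardized
z_overall = stats.zscore(scores, axis=None)   # whole array standardized together

# After standardizing along an axis, the mean along that axis is 0 and the std is 1
print("Per-subject means:", np.allclose(z_by_subject.mean(axis=0), 0))
print("Per-subject stds:", np.allclose(z_by_subject.std(axis=0), 1))
print("Per-student means:", np.allclose(z_by_student.mean(axis=1), 0))
print("Per-student stds:", np.allclose(z_by_student.std(axis=1), 1))
print("Overall mean and std:", np.isclose(z_overall.mean(), 0), np.isclose(z_overall.std(), 1))
```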

Identify Outliers Using Z-Scores

One of the most practical applications of z-scores is detecting outliers. A common threshold is to consider data points with |z-score| > 3 as outliers:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# House prices in a neighborhood (in thousands of dollars)
house_prices = np.array([320, 345, 290, 310, 330, 365, 580, 325, 340, 315, 305, 325])

# Calculate z-scores
z_scores = stats.zscore(house_prices)

# Identify outliers
outliers = np.abs(z_scores) > 3
print("Outliers found:", house_prices[outliers])

# Visualize the data with outliers highlighted
plt.figure(figsize=(10, 6))
plt.scatter(range(len(house_prices)), house_prices, c=['red' if x else 'blue' for x in outliers])
plt.axhline(y=np.mean(house_prices), color='green', linestyle='-', label='Mean')
plt.title('House Prices with Outliers Highlighted')
plt.ylabel('Price (thousands $)')
plt.xlabel('House Index')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In this example, any house priced more than 3 standard deviations from the mean is flagged as an outlier. Here that is only the $580K house, whose z-score is about 3.21.
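One caveat with a fixed threshold: in a small dataset, z-scores cannot get arbitrarily large. With n values and the default population standard deviation, |z| can never exceed √(n − 1) (this is Samuelson's inequality), so a cutoff of 3 cannot be crossed with 10 or fewer points. The sketch below illustrates this with a made-up array; max_abs_zscore is a hypothetical helper name I introduced for the example.

```python
import numpy as np
from scipy import stats

def max_abs_zscore(n):
    """Upper bound on |z| for n values when the population std (ddof=0) is used."""
    return np.sqrt(n - 1)

# Hypothetical data: nine identical values and one extreme value
small_sample = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 1000])

z = stats.zscore(small_sample)
largest = np.abs(z).max()
bound = max_abs_zscore(len(small_sample))

print("Largest |z| observed:", round(float(largest), 4))
print("Theoretical bound:   ", round(float(bound), 4))
```

Even though 1000 is wildly different from the other values, its z-score only reaches the bound of 3. For small samples, consider a lower threshold or a more robust method.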

When to Use Modified Z-Scores

The standard z-score can be influenced by extreme outliers. For more robust outlier detection, you might want to use modified z-scores, which replace the mean with the median and the standard deviation with the median absolute deviation (MAD):

import numpy as np
from scipy import stats

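def modified_zscore(data):
    # For normally distributed data, MAD is about 0.6745 * standard deviation,
    # so this constant puts modified z-scores on the same scale as standard ones
    median = np.median(data)
    mad = stats.median_abs_deviation(data)
    return 0.6745 * (data - median) / mad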

# Stock price daily changes (%)
stock_changes = np.array([0.5, 0.7, -0.3, 0.2, -0.5, 0.4, -8.5, 0.6, 0.3, -0.4])

# Calculate standard z-scores
std_zscores = stats.zscore(stock_changes)

# Calculate modified z-scores
mod_zscores = modified_zscore(stock_changes)

# Compare results
for x, std_z, mod_z in zip(stock_changes, std_zscores, mod_zscores):
    print(f"Value: {x:5.1f}, Standard Z: {std_z:5.2f}, Modified Z: {mod_z:5.2f}")

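I’ve found that modified z-scores work better when your data might contain extreme outliers, because an outlier skews the mean and inflates the very standard deviation it is measured against. In this example the -8.5% drop gets a standard z-score of only about -2.96, while its modified z-score is about -14.75.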
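SciPy can also apply the normal-consistency scaling for you: passing scale='normal' to stats.median_abs_deviation() divides the MAD by roughly 0.6745, which is equivalent to multiplying by the constant by hand. The sketch below uses that option and then flags values with a cutoff of 3.5. That cutoff is a commonly cited rule of thumb for modified z-scores and is my assumption here, so tune it for your data.

```python
import numpy as np
from scipy import stats

stock_changes = np.array([0.5, 0.7, -0.3, 0.2, -0.5, 0.4, -8.5, 0.6, 0.3, -0.4])

median = np.median(stock_changes)

# scale='normal' rescales the MAD to be comparable to a standard deviation
mad_normal = stats.median_abs_deviation(stock_changes, scale='normal')

mod_z = (stock_changes - median) / mad_normal

# Assumed rule-of-thumb cutoff for modified z-scores
threshold = 3.5
flagged = stock_changes[np.abs(mod_z) > threshold]

print("Modified z-scores:", np.round(mod_z, 2))
print("Flagged values:", flagged)
```

On this data only the -8.5 value is flagged. Note that the function divides by the MAD, so it breaks down if more than half of the values are identical (the MAD becomes 0).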

Practical Applications of Z-Scores

Z-scores are incredibly versatile. Here are some real-world applications I’ve used them for:

Anomaly detection in network traffic data
Standardizing features before applying machine learning algorithms
Comparing performances across different metrics (like comparing sales across different product categories)
Creating personalized recommendation systems by normalizing user preferences
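As a sketch of the feature-standardization use case, the example below z-scores a training set and then reuses the training mean and standard deviation on new data, so both sets end up on the same scale. The data is randomly generated purely for illustration, and names like X_train and X_test are my own assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic feature matrices (rows = samples, columns = features), for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))
X_test = rng.normal(loc=50, scale=10, size=(20, 3))

# Learn the scaling parameters from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Training set: identical to stats.zscore(X_train, axis=0)
X_train_z = (X_train - mu) / sigma

# Test set: reuse the training statistics instead of recomputing them
X_test_z = (X_test - mu) / sigma

print("Matches stats.zscore on training data:",
      np.allclose(X_train_z, stats.zscore(X_train, axis=0)))
print("Test set shape:", X_test_z.shape)
```

Calling stats.zscore() separately on the test set would standardize it with its own mean and standard deviation, which is usually not what you want when a model was fitted on the training scale.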

For example, when building a fraud detection system, I used z-scores to flag unusual transactions:

import pandas as pd
from scipy import stats

# Transaction amounts by customer
transactions = pd.DataFrame({
    'customer_id': [101, 101, 101, 102, 102, 103, 103, 103, 103, 104],
    'amount': [120, 135, 1500, 85, 90, 250, 275, 260, 240, 175]
})

# Score each amount against the same customer's other transactions.
# transform() returns values aligned with the original rows.
transactions['zscore'] = transactions.groupby('customer_id')['amount'].transform(
    lambda s: stats.zscore(s.to_numpy())
)
transactions['is_unusual'] = transactions['zscore'].abs() > 2

print(transactions)

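This approach scores each transaction against that customer’s own history, not against all transactions. It only works once each customer has enough history, though: with n transactions and the default population standard deviation, |z| can never exceed √(n − 1). In this toy dataset no customer has more than four transactions, so even the $1,500 purchase reaches a z-score of only about 1.41 and nothing is flagged at the threshold of 2. A customer with a single transaction, like 104, gets NaN.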

I hope you found this article helpful. Z-scores are one of those simple statistical tools that can be surprisingly powerful in data analysis. Whether you’re cleaning data, finding outliers, or standardizing features for machine learning, the scipy.stats.zscore() function makes it easy to implement in your Python projects.

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers.