Python SciPy Stats Multivariate

Recently, I was working on a data science project analyzing customer behavior patterns across multiple variables. I needed to generate multivariate normal distributions to simulate complex, real-world data scenarios.

The challenge? Creating correlated random variables that accurately represent interconnected features like age, income, and purchasing habits.

SciPy’s multivariate_normal function came to my rescue. In this article, I’ll share practical approaches to using this powerful statistical tool for your data science work.

Table of Contents

Multivariate Normal Distribution

A multivariate normal distribution is essentially an extension of the one-dimensional normal distribution to multiple dimensions.

Unlike a simple normal distribution that deals with a single variable, a multivariate normal distribution handles multiple variables that can be correlated with each other.

This makes it perfect for real-world data modeling where variables rarely exist in isolation.

Read Python SciPy Stats Mode

Get Started with SciPy’s multivariate_normal

Before diving into examples, make sure you have the necessary packages installed:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

The multivariate_normal function is part of SciPy’s stats module, making it accessible through scipy.stats.multivariate_normal.

Method 1: Create a Basic Multivariate Normal Distribution

Let’s start with a simple example of creating a bivariate normal distribution:

# Define mean vector and covariance matrix
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]  # Positive correlation of 0.5

# Create the multivariate normal distribution
mv_norm = stats.multivariate_normal(mean=mean, cov=cov)

Here, I’ve created a distribution where two variables correlate 0.5. This could represent something like the relationship between years of education and income in a dataset.

Check out Python SciPy Eigenvalues

Method 2: Sample from the Distribution

One of the most common uses is generating random samples:

# Generate 1000 random samples
samples = mv_norm.rvs(size=1000)

# Plot the samples
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.5)
plt.axis('equal')
plt.title('1000 samples from bivariate normal distribution')
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.grid(True)
plt.show()

I executed the above example code and added the screenshot below.

This code generates 1000 data points that follow our specified distribution. The scatter plot will visually show the correlation between the two variables.

Read Python SciPy Kdtree

Method 3: Calculate Probability Density Function (PDF)

To find the probability density at specific points:

# Create a grid of points
x, y = np.mgrid[-3:3:.1, -3:3:.1]
pos = np.dstack((x, y))

# Calculate PDF values
pdf_values = mv_norm.pdf(pos)

# Plot the PDF
plt.contourf(x, y, pdf_values)
plt.colorbar()
plt.title('Probability Density Function')
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.show()

I executed the above example code and added the screenshot below.

This visualization helps understand where values are most likely to occur in your distribution.

Read Python SciPy Stats Skew

Method 4: Work with Higher Dimensions

Multivariate normal isn’t limited to just two dimensions:

# 3-dimensional example
mean_3d = [1, 2, 3]
cov_3d = [[1, 0.2, 0.4], [0.2, 2, 0.3], [0.4, 0.3, 3]]

mv_norm_3d = stats.multivariate_normal(mean=mean_3d, cov=cov_3d)

# Generate samples
samples_3d = mv_norm_3d.rvs(size=500)

# 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(samples_3d[:, 0], samples_3d[:, 1], samples_3d[:, 2], alpha=0.6)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.title('3D Multivariate Normal Distribution Samples')
plt.show()

I executed the above example code and added the screenshot below.

This approach is particularly useful for modeling complex systems like stock market movements across multiple sectors or customer preferences across several product features.

Check out Python SciPy Stats Poisson

Method 5: Use CDF for Probability Calculations

The Cumulative Distribution Function (CDF) helps calculate probabilities within regions:

# Calculate probability that both variables are less than 1
probability = mv_norm.cdf([1, 1])
print(f"Probability that both variables are less than 1: {probability:.4f}")

# Custom region probability
lower_bound = [-0.5, -0.5]
upper_bound = [0.5, 0.5]

# Calculate probability in square region
prob_region = mv_norm.cdf(upper_bound) - mv_norm.cdf(lower_bound)
print(f"Probability in the square region: {prob_region:.4f}")

This is helpful for questions like “What’s the probability of a customer being between 25-35 years old AND spending between $50-$100?”

Read Python SciPy Stats Norm

Method 6: Conditional Distributions

We can extract conditional distributions, which is particularly useful in Bayesian statistics:

# Original bivariate normal
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # Strong positive correlation
mv_norm = stats.multivariate_normal(mean=mean, cov=cov)

# Given that the first variable equals 1, what's the distribution of the second?
# For bivariate normal, conditional mean is:
x1 = 1
cond_mean = mean[1] + cov[1][0]/cov[0][0] * (x1 - mean[0])
# Conditional variance:
cond_var = cov[1][1] - cov[1][0]**2/cov[0][0]

print(f"Conditional mean of second variable given first = 1: {cond_mean}")
print(f"Conditional variance of second variable: {cond_var}")

# Plot original and conditional distributions
x = np.linspace(-3, 3, 100)
plt.figure(figsize=(10, 6))

# Original marginal distribution of second variable
plt.plot(x, stats.norm(mean[1], np.sqrt(cov[1][1])).pdf(x), 
         'b-', label='Original marginal')

# Conditional distribution
plt.plot(x, stats.norm(cond_mean, np.sqrt(cond_var)).pdf(x), 
         'r--', label='Conditional on x1=1')

plt.legend()
plt.title('Effect of Conditioning on Multivariate Normal')
plt.xlabel('Value of second variable')
plt.ylabel('Probability density')
plt.grid(True)
plt.show()

This technique is invaluable for predictive modeling when you already know some of your variables.

Check out Python SciPy Gamma

Real-World Application: Stock Market Returns

Let’s apply multivariate normal to model daily returns of three major US stock indices:

# Example mean daily returns (percentages)
mean_returns = [0.03, 0.02, 0.025]  # S&P 500, Dow Jones, NASDAQ

# Example covariance matrix (made up but realistic)
cov_returns = [
    [0.01, 0.008, 0.009],   # S&P variance and covariances
    [0.008, 0.012, 0.007],  # Dow Jones variance and covariances
    [0.009, 0.007, 0.015]   # NASDAQ variance and covariances
]

# Create market returns model
market_model = stats.multivariate_normal(mean=mean_returns, cov=cov_returns)

# Simulate 252 trading days (1 year)
annual_returns = market_model.rvs(size=252)

# Calculate cumulative returns
cumulative_returns = np.cumprod(1 + annual_returns/100, axis=0)

# Plot the simulated market performance
plt.figure(figsize=(12, 6))
plt.plot(cumulative_returns[:, 0], label='S&P 500')
plt.plot(cumulative_returns[:, 1], label='Dow Jones')
plt.plot(cumulative_returns[:, 2], label='NASDAQ')
plt.title('Simulated 1-Year Market Performance')
plt.xlabel('Trading Day')
plt.ylabel('Cumulative Return (starting at 1.0)')
plt.legend()
plt.grid(True)
plt.show()

This simulation captures both the individual behavior of each index and its interrelationships, making it more realistic than independent models.

Read Python SciPy ttest_ind

When to Use Multivariate Normal in Your Projects

I’ve found multivariate normal distributions particularly useful in these scenarios:

Financial modeling – Simulating correlated asset returns for portfolio analysis
Customer segmentation – Generating realistic test data with proper correlations between variables
Machine learning – Creating synthetic training data when real data is limited
Risk assessment – Modeling multiple interconnected risk factors

The key advantage is preserving relationships between variables, which single-variable approaches miss entirely.

I hope you found this article helpful for implementing multivariate normal distributions in your Python projects. Whether you’re in finance, data science, or research, this powerful statistical tool can significantly enhance your modeling capabilities.