Python SciPy Pairwise Distance – Compute Distances Between Point Sets

Recently, I was working on a machine learning project where I needed to calculate distances between multiple data points. The challenge was finding an efficient way to compute these distances without writing complex loops. That’s when I discovered SciPy’s pairwise distance functionality.

In this article, I’ll share how to use SciPy’s spatial distance functions to calculate pairwise distances between observations in your datasets. I’ll cover multiple methods and distance metrics that you can use for various applications.

So let’s dive in!

What is Pairwise Distance?

Pairwise distance is simply the distance between pairs of observations or points in your dataset. It’s a fundamental calculation used in:

  • Clustering algorithms
  • Nearest neighbor searches
  • Feature extraction
  • Similarity measurements
  • Anomaly detection

SciPy makes these calculations simple with its scipy.spatial.distance module, which offers various functions to compute distances between vectors, points, or matrices.
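
For a first taste of the module, the single-pair helpers (such as euclidean and cityblock) work on two individual points; here's a minimal sketch before we move on to full distance matrices:

import numpy as np
from scipy.spatial import distance

# Distance between two individual points
u = np.array([0, 0])
v = np.array([3, 4])

print(distance.euclidean(u, v))   # 5.0
print(distance.cityblock(u, v))   # 7 (Manhattan distance)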

Method 1: Use scipy.spatial.distance.pdist

SciPy’s pdist function calculates pairwise distances between observations in a single dataset. It’s perfect when you need a distance matrix for points within the same set.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Create a sample dataset (3 points in 2D space)
X = np.array([[0, 0], [1, 0], [0, 1]])

# Calculate pairwise distances
distances = pdist(X)
print("Condensed distance matrix:")
print(distances)

# Convert to a square-form distance matrix
square_distances = squareform(distances)
print("\nSquare-form distance matrix:")
print(square_distances)

The output will be:

Condensed distance matrix:
[1.         1.         1.41421356]

Square-form distance matrix:
[[0.         1.         1.        ]
 [1.         0.         1.41421356]
 [1.         1.41421356 0.        ]]

Note that pdist returns a condensed distance matrix – a 1D array that contains the upper triangular part of the distance matrix. You can convert it to a square matrix using squareform.
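
If you only need the distance between a specific pair of points (i, j), you can index into the condensed array directly; the sketch below uses a small hypothetical helper that maps a pair of point indices to a condensed-array position (continuing with X and distances from above):

# Hypothetical helper: position of pair (i, j) in the condensed array for n points
def condensed_index(n, i, j):
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)

n = len(X)
print(distances[condensed_index(n, 1, 2)])  # distance between points 1 and 2 -> 1.41421356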

Method 2: Use scipy.spatial.distance.cdist

When you need to calculate distances between points in two different datasets, cdist is the function to use:

from scipy.spatial.distance import cdist

# Create two datasets
X = np.array([[0, 0], [1, 0], [0, 1]])
Y = np.array([[1, 1], [2, 2]])

# Calculate distances between points in X and Y
distances = cdist(X, Y)
print("Distances between X and Y:")
print(distances)

The output will be:

Distances between X and Y:
[[1.41421356 2.82842712]
 [1.         2.23606798]
 [1.         2.23606798]]

Each row represents a point in X, each column a point in Y, and each value is the distance between those points.
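
A common follow-up is a nearest-neighbor lookup: for each point in X, find the closest point in Y. A minimal sketch using the cdist result above:

# Index of the closest point in Y for every row (point) of X
nearest = distances.argmin(axis=1)
print(nearest)  # [0 0 0] -> point [1, 1] is closest to all three points in X

# The corresponding nearest-neighbor distances
print(distances[np.arange(len(X)), nearest])  # [1.41421356 1. 1.]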

Method 3: Use scipy.spatial.distance_matrix

For a simpler interface, specifically for Euclidean distance, you can use distance_matrix:

from scipy.spatial import distance_matrix

# Create a dataset
X = np.array([[0, 0], [1, 0], [0, 1]])

# Calculate the distance matrix
dm = distance_matrix(X, X)
print("Distance matrix:")
print(dm)

The output will be:

Distance matrix:
[[0.         1.         1.        ]
 [1.         0.         1.41421356]
 [1.         1.41421356 0.        ]]

This produces a square distance matrix similar to squareform(pdist(X)).
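
A quick sanity check (using pdist and squareform imported earlier) confirms the two approaches agree:

# distance_matrix(X, X) matches the square form of pdist(X)
print(np.allclose(dm, squareform(pdist(X))))  # True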

Method 4: Custom Distance Functions

Sometimes you might need a custom distance metric. SciPy allows you to define your own:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Custom weighted Euclidean distance (the first feature counts more than the second)
def weighted_euclidean(u, v, weights=np.array([0.8, 0.2])):
    return np.sqrt(np.sum(weights * ((u - v) ** 2)))

# Create a dataset
X = np.array([[0, 0], [1, 0], [0, 1]])

# Calculate pairwise distances with custom metric
distances = pdist(X, metric=weighted_euclidean)
print("Custom weighted distances:")
print(distances)
print("Square form:")
print(squareform(distances))

This is particularly useful when certain dimensions or features should have more influence on the distance calculation than others.
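
As a side note, recent SciPy versions also accept a w (weights) keyword for many built-in metrics; assuming your version supports it, this should match the custom function above:

# Built-in weighted Euclidean via the w keyword (newer SciPy versions)
weights = np.array([0.8, 0.2])
builtin_weighted = pdist(X, metric='euclidean', w=weights)
print(builtin_weighted)
print(np.allclose(builtin_weighted, distances))  # should be True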

Different Distance Metrics

Both pdist and cdist support various distance metrics. Here are some common ones:

# Euclidean distance (default)
euclidean_dist = cdist(X, Y, metric='euclidean')

# Manhattan distance
manhattan_dist = cdist(X, Y, metric='cityblock')

# Cosine distance
cosine_dist = cdist(X, Y, metric='cosine')

# Minkowski distance with p=3
minkowski_dist = cdist(X, Y, metric='minkowski', p=3)

print("Euclidean distances:")
print(euclidean_dist)
print("\nManhattan distances:")
print(manhattan_dist)
print("\nCosine distances:")
print(cosine_dist)
print("\nMinkowski distances (p=3):")
print(minkowski_dist)

SciPy supports over 20 different distance metrics, making it incredibly versatile for different applications.
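
For binary or categorical feature vectors, metrics like Hamming, Jaccard, and Chebyshev are often more appropriate; here's a small sketch on illustrative binary data (not the X and Y used above):

# Illustrative binary feature vectors
A = np.array([[1, 0, 1, 1], [0, 1, 1, 0]])
B = np.array([[1, 1, 1, 0]])

print(cdist(A, B, metric='hamming'))    # fraction of positions that differ
print(cdist(A, B, metric='jaccard'))    # dissimilarity of the nonzero positions
print(cdist(A, B, metric='chebyshev'))  # largest coordinate-wise difference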

Real-World Example: Customer Segmentation

Let’s look at a practical example. Say you’re analyzing customer data for a US retail chain and want to segment customers based on their purchasing behavior:

import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Sample customer data: [annual_spend, visit_frequency]
customers = np.array([
    [120, 4],   # Customer 1: spends $120, visits 4 times monthly
    [600, 12],  # Customer 2: spends $600, visits 12 times monthly
    [350, 8],   # Customer 3: spends $350, visits 8 times monthly
    [100, 2],   # Customer 4: spends $100, visits 2 times monthly
    [700, 15],  # Customer 5: spends $700, visits 15 times monthly
    [550, 10]   # Customer 6: spends $550, visits 10 times monthly
])

# Normalize the data
customers_normalized = (customers - customers.mean(axis=0)) / customers.std(axis=0)

# Calculate distance matrix
dist_matrix = squareform(pdist(customers_normalized, metric='euclidean'))

# Perform hierarchical clustering
clustering = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
clusters = clustering.fit_predict(dist_matrix)

# Visualize the results
plt.figure(figsize=(10, 6))
colors = ['r', 'g', 'b']
for i, cluster in enumerate(np.unique(clusters)):
    plt.scatter(
        customers[clusters == cluster, 0],
        customers[clusters == cluster, 1],
        s=100, c=colors[i], label=f'Cluster {i+1}'
    )

plt.xlabel('Annual Spend ($)')
plt.ylabel('Monthly Visit Frequency')
plt.title('Customer Segmentation')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

This example showcases how pairwise distances can help segment customers into distinct groups based on their spending behavior and visit frequency.

Performance Considerations

When working with large datasets, performance becomes critical. Here are some tips:

  1. Use pdist instead of nested Python loops; it’s implemented in compiled code and far faster (see the sketch after this list)
  2. For very large datasets, consider using approximate nearest neighbor algorithms
  3. When appropriate, use distance metrics that can be computed more efficiently (like Manhattan vs. Euclidean)
  4. For extremely large datasets, you might need to use libraries like FAISS or Annoy
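
To illustrate point 1, here is a rough sketch comparing pdist against an explicit double loop; exact timings will vary by machine, but the vectorized call is typically much faster:

import time
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
data = rng.random((1000, 10))

# Vectorized pairwise distances
start = time.perf_counter()
fast = pdist(data)
print(f"pdist: {time.perf_counter() - start:.4f} s")

# Naive nested loops computing the same condensed distances
start = time.perf_counter()
slow = [np.linalg.norm(data[i] - data[j])
        for i in range(len(data)) for j in range(i + 1, len(data))]
print(f"loops: {time.perf_counter() - start:.4f} s")

print(np.allclose(fast, slow))  # True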

I hope you found this article helpful. Pairwise distance calculations are fundamental to many data science and machine learning tasks, and SciPy provides excellent tools to perform these calculations efficiently. Whether you’re clustering customer data, finding similar documents, or building recommendation systems, understanding these distance functions will serve you well.
