Understand SciPy Spatial Distance Cdist

Recently, I was working on a machine learning project where I needed to calculate distances between multiple data points efficiently. The challenge was computing distances between two collections of inputs without writing nested loops. That’s when I discovered SciPy’s spatial distance cdist function.

In this article, I’ll share how to use the cdist function to calculate pairwise distances between points in different datasets. This powerful tool has saved me countless hours of coding and significantly improved my data analysis workflow.

Let’s dive in and explore this essential SciPy functionality together!

What is SciPy’s Spatial Distance Cdist?

The cdist function is part of SciPy’s scipy.spatial.distance module. It computes the distance between each pair of observations drawn from two collections of input vectors.

Unlike manual implementations, cdist is highly optimized and can handle various distance metrics efficiently.

Here’s the basic syntax:

from scipy.spatial.distance import cdist
result = cdist(XA, XB, metric='euclidean')

Where:

  • XA: First collection of input vectors
  • XB: Second collection of input vectors
  • metric: The distance metric to use

The output is a distance matrix where each element represents the distance between points from the first and second collections.
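To make the shapes concrete, here is a minimal sketch (my own illustrative arrays, not from any particular dataset): if XA has m rows and XB has n rows, each with the same number of columns, the result is an m-by-n matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist

XA = np.zeros((3, 2))   # 3 points in 2-D
XB = np.ones((4, 2))    # 4 points in 2-D

D = cdist(XA, XB)       # Euclidean by default
print(D.shape)          # (3, 4): one distance per (row of XA, row of XB) pair
```

Every entry here is the distance from a zero vector to a ones vector in 2-D, i.e. sqrt(2).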


Set Up Your Environment

Before we begin using the cdist function, let’s make sure we have the necessary packages installed:

# Install required packages
# pip install numpy scipy matplotlib

# Import necessary libraries
import numpy as np
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

Use Cdist with Euclidean Distance (Default Metric)

The Euclidean distance is the most common metric and is the default for the cdist function.

Let’s see a practical example using the coordinates of popular tourist attractions in New York City:

# Coordinates of NYC attractions (latitude, longitude)
locations_a = np.array([
    [40.7128, -74.0060],  # NYC Downtown
    [40.7484, -73.9857],  # Empire State Building
    [40.7580, -73.9855]   # Times Square
])

locations_b = np.array([
    [40.7061, -73.9969],  # Brooklyn Bridge
    [40.7527, -73.9772],  # Grand Central
    [40.7516, -73.9776],  # New York Public Library
    [40.7794, -73.9632]   # Metropolitan Museum of Art
])

# Calculate Euclidean distances
distances = cdist(locations_a, locations_b)
print("Euclidean distances between NYC locations:")
print(distances)

Output:

Euclidean distances between NYC locations:
[[0.01130044 0.04920823 0.04808326 0.07916691]
 [0.04375763 0.00952575 0.00870919 0.0383047 ]
 [0.05313728 0.00984784 0.0101671  0.03090712]]


The resulting matrix shows the Euclidean distance between each pair of locations. Note that these values are in raw degrees of latitude and longitude rather than physical distances, but the relative magnitudes are still useful for tasks like planning a tourist route.
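As a quick sanity check of the first entry of that matrix, we can recompute it by hand from the two coordinate pairs:

```python
import numpy as np
from scipy.spatial.distance import cdist

downtown = np.array([[40.7128, -74.0060]])         # NYC Downtown
brooklyn_bridge = np.array([[40.7061, -73.9969]])  # Brooklyn Bridge

d = cdist(downtown, brooklyn_bridge)[0, 0]
manual = np.sqrt((40.7128 - 40.7061) ** 2 + (-74.0060 - (-73.9969)) ** 2)
print(round(d, 8))   # 0.01130044, matching the first entry above
```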

Cdist with Different Distance Metrics

One of the most useful features of cdist is its support for multiple distance metrics. Here are some common ones:


1. Manhattan Distance (Cityblock)

Perfect for grid-like urban environments like Manhattan, where movement is restricted to perpendicular directions:

manhattan_distances = cdist(locations_a, locations_b, metric='cityblock')
print("Manhattan distances between NYC locations:")
print(manhattan_distances)

Output:

Manhattan distances between NYC locations:
[[0.0158 0.0687 0.0672 0.1094]
 [0.0535 0.0128 0.0113 0.0535]
 [0.0633 0.0136 0.0143 0.0437]]

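The cityblock metric is simply the sum of absolute coordinate differences. We can verify the first entry of the matrix above against that definition:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[40.7128, -74.0060]])   # NYC Downtown
b = np.array([[40.7061, -73.9969]])   # Brooklyn Bridge

d = cdist(a, b, metric='cityblock')[0, 0]
manual = abs(40.7128 - 40.7061) + abs(-74.0060 - (-73.9969))
print(round(d, 4))   # 0.0158, matching the first entry above
```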

2. Cosine Distance

Ideal for comparing document vectors in natural language processing:

# Example document vectors (word frequencies)
document_a = np.array([
    [5, 0, 3, 1],  # Document 1
    [2, 4, 0, 2]   # Document 2
])

document_b = np.array([
    [1, 1, 2, 0],  # Document 3
    [3, 2, 0, 1]   # Document 4
])

cosine_distances = cdist(document_a, document_b, metric='cosine')
print("Cosine distances between documents:")
print(cosine_distances)
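Cosine distance is defined as 1 minus the cosine similarity of the two vectors, so documents pointing in similar directions get a small distance regardless of their lengths. A quick check of one pair from the arrays above:

```python
import numpy as np
from scipy.spatial.distance import cdist

u = np.array([[5, 0, 3, 1]])   # Document 1
v = np.array([[1, 1, 2, 0]])   # Document 3

d = cdist(u, v, metric='cosine')[0, 0]
cos_sim = np.dot(u[0], v[0]) / (np.linalg.norm(u[0]) * np.linalg.norm(v[0]))
# cdist's cosine metric is 1 - cosine similarity
print(np.isclose(d, 1 - cos_sim))   # True
```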

3. Jaccard Distance

Great for comparing sets or binary vectors:

# Binary vectors representing features
features_a = np.array([
    [1, 0, 1, 1, 0],  # Product A features
    [0, 1, 1, 0, 1]   # Product B features
])

features_b = np.array([
    [1, 1, 1, 0, 0],  # Product C features
    [0, 0, 1, 1, 1]   # Product D features
])

jaccard_distances = cdist(features_a, features_b, metric='jaccard')
print("Jaccard distances between product features:")
print(jaccard_distances)

Output:

Jaccard distances between product features:
[[0.5 0.5]
 [0.5 0.5]]

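For boolean vectors, the Jaccard distance is 1 minus the ratio of shared features to features present in either vector. Checking Product A against Product C from the arrays above:

```python
import numpy as np
from scipy.spatial.distance import cdist

u = np.array([1, 0, 1, 1, 0], dtype=bool)   # Product A features
v = np.array([1, 1, 1, 0, 0], dtype=bool)   # Product C features

d = cdist(u[None, :], v[None, :], metric='jaccard')[0, 0]
intersection = np.sum(u & v)   # features present in both products
union = np.sum(u | v)          # features present in either product
print(d)   # 0.5, matching the matrix above
```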


Custom Distance Metrics with Cdist

Sometimes, predefined metrics aren’t sufficient. Fortunately, cdist allows us to define custom distance functions:

def weighted_euclidean(u, v, weights=np.array([2.0, 1.0])):
    """Compute weighted Euclidean distance, giving more importance to first dimension."""
    return np.sqrt(np.sum(weights * ((u - v) ** 2)))

# Calculate distances with custom metric
custom_distances = cdist(locations_a, locations_b, metric=weighted_euclidean)
print("Custom weighted distances:")
print(custom_distances)

In this example, we give more weight to the latitude than the longitude in our distance calculation.
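A callable metric runs as Python code per pair, which can be slow for large inputs. For this particular case, SciPy's built-in metrics accept a w keyword for feature weights, which achieves the same weighted Euclidean distance while staying in optimized code (a sketch, reusing the arrays from above):

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[40.7128, -74.0060]])
b = np.array([[40.7061, -73.9969]])
weights = np.array([2.0, 1.0])

# Built-in weighted Euclidean: sqrt(sum(w * (u - v)**2))
d_builtin = cdist(a, b, metric='euclidean', w=weights)[0, 0]
d_manual = np.sqrt(np.sum(weights * (a[0] - b[0]) ** 2))
print(np.isclose(d_builtin, d_manual))   # True
```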

Practical Application: Cluster Similar Data Points

Let’s see how cdist can be used in a real-world clustering scenario. I’ll use it to group similar customer profiles based on spending habits:

# Customer profiles (age, annual_income, spending_score)
customer_group_a = np.array([
    [25, 72000, 85],
    [35, 81000, 75],
    [45, 95000, 60]
])

customer_group_b = np.array([
    [28, 70000, 80],
    [50, 120000, 40],
    [32, 78000, 70],
    [60, 130000, 30]
])

# Calculate distances
customer_distances = cdist(customer_group_a, customer_group_b)

# Find the most similar customers (minimum distance)
for i, customer_a in enumerate(customer_group_a):
    most_similar_idx = np.argmin(customer_distances[i])
    print(f"Customer A{i+1} is most similar to Customer B{most_similar_idx+1}")
    print(f"Profile A{i+1}: {customer_a}")
    print(f"Profile B{most_similar_idx+1}: {customer_group_b[most_similar_idx]}")
    print(f"Distance: {customer_distances[i][most_similar_idx]}\n")
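One caveat with the raw profiles above: annual income is on a much larger scale than age or spending score, so it dominates the Euclidean distance. A common fix, sketched here with simple z-score standardization (my own preprocessing choice, not part of cdist itself), is to scale the features first:

```python
import numpy as np
from scipy.spatial.distance import cdist

group_a = np.array([[25, 72000, 85],
                    [35, 81000, 75],
                    [45, 95000, 60]], dtype=float)
group_b = np.array([[28, 70000, 80],
                    [50, 120000, 40],
                    [32, 78000, 70],
                    [60, 130000, 30]], dtype=float)

# Standardize each feature using statistics from the combined data,
# so age, income, and spending score contribute comparably.
combined = np.vstack([group_a, group_b])
mean, std = combined.mean(axis=0), combined.std(axis=0)
scaled_a = (group_a - mean) / std
scaled_b = (group_b - mean) / std

scaled_distances = cdist(scaled_a, scaled_b)
print(scaled_distances.shape)   # (3, 4)
```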

Visualize Distance Matrices

Visualization can help you understand distance relationships at a glance:

def plot_distance_matrix(distances, title):
    plt.figure(figsize=(10, 8))
    plt.imshow(distances, cmap='viridis')
    plt.colorbar(label='Distance')
    plt.title(title)
    plt.xlabel('Location B Index')
    plt.ylabel('Location A Index')

    # Add distance values as text
    for i in range(distances.shape[0]):
        for j in range(distances.shape[1]):
            plt.text(j, i, f'{distances[i, j]:.2f}', 
                     ha='center', va='center', 
                     color='white' if distances[i, j] > np.mean(distances) else 'black')

    plt.tight_layout()
    plt.show()

# Visualize Euclidean distances
plot_distance_matrix(distances, 'Euclidean Distances Between NYC Locations')

Performance Considerations

When working with large datasets, performance becomes crucial. Here’s a comparison between cdist and a manual implementation:

import time

# Generate larger random datasets
large_a = np.random.rand(1000, 2)
large_b = np.random.rand(1000, 2)

# Time cdist
start_time = time.time()
cdist_result = cdist(large_a, large_b)
cdist_time = time.time() - start_time
print(f"Cdist computation time: {cdist_time:.4f} seconds")

# Time manual implementation
start_time = time.time()
manual_result = np.zeros((large_a.shape[0], large_b.shape[0]))
for i in range(large_a.shape[0]):
    for j in range(large_b.shape[0]):
        manual_result[i, j] = np.sqrt(np.sum((large_a[i] - large_b[j])**2))
manual_time = time.time() - start_time
print(f"Manual computation time: {manual_time:.4f} seconds")
print(f"Speedup factor: {manual_time/cdist_time:.2f}x")

The results typically show that cdist is significantly faster than manual implementations, especially for large datasets.
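A middle ground worth knowing about is a vectorized NumPy version using broadcasting. It avoids the Python loops but materializes an (m, n, k) intermediate array, so cdist is usually still preferable for memory as well as speed (a sketch with small random arrays):

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.random.rand(200, 2)
b = np.random.rand(300, 2)

# Broadcasting: (200, 1, 2) - (1, 300, 2) -> (200, 300, 2) difference array
diff = a[:, None, :] - b[None, :, :]
broadcast_result = np.sqrt((diff ** 2).sum(axis=-1))

print(np.allclose(broadcast_result, cdist(a, b)))   # True
```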

Practical Tips for Using Cdist Effectively

  1. Memory considerations: Distance matrices grow quadratically with input size. For very large datasets, consider computing distances in batches.
  2. Preprocessing: Normalize your data before calculating distances if features have different scales.
  3. Choosing the right metric: Select a distance metric that matches your data characteristics and application needs.
  4. Sparse data: For sparse data, use specialized metrics like cosine or Jaccard.
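To illustrate tip 1, here is one way batching might look when you only need, say, each point's nearest neighbour: process XA in blocks so the full distance matrix never exists in memory at once (batched_min_distances is my own illustrative helper, not part of SciPy):

```python
import numpy as np
from scipy.spatial.distance import cdist

def batched_min_distances(XA, XB, batch_size=256):
    """For each row of XA, return the distance to its nearest row of XB,
    computing distances one batch of XA at a time to limit peak memory."""
    mins = np.empty(XA.shape[0])
    for start in range(0, XA.shape[0], batch_size):
        block = cdist(XA[start:start + batch_size], XB)
        mins[start:start + batch_size] = block.min(axis=1)
    return mins

XA = np.random.rand(1000, 3)
XB = np.random.rand(500, 3)
nearest = batched_min_distances(XA, XB)
print(nearest.shape)   # (1000,)
```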

I’ve found that cdist has become an essential tool in my data science toolkit. Whether I’m working on recommendation systems, clustering algorithms, or similarity searches, this function consistently delivers accurate results with minimal coding effort.

If you’re working with distance calculations in Python, I highly recommend integrating cdist into your workflow. It’s well-documented, highly optimized, and supports a wide range of distance metrics for various applications.
