Recently, I was working on a machine learning project where I needed to calculate distances between multiple data points efficiently. The challenge was computing distances between two collections of inputs without writing nested loops. That’s when I discovered SciPy’s spatial distance cdist function.
In this article, I’ll share how to use the cdist function to calculate pairwise distances between points in different datasets. This powerful tool has saved me countless hours of coding and significantly improved my data analysis workflow.
Let’s dive in and explore this essential SciPy functionality together!
What is SciPy’s Spatial Distance Cdist?
The cdist function lives in SciPy’s scipy.spatial.distance module. It computes the distance between every pair of observations, taking one observation from each of two collections.
Unlike manual implementations, cdist is highly optimized and can handle various distance metrics efficiently.
Here’s the basic syntax:
from scipy.spatial.distance import cdist
result = cdist(XA, XB, metric='euclidean')
Where:
- XA: First collection of input vectors
- XB: Second collection of input vectors
- metric: The distance metric to use
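The output is a distance matrix: if XA has m rows and XB has n rows, the result has shape (m, n), and the element at row i, column j is the distance between XA[i] and XB[j]. Both collections must have the same number of columns (features).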
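To make the indexing concrete, here is a minimal sketch on made-up points (the arrays below are illustrative, not a real dataset). The first collection has 2 vectors and the second has 3, so the result is a 2 x 3 matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative points: 2 vectors in XA, 3 vectors in XB (all 2-dimensional)
XA = np.array([[0.0, 0.0],
               [3.0, 4.0]])
XB = np.array([[0.0, 0.0],
               [3.0, 0.0],
               [0.0, 4.0]])

result = cdist(XA, XB, metric='euclidean')

print(result.shape)  # one row per XA vector, one column per XB vector
print(result)        # result[1, 0] is the distance from (3, 4) to (0, 0)
```

Here result[1, 0] is 5.0, the familiar 3-4-5 right triangle.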
Set Up Your Environment
Before we begin using the cdist function, let’s make sure we have the necessary packages installed:
# Install required packages
# pip install numpy scipy matplotlib
# Import necessary libraries
import numpy as np
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
Use Cdist with Euclidean Distance (Default Metric)
The Euclidean distance is the most common metric and is the default for the cdist function.
Let’s see a practical example using the coordinates of popular tourist attractions in New York City:
# Coordinates of NYC attractions (latitude, longitude)
locations_a = np.array([
    [40.7128, -74.0060],  # NYC Downtown
    [40.7484, -73.9857],  # Empire State Building
    [40.7580, -73.9855]   # Times Square
])
locations_b = np.array([
    [40.7061, -73.9969],  # Brooklyn Bridge
    [40.7527, -73.9772],  # Grand Central
    [40.7516, -73.9776],  # New York Public Library
    [40.7794, -73.9632]   # Metropolitan Museum of Art
])
# Calculate Euclidean distances
distances = cdist(locations_a, locations_b)
print("Euclidean distances between NYC locations:")
print(distances)
Output:
Euclidean distances between NYC locations:
[[0.01130044 0.04920823 0.04808326 0.07916691]
 [0.04375763 0.00952575 0.00870919 0.0383047 ]
 [0.05313728 0.00984784 0.0101671  0.03090712]]

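The resulting matrix shows the Euclidean distance between each pair of locations. Because the inputs are latitude and longitude, these values are in degrees rather than miles or kilometers, so treat them as a rough indication of relative proximity (for example, when sketching a tourist route) rather than as true travel distances.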
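If you need approximate real-world distances from latitude/longitude pairs, one option is to pass a haversine function to cdist as a custom metric. The sketch below assumes a spherical Earth with a mean radius of 6371 km. The helper name haversine_km is illustrative and not part of SciPy.

```python
import numpy as np
from scipy.spatial.distance import cdist

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model

def haversine_km(u, v):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    lat1, lon1 = np.radians(u)
    lat2, lon2 = np.radians(v)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# cdist expects 2-D inputs, so each single point is a one-row array
empire_state = np.array([[40.7484, -73.9857]])
times_square = np.array([[40.7580, -73.9855]])

km = cdist(empire_state, times_square, metric=haversine_km)
print(f"Empire State Building to Times Square: {km[0, 0]:.2f} km")
```

This gives a straight-line distance over the Earth's surface, which is still not the same as walking or driving distance along streets.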
Cdist with Different Distance Metrics
One of the most useful features of cdist is its support for multiple distance metrics. Here are some common ones:
1. Manhattan Distance (Cityblock)
Perfect for grid-like urban environments like Manhattan, where movement is restricted to perpendicular directions:
manhattan_distances = cdist(locations_a, locations_b, metric='cityblock')
print("Manhattan distances between NYC locations:")
print(manhattan_distances)
Output:
Manhattan distances between NYC locations:
[[0.0158 0.0687 0.0672 0.1094]
 [0.0535 0.0128 0.0113 0.0535]
 [0.0633 0.0136 0.0143 0.0437]]

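Switching metrics only requires changing the metric string. As a quick sketch on made-up points (not the NYC data), the code below runs three built-in metrics and checks a relationship that holds for any pair of points: Chebyshev distance is at most the Euclidean distance, which is at most the Manhattan distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative 2-D points
points_a = np.array([[0.0, 0.0], [1.0, 2.0]])
points_b = np.array([[3.0, 4.0], [-1.0, 0.5]])

# Compute one distance matrix per metric name
results = {name: cdist(points_a, points_b, metric=name)
           for name in ('chebyshev', 'euclidean', 'cityblock')}

for name, matrix in results.items():
    print(name)
    print(matrix)

# Largest single-axis gap <= straight-line distance <= sum of axis gaps
print(np.all(results['chebyshev'] <= results['euclidean']))
print(np.all(results['euclidean'] <= results['cityblock']))
```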
2. Cosine Distance
Ideal for comparing document vectors in natural language processing:
# Example document vectors (word frequencies)
document_a = np.array([
    [5, 0, 3, 1],  # Document 1
    [2, 4, 0, 2]   # Document 2
])
document_b = np.array([
    [1, 1, 2, 0],  # Document 3
    [3, 2, 0, 1]   # Document 4
])
cosine_distances = cdist(document_a, document_b, metric='cosine')
print("Cosine distances between documents:")
print(cosine_distances)
3. Jaccard Distance
Great for comparing sets or binary vectors:
# Binary vectors representing features
features_a = np.array([
    [1, 0, 1, 1, 0],  # Product A features
    [0, 1, 1, 0, 1]   # Product B features
])
features_b = np.array([
    [1, 1, 1, 0, 0],  # Product C features
    [0, 0, 1, 1, 1]   # Product D features
])
jaccard_distances = cdist(features_a, features_b, metric='jaccard')
print("Jaccard distances between product features:")
print(jaccard_distances)
Output:
Jaccard distances between product features:
[[0.5 0.5]
 [0.5 0.5]]

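To see where a value of 0.5 comes from, the Jaccard distance between two binary vectors can be computed by hand: count the positions where exactly one vector is 1, then divide by the number of positions where at least one is 1. The sketch below does this for the Product A and Product C vectors and compares the result with cdist. Casting to bool is my own choice here, to make the binary intent explicit.

```python
import numpy as np
from scipy.spatial.distance import cdist

product_a = np.array([1, 0, 1, 1, 0], dtype=bool)
product_c = np.array([1, 1, 1, 0, 0], dtype=bool)

mismatches = np.sum(product_a != product_c)  # positions where exactly one is True
union = np.sum(product_a | product_c)        # positions where at least one is True
manual = mismatches / union

# cdist expects 2-D inputs, so wrap each vector as a single-row matrix
from_cdist = cdist(product_a[np.newaxis, :], product_c[np.newaxis, :],
                   metric='jaccard')[0, 0]

print(manual, from_cdist)
```

The two products disagree on 2 of the 4 features that at least one of them has, which gives 2 / 4 = 0.5.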
Custom Distance Metrics with Cdist
Sometimes, predefined metrics aren’t sufficient. Fortunately, cdist allows us to define custom distance functions:
def weighted_euclidean(u, v, weights=np.array([2.0, 1.0])):
    """Compute weighted Euclidean distance, giving more importance to the first dimension."""
    return np.sqrt(np.sum(weights * ((u - v) ** 2)))
# Calculate distances with custom metric
custom_distances = cdist(locations_a, locations_b, metric=weighted_euclidean)
print("Custom weighted distances:")
print(custom_distances)
In this example, we give more weight to the latitude than the longitude in our distance calculation.
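One caveat: a Python callable is invoked once per pair of points, which can be slow on large inputs. For this particular metric there is an equivalent that stays on the built-in Euclidean path: scale each column by the square root of its weight, then use the plain Euclidean metric. The sketch below checks that both approaches agree on made-up points, using the same weights of [2.0, 1.0].

```python
import numpy as np
from scipy.spatial.distance import cdist

weights = np.array([2.0, 1.0])

def weighted_euclidean(u, v):
    """Weighted Euclidean distance using the module-level weights."""
    return np.sqrt(np.sum(weights * (u - v) ** 2))

# Illustrative points
pts_a = np.array([[0.0, 0.0], [1.0, 1.0]])
pts_b = np.array([[1.0, 0.0], [2.0, 3.0], [0.0, 5.0]])

via_callable = cdist(pts_a, pts_b, metric=weighted_euclidean)

# sqrt(sum(w * (u - v)**2)) equals the Euclidean distance
# between sqrt(w) * u and sqrt(w) * v
scale = np.sqrt(weights)
via_scaling = cdist(pts_a * scale, pts_b * scale, metric='euclidean')

print(np.allclose(via_callable, via_scaling))
```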
Practical Application: Cluster Similar Data Points
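Let’s see how cdist can be used in a real-world similarity-matching scenario, which is a building block of many clustering methods. I’ll use it to match each customer in one group to the most similar profile in another group, based on age, income, and spending score (the features are left unscaled here to keep the example short):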
# Customer profiles (age, annual_income, spending_score)
customer_group_a = np.array([
    [25, 72000, 85],
    [35, 81000, 75],
    [45, 95000, 60]
])
customer_group_b = np.array([
    [28, 70000, 80],
    [50, 120000, 40],
    [32, 78000, 70],
    [60, 130000, 30]
])
# Calculate distances
customer_distances = cdist(customer_group_a, customer_group_b)
# Find the most similar customers (minimum distance)
for i, customer_a in enumerate(customer_group_a):
    most_similar_idx = np.argmin(customer_distances[i])
    print(f"Customer A{i+1} is most similar to Customer B{most_similar_idx+1}")
    print(f"Profile A{i+1}: {customer_a}")
    print(f"Profile B{most_similar_idx+1}: {customer_group_b[most_similar_idx]}")
    print(f"Distance: {customer_distances[i][most_similar_idx]}\n")
Visualize Distance Matrices
Visualization can help understand distance relationships better:
def plot_distance_matrix(distances, title):
    plt.figure(figsize=(10, 8))
    plt.imshow(distances, cmap='viridis')
    plt.colorbar(label='Distance')
    plt.title(title)
    plt.xlabel('Location B Index')
    plt.ylabel('Location A Index')
    # Add distance values as text (three decimals, since these distances are small)
    for i in range(distances.shape[0]):
        for j in range(distances.shape[1]):
            # viridis is dark for low values and bright for high values,
            # so use white text on low cells and black text on high cells
            plt.text(j, i, f'{distances[i, j]:.3f}',
                     ha='center', va='center',
                     color='white' if distances[i, j] < np.mean(distances) else 'black')
    plt.tight_layout()
    plt.show()
# Visualize Euclidean distances
plot_distance_matrix(distances, 'Euclidean Distances Between NYC Locations')
Performance Considerations
When working with large datasets, performance becomes crucial. Here’s a comparison between cdist and a manual implementation:
import time
# Generate larger random datasets
large_a = np.random.rand(1000, 2)
large_b = np.random.rand(1000, 2)
# Time cdist (perf_counter is a high-resolution timer suited to benchmarking)
start_time = time.perf_counter()
cdist_result = cdist(large_a, large_b)
cdist_time = time.perf_counter() - start_time
print(f"Cdist computation time: {cdist_time:.4f} seconds")
# Time manual implementation with nested Python loops
start_time = time.perf_counter()
manual_result = np.zeros((large_a.shape[0], large_b.shape[0]))
for i in range(large_a.shape[0]):
    for j in range(large_b.shape[0]):
        manual_result[i, j] = np.sqrt(np.sum((large_a[i] - large_b[j])**2))
manual_time = time.perf_counter() - start_time
print(f"Manual computation time: {manual_time:.4f} seconds")
print(f"Speedup factor: {manual_time/cdist_time:.2f}x")
The exact numbers depend on your hardware, but cdist is typically far faster than the nested Python loop because the pairwise loop runs in compiled code, and the gap widens as the datasets grow.
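Speed is one constraint, and memory is the other, since the full result holds len(XA) * len(XB) floating-point values. When only a per-row summary is needed, such as the nearest neighbor in the second collection, one approach is to call cdist on slices of the first collection and keep only that summary. This is a sketch with a hypothetical helper, nearest_in_batches, and an arbitrary batch size, not a tuned implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_in_batches(XA, XB, batch_size=256):
    """For each row of XA, return the index of and distance to its nearest row in XB.

    Only a (batch_size x len(XB)) block of distances exists at any one time.
    """
    nearest_idx = np.empty(len(XA), dtype=np.intp)
    nearest_dist = np.empty(len(XA))
    for start in range(0, len(XA), batch_size):
        stop = start + batch_size
        block = cdist(XA[start:stop], XB)  # distances for this slice only
        idx = np.argmin(block, axis=1)
        nearest_idx[start:stop] = idx
        nearest_dist[start:stop] = block[np.arange(len(idx)), idx]
    return nearest_idx, nearest_dist

rng = np.random.default_rng(0)
XA = rng.random((1000, 2))
XB = rng.random((500, 2))

idx, dist = nearest_in_batches(XA, XB, batch_size=128)

# Sanity check against the full matrix, which is small enough to build here
full = cdist(XA, XB)
print(np.array_equal(idx, full.argmin(axis=1)))
```

The trade-off is a Python-level loop over batches, so very small batch sizes give back some of the speed advantage.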
Practical Tips for Using Cdist Effectively
- Memory considerations: Distance matrices grow quadratically with input size. For very large datasets, consider computing distances in batches.
- Preprocessing: Normalize your data before calculating distances if features have different scales.
- Choosing the right metric: Select a distance metric that matches your data characteristics and application needs.
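- Sparse data: For high-dimensional data with many zeros, metrics like cosine or Jaccard are often a better fit than Euclidean. Note that cdist expects dense array inputs, so SciPy sparse matrices need to be converted first.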
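The preprocessing tip is easy to try on the customer profiles from earlier. Income is in the tens of thousands while age and spending score are below 100, so raw Euclidean distance is driven almost entirely by income. One common remedy is z-score standardization using statistics from both groups combined. Choosing z-scores rather than, say, min-max scaling is an assumption on my part.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Customer profiles (age, annual_income, spending_score), as in the earlier example
customer_group_a = np.array([[25, 72000, 85],
                             [35, 81000, 75],
                             [45, 95000, 60]], dtype=float)
customer_group_b = np.array([[28, 70000, 80],
                             [50, 120000, 40],
                             [32, 78000, 70],
                             [60, 130000, 30]], dtype=float)

# Per-feature mean and standard deviation over both groups together,
# so both collections are scaled the same way
combined = np.vstack([customer_group_a, customer_group_b])
mean = combined.mean(axis=0)
std = combined.std(axis=0)

scaled_a = (customer_group_a - mean) / std
scaled_b = (customer_group_b - mean) / std

raw = cdist(customer_group_a, customer_group_b)
scaled = cdist(scaled_a, scaled_b)

print("Nearest B for each A (raw):   ", raw.argmin(axis=1))
print("Nearest B for each A (scaled):", scaled.argmin(axis=1))
```

After scaling, each feature contributes on a comparable footing, so it is worth checking whether the matches change for your own data.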
I’ve found that cdist has become an essential tool in my data science toolkit. Whether I’m working on recommendation systems, clustering algorithms, or similarity searches, this function consistently delivers accurate results with minimal coding effort.
If you’re working with distance calculations in Python, I highly recommend integrating cdist into your workflow. It’s well-documented, highly optimized, and supports a wide range of distance metrics for various applications.
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last 5 years. During this time, I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and Scikit-Learn, working for clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and elsewhere.