Python SciPy fcluster: Hierarchical Clustering

Recently, I was working on a data science project where I needed to group similar data points. The challenge was finding an efficient way to perform hierarchical clustering and extract meaningful clusters from my dataset. That’s when I discovered SciPy’s fcluster function, a powerful tool that made this complex task surprisingly easy.

In this article, I’ll share how to use SciPy’s fcluster function to create and analyze hierarchical clusters in Python. I’ll cover everything from basic implementation to practical examples that you can apply to your projects.

Let’s get started.

What is SciPy’s fcluster Function?

The fcluster function lives in SciPy’s scipy.cluster.hierarchy module and works hand-in-hand with the linkage function. It converts the hierarchical representation (the linkage matrix, which a dendrogram visualizes) into flat clusters that are easier to work with and interpret.

Think of fcluster as the tool that cuts the dendrogram at a specific height to give you distinct groups of data points.
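
As a minimal sketch of that idea, here is a cut applied to six made-up one-dimensional points. The data values and the cut height of 2.0 are illustrative assumptions, not part of the examples that follow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six illustrative 1-D points forming two obvious groups (hypothetical data)
points = np.array([[1.0], [1.2], [1.1], [8.0], [8.3], [8.1]])

# Build the merge tree; each row of Z is [cluster_a, cluster_b, distance, size]
Z = linkage(points, method='single')

# Cut the tree at height 2.0: points joined below that height share a label
labels = fcluster(Z, t=2.0, criterion='distance')

print(Z.shape)   # (n - 1, 4) for n observations
print(labels)    # one integer label per input point
```

The labels are plain integers starting at 1, in the same order as the input rows, so they can be used directly as a mask or as a color array.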

Set Up Your Environment

Before we start clustering, make sure NumPy, SciPy, and Matplotlib are installed (for example with pip install numpy scipy matplotlib), then import what we need:

# Import necessary libraries
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

Method 1: Use fcluster with Distance Threshold

The simplest way to use fcluster is by specifying a distance threshold. This approach cuts the dendrogram at a specific height.

Here’s a simple example using customer purchase data:

# Sample data: Customer purchase amounts ($) and frequency (times per month)
data = np.array([
    [120, 5],  # Customer 1: Spends $120, shops 5 times monthly
    [105, 6],  # Customer 2
    [500, 2],  # Customer 3
    [45, 8],   # Customer 4
    [510, 1],  # Customer 5
    [55, 7],   # Customer 6
    [95, 6],   # Customer 7
    [540, 3]   # Customer 8
])

# Compute the linkage matrix
Z = linkage(data, method='ward')

# Form flat clusters using a distance threshold of 150
clusters = fcluster(Z, 150, criterion='distance')

# Print the cluster assignments
print("Customer Cluster Assignments:", clusters)

# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', s=100)
plt.xlabel('Purchase Amount ($)')
plt.ylabel('Purchase Frequency (monthly)')
plt.title('Customer Segmentation using fcluster')
plt.colorbar(label='Cluster')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

You can see the output in the screenshot below:

In this example, we’re segmenting customers based on their purchasing behavior. The fcluster function groups them by cutting the dendrogram at a distance of 150, which in this case gives us a clear separation between high-spenders who shop less frequently and lower-spenders who shop more often.
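
If you are unsure which threshold to pick, one option is to look at the merge heights stored in the third column of the linkage matrix. The sketch below reuses the customer data from above and applies a simple "largest gap" heuristic. That heuristic is my own assumption for illustration, not something fcluster does for you.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same customer data as above: [purchase amount ($), visits per month]
data = np.array([[120, 5], [105, 6], [500, 2], [45, 8],
                 [510, 1], [55, 7], [95, 6], [540, 3]])

Z = linkage(data, method='ward')

# Column 2 of the linkage matrix holds the height of each merge, lowest first
heights = Z[:, 2]
print("Merge heights:", np.round(heights, 1))

# Heuristic (an assumption, not a rule): cut in the middle of the largest gap
gaps = np.diff(heights)
i = int(np.argmax(gaps))
threshold = (heights[i] + heights[i + 1]) / 2

clusters = fcluster(Z, threshold, criterion='distance')
print(f"Suggested threshold: {threshold:.1f}, clusters found: {clusters.max()}")
```

For this dataset, the threshold of 150 used above falls inside that same largest gap, which is why it separates the two spending groups cleanly.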

Method 2: Use fcluster with Maxclust Criterion

Sometimes you already know how many clusters you want, so choosing a distance threshold is unnecessary. In such cases, the ‘maxclust’ criterion is a good fit. It finds the lowest cut that produces no more than the requested number of clusters:

# Ask for at most 3 clusters (this dataset yields exactly 3)
n_clusters = 3
clusters = fcluster(Z, n_clusters, criterion='maxclust')

print("Customer Clusters with maxclust:", clusters)

# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', s=100)
plt.xlabel('Purchase Amount ($)')
plt.ylabel('Purchase Frequency (monthly)')
plt.title('Customer Segmentation with 3 Clusters')
plt.colorbar(label='Cluster')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

You can see the output in the screenshot below.

This approach is particularly useful when you’re working on marketing segmentation problems where you need a specific number of customer segments for targeted campaigns.
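
To turn the label array into something a marketing team can read, you can group customer numbers by label. This is a small sketch on the same customer data. The segments dictionary and the printed summary format are illustrative choices of mine.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same customer data as above: [purchase amount ($), visits per month]
data = np.array([[120, 5], [105, 6], [500, 2], [45, 8],
                 [510, 1], [55, 7], [95, 6], [540, 3]])

Z = linkage(data, method='ward')
labels = fcluster(Z, 3, criterion='maxclust')

# 'maxclust' promises at most 3 clusters, so check how many were formed
n_found = len(np.unique(labels))
print("Clusters formed:", n_found)

# Map each cluster label to the 1-based customer numbers it contains
segments = {int(k): (np.flatnonzero(labels == k) + 1).tolist()
            for k in np.unique(labels)}

for label, members in segments.items():
    avg_spend = data[labels == label, 0].mean()
    print(f"Cluster {label}: customers {members}, average spend ${avg_spend:.0f}")
```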

Method 3: Visualize Clusters with Dendrograms

To better understand how fcluster works, it’s helpful to visualize the hierarchical clustering process using dendrograms:

plt.figure(figsize=(12, 6))
dendrogram(Z)
plt.axhline(y=150, color='r', linestyle='--', label='Distance Threshold')
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.legend()
plt.show()

# Form clusters using the same threshold as our horizontal line
clusters = fcluster(Z, 150, criterion='distance')
print("Clusters from dendrogram cut:", clusters)

You can see the output in the screenshot below.

The dendrogram visually represents how our data points are merged into clusters at different distance levels. The horizontal red line shows where fcluster cuts the tree to form our flat clusters.
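
The link between the red line and the number of clusters can also be checked numerically: every merge that sits above the line is undone, so the cluster count is the number of such merges plus one. The sketch below assumes the same customer data and uses no_plot=True so that dendrogram only returns its layout. The customer names are hypothetical labels.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Same customer data as above: [purchase amount ($), visits per month]
data = np.array([[120, 5], [105, 6], [500, 2], [45, 8],
                 [510, 1], [55, 7], [95, 6], [540, 3]])

Z = linkage(data, method='ward')
names = [f"Customer {i + 1}" for i in range(len(data))]  # hypothetical labels

# no_plot=True computes the tree layout without drawing anything
tree = dendrogram(Z, labels=names, color_threshold=150, no_plot=True)
print("Leaf order, left to right:", tree['ivl'])

# Merges above the cut height are the ones the horizontal line crosses out
n_above = int(np.sum(Z[:, 2] > 150))
clusters = fcluster(Z, 150, criterion='distance')
print("Merges above the line + 1:", n_above + 1)
print("Clusters from fcluster:   ", clusters.max())
```

Dropping no_plot=True draws the same tree with readable customer labels, and color_threshold=150 colors the branches below the cut by cluster.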

Method 4: Use fcluster with Different Linkage Methods

The linkage method you choose can significantly impact your clustering results. Let’s compare a few:

# Different linkage methods
methods = ['single', 'complete', 'average', 'ward']
plt.figure(figsize=(15, 10))

for i, method in enumerate(methods):
    # Compute linkage
    Z = linkage(data, method=method)

    # Form clusters
    clusters = fcluster(Z, 3, criterion='maxclust')

    # Plot
    plt.subplot(2, 2, i+1)
    plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', s=100)
    plt.title(f'Clusters using {method} linkage')
    plt.xlabel('Purchase Amount ($)')
    plt.ylabel('Purchase Frequency (monthly)')
    plt.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

In my experience, ‘ward’ linkage often produces the most intuitive clusters for business data, while ‘single’ linkage can be useful for detecting outliers in your customer dataset.
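
Alongside the plots, a quick numeric comparison can help. This sketch counts how many customers land in each cluster for every linkage method. The sizes_by_method dictionary is just an illustrative name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same customer data as above: [purchase amount ($), visits per month]
data = np.array([[120, 5], [105, 6], [500, 2], [45, 8],
                 [510, 1], [55, 7], [95, 6], [540, 3]])

sizes_by_method = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(data, method=method)
    labels = fcluster(Z, 3, criterion='maxclust')
    # Labels start at 1, so drop the empty bin for label 0
    sizes_by_method[method] = np.bincount(labels)[1:].tolist()

for method, sizes in sizes_by_method.items():
    print(f"{method:>8}: cluster sizes {sizes}")
```

A method that returns one very large cluster plus a few singletons is often a sign of chaining, which is typical for ‘single’ linkage on noisier data.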

Method 5: Evaluate Cluster Quality

After creating clusters, you’ll want to evaluate how well they represent your data:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Compute the cophenetic correlation coefficient
# (Z is the 'ward' linkage left over from the last loop iteration above)
c, coph_dists = cophenet(Z, pdist(data))
print(f"Cophenetic Correlation Coefficient: {c}")

# Compute the average silhouette score (requires scikit-learn)
from sklearn.metrics import silhouette_score

# silhouette_score needs at least 2 clusters and fewer clusters than samples
n_labels = len(np.unique(clusters))
if 2 <= n_labels < len(data):
    silhouette_avg = silhouette_score(data, clusters)
    print(f"Silhouette Score: {silhouette_avg}")

A higher cophenetic correlation coefficient (closer to 1) indicates that the clustering preserves the original pairwise distances well. Similarly, a higher silhouette score (the scale runs from -1 to 1) suggests better-defined clusters.
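
Because the cophenetic coefficient depends on the linkage matrix, it can also be used to compare linkage methods on the same data. Here is a minimal sketch using only SciPy. Treat the ranking as a rough guide, since a high coefficient does not by itself mean the flat clusters are useful.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Same customer data as above: [purchase amount ($), visits per month]
data = np.array([[120, 5], [105, 6], [500, 2], [45, 8],
                 [510, 1], [55, 7], [95, 6], [540, 3]])

# Condensed vector of original pairwise Euclidean distances
original = pdist(data)

scores = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(data, method=method)
    # cophenet returns (correlation, cophenetic distances) when given the original distances
    c, _ = cophenet(Z, original)
    scores[method] = float(c)

for method, c in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method:>8}: cophenetic correlation = {c:.3f}")
```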

Real-World Application: Market Segmentation

Let’s apply fcluster to a larger scenario: segmenting the U.S. smartphone market based on price sensitivity and feature preferences. The data here is synthetic, generated to contain four known groups:

# Synthetic data representing U.S. smartphone consumers
# [Price sensitivity (0-10), Feature importance (0-10)]
np.random.seed(42)
n_samples = 200
smartphone_consumers = np.zeros((n_samples, 2))

# Create four natural clusters
# Budget-conscious, feature-indifferent
smartphone_consumers[:50, 0] = np.random.normal(8, 1, 50)  # High price sensitivity
smartphone_consumers[:50, 1] = np.random.normal(3, 1, 50)  # Low feature importance

# Budget-conscious, tech enthusiasts
smartphone_consumers[50:100, 0] = np.random.normal(7, 1, 50)  # High price sensitivity
smartphone_consumers[50:100, 1] = np.random.normal(8, 1, 50)  # High feature importance

# Premium buyers, feature-focused
smartphone_consumers[100:150, 0] = np.random.normal(3, 1, 50)  # Low price sensitivity
smartphone_consumers[100:150, 1] = np.random.normal(9, 1, 50)  # High feature importance

# Premium buyers, brand-focused
smartphone_consumers[150:, 0] = np.random.normal(2, 1, 50)  # Low price sensitivity
smartphone_consumers[150:, 1] = np.random.normal(5, 1, 50)  # Medium feature importance

# Clip values to stay within 0-10 range
smartphone_consumers = np.clip(smartphone_consumers, 0, 10)

# Perform hierarchical clustering
Z = linkage(smartphone_consumers, method='ward')

# Get clusters (we expect 4 natural segments)
clusters = fcluster(Z, 4, criterion='maxclust')

# Visualize the market segments
plt.figure(figsize=(10, 8))
plt.scatter(smartphone_consumers[:, 0], smartphone_consumers[:, 1], 
            c=clusters, cmap='viridis', s=50, alpha=0.8)
plt.xlabel('Price Sensitivity (higher = more sensitive)')
plt.ylabel('Feature Importance (higher = more important)')
plt.title('U.S. Smartphone Market Segmentation')
plt.colorbar(label='Market Segment')
plt.grid(True, linestyle='--', alpha=0.7)

# Add segment labels
segment_centers = []
for i in range(1, max(clusters)+1):
    mask = clusters == i
    center_x = np.mean(smartphone_consumers[mask, 0])
    center_y = np.mean(smartphone_consumers[mask, 1])
    segment_centers.append((center_x, center_y))
    plt.annotate(f'Segment {i}', (center_x, center_y), 
                 fontsize=12, fontweight='bold')

plt.show()

# Print segment characteristics
for i in range(1, max(clusters)+1):
    mask = clusters == i
    price_sens = np.mean(smartphone_consumers[mask, 0])
    feature_imp = np.mean(smartphone_consumers[mask, 1])
    size = np.sum(mask)
    print(f"Segment {i}: {size} consumers, Avg price sensitivity: {price_sens:.2f}, "
          f"Avg feature importance: {feature_imp:.2f}")

This example shows how you can use fcluster to identify distinct customer segments, here in synthetic smartphone-market data, which could help marketers develop targeted campaigns for each group.
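
Since the data is synthetic, we know which group each consumer was generated from, so we can check how well the recovered segments match. The sketch below is a compact re-creation of the dataset using NumPy’s default_rng, so its random draws differ from the code above. The true_group array and the purity measure are my own illustrative additions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

# Centers as (price sensitivity, feature importance), matching the four groups above
centers = [(8, 3), (7, 8), (3, 9), (2, 5)]
consumers = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centers])
consumers = np.clip(consumers, 0, 10)

# Known generating group for each row (only available because the data is synthetic)
true_group = np.repeat(np.arange(4), 50)

Z = linkage(consumers, method='ward')
labels = fcluster(Z, 4, criterion='maxclust')

# Contingency table: rows = generating group, columns = recovered cluster
table = np.zeros((4, labels.max()), dtype=int)
np.add.at(table, (true_group, labels - 1), 1)
print(table)

# Purity: share of consumers that belong to the majority group of their cluster
purity = table.max(axis=0).sum() / len(labels)
print(f"Purity: {purity:.2f}")
```

A table with one dominant entry per column means the clustering recovered the generating groups. Off-diagonal counts show where neighboring segments overlap.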

Tips for Using fcluster Effectively

Based on my experience, here are some practical tips for getting the most out of fcluster:

Standardize your data: If your features have different scales, standardize them before clustering.
Try multiple linkage methods: Different methods can reveal different patterns in your data.
Visualize before deciding: Always look at the dendrogram to get a sense of where to cut for meaningful clusters.
Validate with domain knowledge: Make sure the resulting clusters make sense in your business context.
Consider other clustering methods: Sometimes, k-means or DBSCAN might be more appropriate for your specific problem.
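
The first tip matters for the customer example above, where purchase amounts run into the hundreds while visit counts stay below ten, so the amount dominates the Euclidean distance. A minimal sketch of standardizing first, assuming z-scores from scipy.stats.zscore are an acceptable scaling for your data:

```python
import numpy as np
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, fcluster

# Same customer data as above: [purchase amount ($), visits per month]
data = np.array([[120, 5], [105, 6], [500, 2], [45, 8],
                 [510, 1], [55, 7], [95, 6], [540, 3]])

# Rescale each column to mean 0 and standard deviation 1
scaled = zscore(data, axis=0)

Z_scaled = linkage(scaled, method='ward')

# Distances are now in standard-deviation units, so a threshold such as 150
# no longer applies; asking for a cluster count avoids picking a new one
labels = fcluster(Z_scaled, 2, criterion='maxclust')
print("Labels on standardized data:", labels)
```

After scaling, both features contribute comparably to the distance, so it is worth re-reading the dendrogram before reusing any threshold chosen on the raw data.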

Python’s SciPy library makes hierarchical clustering accessible and powerful through the fcluster function. Whether you’re segmenting customers, grouping similar products, or analyzing any dataset with natural groupings, fcluster provides a flexible way to extract meaningful insights.

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers.