Python Scipy Fcluster

This tutorial covers “Python Scipy Fcluster”, which groups similar observations into one or more flat clusters. We will walk through the steps behind clustering data points under the following topics.

  • What is clustering?
  • How to create a cluster in Python Scipy
  • Python Scipy Fcluster T
  • How to get the required cluster using the Maxclust
  • Python Scipy Cluster Inconsistent
  • Python Scipy Fcluster Data

What is clustering?

Clustering is an unsupervised machine learning task. It is also called cluster analysis.

When using a clustering method, we provide the algorithm with a large amount of unlabeled input data and let it identify whatever groups it can find in that data.

These groups are known as clusters. A cluster is a collection of data points that are similar to one another, judged by how they relate to the surrounding data points. Pattern discovery and feature engineering are two applications of clustering.

The fundamental idea behind clustering is the division of a set of observations into subgroups, or clusters, so that observations belonging to the same cluster share some characteristics.

Python Scipy Fcluster

The method fcluster() in the module scipy.cluster.hierarchy of Python Scipy creates flat clusters from the hierarchical clustering defined by the given linkage matrix.

The syntax is given below.

scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)

Where parameters are:

  • Z(ndarray): The hierarchical clustering encoded as the linkage matrix returned by a linkage function.
  • t(scalar): For the ‘inconsistent’, ‘distance’ or ‘monocrit’ criteria, this is the threshold applied when forming flat clusters. For the ‘maxclust’ or ‘maxclust_monocrit’ criteria, it is the maximum number of clusters requested.
  • criterion(string): The criterion used to form flat clusters. It can be any of the following values: inconsistent, distance, maxclust, monocrit and maxclust_monocrit.
  • depth(int): The maximum depth at which the inconsistency calculation is performed. It has no meaning for the other criteria. The default is 2.
  • R(ndarray): The inconsistency matrix to use for the ‘inconsistent’ criterion. If not given, this matrix is computed.
  • monocrit(ndarray): An array of length n-1. monocrit[i] is the statistic on which the non-singleton cluster i is thresholded. The monocrit vector must be monotonic, meaning that given a node c with index i, monocrit[i] >= monocrit[j] for all node indices j corresponding to nodes below c.

The method fcluster() returns fclusters, an array of length n in which T[i] is the flat cluster number to which the original observation i belongs.

Let’s take an example by following the below steps:

Import the required libraries or methods using the below python code.

from scipy.cluster import hierarchy
from scipy.spatial import distance

The output of any linkage method, such as scipy.cluster.hierarchy.ward, is a linkage matrix Z. First create an array X_ of twelve two-dimensional points, which this tutorial treats as locations of US cities, using the below code.

X_ = [[0, 0], [0, 2], [2, 0],
     [0, 5], [0, 4], [2, 5],
     [5, 0], [4, 0], [5, 2],
     [5, 5], [4, 5], [5, 4]]

Compute the condensed pairwise-distance matrix of the input data X_ using the method pdist() and pass it to the clustering method ward() using the below code.

Z_ = hierarchy.ward(distance.pdist(X_))
Z_

The above matrix represents a dendrogram, with one row per merge. The first and second elements of each row are the indices of the two clusters that were combined at that step. The third element is the distance between the two clusters, and the fourth element is the size of the new cluster, that is, the number of original data points it contains.
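To make the column layout concrete, the following sketch rebuilds Z_ from the same twelve points and splits it into its columns. The helper names (merged_pairs, merge_heights, new_sizes) are illustrative only, and NumPy is assumed to be installed alongside SciPy.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X_ = [[0, 0], [0, 2], [2, 0],
      [0, 5], [0, 4], [2, 5],
      [5, 0], [4, 0], [5, 2],
      [5, 5], [4, 5], [5, 4]]

Z_ = hierarchy.ward(distance.pdist(X_))

# n observations produce n - 1 merges, each described by 4 columns.
print(Z_.shape)

# Columns 0 and 1: indices of the two clusters merged at each step.
merged_pairs = Z_[:, :2].astype(int)
# Column 2: distance between the two merged clusters.
merge_heights = Z_[:, 2]
# Column 3: number of original observations in the new cluster.
new_sizes = Z_[:, 3].astype(int)

print(merged_pairs)
print(merge_heights.round(3))
print(new_sizes)
```

Indices below 12 refer to the original observations, while indices of 12 and above refer to clusters created in earlier rows of the matrix.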

Now pass the above matrix to method fcluster using the below code.

hierarchy.fcluster(Z_, t=0.8, criterion='distance')

Twelve separate clusters are returned because the threshold t is too small to allow any two samples in the data to form a cluster. We can adjust the threshold (t) to form larger clusters, as we will learn in the next subsection.

Python Scipy Fcluster T

The dendrogram can be flattened using scipy.cluster.hierarchy.fcluster, which assigns each original data point to a single flat cluster. With criterion='distance', this assignment is determined by the distance threshold (t), which is the maximum inter-cluster distance permitted.

Through this section, we are continuing the same example that we have used in the above subsection “Python Scipy Fcluster”.

Run the below code after the above subsection codes to know how the threshold (t) works.

hierarchy.fcluster(Z_, t=0.6, criterion='distance')

Run the same code with t=1.0 using the below code.

hierarchy.fcluster(Z_, t=1.0, criterion='distance')

Then with t=3.1:

hierarchy.fcluster(Z_, t=3.1, criterion='distance')

Finally, with t=10:

hierarchy.fcluster(Z_, t=10, criterion='distance')

  • In the first scenario, 12 separate clusters are returned because the threshold t is too low to allow any two samples in the data to form a cluster.
  • In the second scenario, the threshold is high enough for some points to fuse with their closest neighbours. Thus, only 9 clusters are returned in this case.
  • In the third scenario, the threshold is significantly higher and allows many more points to be connected; as a result, 4 clusters are returned.
  • Finally, in the fourth scenario, the threshold is high enough to permit the fusion of all data points, so a single cluster is returned.

This is how to use the threshold (t) to form the cluster.
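As a compact check of the four scenarios, the sketch below loops over the same thresholds and counts the distinct labels that fcluster() returns. The dictionary name cluster_counts is hypothetical, and NumPy is assumed to be available.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X_ = [[0, 0], [0, 2], [2, 0],
      [0, 5], [0, 4], [2, 5],
      [5, 0], [4, 0], [5, 2],
      [5, 5], [4, 5], [5, 4]]

Z_ = hierarchy.ward(distance.pdist(X_))

# Count the distinct flat-cluster labels produced at each threshold.
cluster_counts = {}
for t in (0.6, 1.0, 3.1, 10):
    labels = hierarchy.fcluster(Z_, t=t, criterion='distance')
    cluster_counts[t] = len(np.unique(labels))
    print(f"t={t}: {cluster_counts[t]} clusters")
```

Counting unique labels is usually easier to read than the raw label array when the goal is only to see how the threshold changes the number of clusters.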

Python Scipy Cluster Maxclust

The method fcluster() accepts a parameter criterion that is applied while creating flat clusters. It can be any of the following values.

  • inconsistent: If a cluster node and all of its descendants have an inconsistent value less than or equal to t, then all of its leaf descendants belong to the same flat cluster. Every node is given its own cluster if no non-singleton cluster fulfils this requirement.
  • distance: Forms flat clusters so that the original observations in each flat cluster have a cophenetic distance of at most t.
  • maxclust: Finds a minimum threshold r such that the cophenetic distance between any two original observations in the same flat cluster is no more than r and no more than t flat clusters are formed.
  • monocrit: Forms a flat cluster from a cluster node c with index i when monocrit[j] <= t.
  • maxclust_monocrit: Forms a flat cluster from a non-singleton cluster node c when monocrit[i] <= r for all cluster indices i below and including c. The value r is minimized such that no more than t flat clusters are formed. The monocrit vector must be monotonic.
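The monocrit criterion is the least obvious of the five. As a minimal sketch, assuming the twelve-point X_ from the subsection “Python Scipy Fcluster”, the statistic can be taken from hierarchy.maxdists(), which returns for each non-singleton cluster the largest merge distance found in that cluster and its descendants. Thresholding this vector is expected to reproduce the distance criterion; the variable names below are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X_ = [[0, 0], [0, 2], [2, 0],
      [0, 5], [0, 4], [2, 5],
      [5, 0], [4, 0], [5, 2],
      [5, 5], [4, 5], [5, 4]]

Z_ = hierarchy.ward(distance.pdist(X_))

# One monotonic statistic per non-singleton cluster (length n - 1).
MD = hierarchy.maxdists(Z_)

# Threshold the custom statistic, then compare with the built-in criterion.
by_monocrit = hierarchy.fcluster(Z_, t=3.1, criterion='monocrit', monocrit=MD)
by_distance = hierarchy.fcluster(Z_, t=3.1, criterion='distance')

print(by_monocrit)
print(np.array_equal(by_monocrit, by_distance))
```

Any other monotonic per-cluster statistic could be passed in the same way, which is what makes monocrit useful for custom cut rules.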

Remember from the subsection “Python Scipy Fcluster” that for the ‘maxclust’ and ‘maxclust_monocrit’ criteria, the parameter t is the maximum number of clusters requested.

Here we will directly use the same code that we have used in the above subsection “Python Scipy Fcluster”.

Suppose we need to form 5 clusters. Then the value of t will be 5 and the criterion will be maxclust, as shown in the below code.

hierarchy.fcluster(Z_, t=5, criterion='maxclust')

The above output contains five clusters: label 2 is assigned to two points, label 3 to one point, and labels 5, 1 and 4 to three points each.

This is how to use the value maxclust for the criterion with a parameter t to get the number of required clusters.
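The cluster sizes can also be read off programmatically rather than by eye. The sketch below assumes the twelve-point X_ from the earlier subsection and uses np.unique() with return_counts=True; the names cluster_ids and sizes are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X_ = [[0, 0], [0, 2], [2, 0],
      [0, 5], [0, 4], [2, 5],
      [5, 0], [4, 0], [5, 2],
      [5, 5], [4, 5], [5, 4]]

Z_ = hierarchy.ward(distance.pdist(X_))

# Request at most 5 flat clusters.
labels = hierarchy.fcluster(Z_, t=5, criterion='maxclust')

# Each distinct label together with the number of points carrying it.
cluster_ids, sizes = np.unique(labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"cluster {cluster_id}: {size} point(s)")
```

Note that maxclust only guarantees that no more than t clusters are formed, so on other data the result may contain fewer clusters than requested.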

Python Scipy Cluster Inconsistent

We already know from the above subsection that the method fcluster() accepts a parameter criterion that is applied while creating flat clusters. One of the values it accepts is inconsistent, which is also the default.

With this criterion, if a cluster node and all of its descendants have an inconsistent value less than or equal to t, then all of the node’s leaf descendants are members of the same flat cluster. When no non-singleton cluster satisfies this requirement, each node is given its own cluster.

Let’s see an example by following the below steps.

Import the required libraries or methods using the below python code.

from scipy.cluster import hierarchy
from scipy.spatial import distance

Create an array X_ of twelve two-dimensional points. This tutorial treats them as start and end distance points of US states such as Alabama (0,0 to 0,2), California (0,2 to 2,0), Florida (2,0 to 0,3), Georgia (0,3 to 0,2), Hawaii (0,2 to 2,5), and so on for Indiana, Kentucky, Montana, Nevada, New Jersey and New York. Then build the linkage matrix Z_ with ward() using the below code.

X_ = [[0, 0], [0, 2], [2, 0],
     [0, 3], [0, 2], [2, 5],
     [3, 0], [4, 0], [5, 2],
     [5, 5], [4, 5], [5, 4]]
Z_ = hierarchy.ward(distance.pdist(X_))

Now pass the above linkage matrix to the method fcluster() with criterion equal to inconsistent using the below code.

hierarchy.fcluster(Z_, t= 0.9, criterion='inconsistent')
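To see what the threshold is compared against, the inconsistency matrix can be computed explicitly with hierarchy.inconsistent() and handed to fcluster() through the R parameter. This is a sketch under the assumption that the default depth of 2 is wanted; the names labels_auto and labels_with_R are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X_ = [[0, 0], [0, 2], [2, 0],
      [0, 3], [0, 2], [2, 5],
      [3, 0], [4, 0], [5, 2],
      [5, 5], [4, 5], [5, 4]]
Z_ = hierarchy.ward(distance.pdist(X_))

# One row per merge: mean and standard deviation of the merge heights
# considered, the number of links included, and the inconsistency coefficient.
R = hierarchy.inconsistent(Z_, d=2)
print(R.round(3))

# Let fcluster() compute R internally, then pass the precomputed R instead.
labels_auto = hierarchy.fcluster(Z_, t=0.9, criterion='inconsistent')
labels_with_R = hierarchy.fcluster(Z_, t=0.9, criterion='inconsistent', R=R)

print(labels_auto)
print(np.array_equal(labels_auto, labels_with_R))
```

Precomputing R is mainly useful when several thresholds are tried on the same linkage matrix, since the matrix does not depend on t.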

Python Scipy Fcluster Data

The method fclusterdata() in the module scipy.cluster.hierarchy of Python Scipy clusters observation data using a given metric.

With its default settings, it takes the n-by-m data matrix X (n observations in m dimensions), computes the distances between the original observations using the Euclidean metric, performs hierarchical clustering using the single linkage algorithm, and forms flat clusters using the inconsistency method with t as the cut-off threshold.

The syntax is given below.

scipy.cluster.hierarchy.fclusterdata(X, t, criterion='inconsistent', metric='euclidean', depth=2, method='single', R=None)

Where parameters are:

  • X(ndarray (N, M)): The N by M data matrix, with N observations in M dimensions.
  • t(scalar): For the ‘inconsistent’, ‘distance’ or ‘monocrit’ criteria, this is the threshold applied when forming flat clusters. For the ‘maxclust’ or ‘maxclust_monocrit’ criteria, it is the maximum number of clusters requested.
  • criterion(string): The criterion used to form flat clusters. It can be any of the following values: inconsistent, distance, maxclust, monocrit and maxclust_monocrit.
  • metric(string): The distance metric used to compute pairwise distances.
  • depth(int): The maximum depth at which the inconsistency calculation is performed. It has no meaning for the other criteria. The default is 2.
  • method(string): The linkage method to use (single, complete, average, weighted, median, centroid, ward). The default is single.
  • R(ndarray): The inconsistency matrix to use for the ‘inconsistent’ criterion. If not given, this matrix is computed.

The method fclusterdata() returns fclusterdata, a vector of length n in which T[i] is the flat cluster number to which the original observation i belongs.

Let’s see an example with the same data that we have created in the above subsection “Python Scipy Cluster Inconsistent” by following the below steps.

Import the required libraries or methods using the below python code.

from scipy.cluster import hierarchy

Create an array X_ of twelve two-dimensional points. This tutorial treats them as start and end distance points of US states such as Alabama (0,0 to 0,2), California (0,2 to 2,0), Florida (2,0 to 0,3), Georgia (0,3 to 0,2), Hawaii (0,2 to 2,5), and so on for Indiana, Kentucky, Montana, Nevada, New Jersey and New York.

X_ = [[0, 0], [0, 2], [2, 0],
     [0, 3], [0, 2], [2, 5],
     [3, 0], [4, 0], [5, 2],
     [5, 5], [4, 5], [5, 4]]

Use scipy.cluster.hierarchy.fclusterdata() to find flat clusters with a user-specified threshold t = 1.0. Because the default criterion is ‘inconsistent’, t is compared against inconsistency coefficients here.

hierarchy.fclusterdata(X_, t=1.0)

In the above output, four clusters are the result for dataset X_ with threshold t = 1.0.

The convenience method fclusterdata() wraps all the steps of a typical SciPy hierarchical clustering workflow, which we performed manually in the subsection “Python Scipy Fcluster”:

  • Using scipy.spatial.distance.pdist, create a condensed distance matrix from the provided data.
  • Build the linkage matrix with a linkage method such as ward().
  • Using scipy.cluster.hierarchy.fcluster, find flat clusters with a user-defined threshold t.

All the above three steps can be done using the method fclusterdata().
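The equivalence can be checked directly. The sketch below runs fclusterdata() with its defaults and then repeats the three steps by hand, assuming the same defaults (Euclidean metric, single linkage, ‘inconsistent’ criterion, depth 2). The names flat_direct and flat_manual are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X_ = [[0, 0], [0, 2], [2, 0],
      [0, 3], [0, 2], [2, 5],
      [3, 0], [4, 0], [5, 2],
      [5, 5], [4, 5], [5, 4]]

# One call that performs the whole workflow with default settings.
flat_direct = hierarchy.fclusterdata(X_, t=1.0)

# Step 1: condensed pairwise-distance matrix.
Y = distance.pdist(X_, metric='euclidean')
# Step 2: linkage matrix (single linkage is the fclusterdata default).
Z = hierarchy.linkage(Y, method='single')
# Step 3: flat clusters with the default 'inconsistent' criterion.
flat_manual = hierarchy.fcluster(Z, t=1.0, criterion='inconsistent', depth=2)

print(flat_direct)
print(np.array_equal(flat_direct, flat_manual))
```

To mirror the Ward-based examples from the earlier subsections instead, method='ward' and criterion='distance' can be passed to fclusterdata().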

We have learned about how to cluster similar data points using “Python Scipy Fcluster”, and get the required number of clusters using the criterion value maxclust. Also, we have covered the following topics.

  • What is clustering?
  • How to create the cluster in Python Scipy
  • Python Scipy Fcluster T
  • How to get the required cluster using the Maxclust
  • Python Scipy Cluster Inconsistent
  • Python Scipy Fcluster Data
