# Python Scipy Cluster Vq

In this Python Scipy tutorial, “Python Scipy Cluster Vq”, we will cluster the given data into categories, or group the sample points, by covering the following topics.

• What is cluster Vq in Scipy?
• How to assign codes to observations from the codebook using the method vq()
• How to cluster the given data using the method kmeans()
• Python Scipy Cluster Vq Whiten

## What is cluster Vq in Scipy?

The `scipy.cluster.vq` module of the `scipy.cluster` package provides routines for k-means clustering, for generating code books from k-means models, and for quantizing vectors by comparing them with the centroids in a code book.

The k-means algorithm takes two inputs: a set of observation vectors to cluster and the desired number of clusters, k. It returns one centroid for each of the k clusters. An observation vector is classified with the cluster number, or centroid index, of the centroid nearest to it.

If a vector v is closer to centroid i than to any other centroid, then v belongs to cluster i, and i is referred to as the dominant centroid of v. The k-means algorithm seeks to minimize distortion, which is defined as the sum of the squared distances between each observation vector and its dominant centroid.

The minimization is carried out by repeatedly reassigning the observations to clusters and recalculating the centroids until a configuration is found in which the centroids are stable. A maximum number of iterations can also be set.
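The loop described above can be sketched in plain NumPy. This is a minimal illustration and not SciPy's implementation: the function name `simple_kmeans`, its parameters `max_iter` and `tol`, and the choice to let an empty cluster keep its previous centroid are all assumptions made for this example.

```python
import numpy as np


def simple_kmeans(obs, init_centroids, max_iter=100, tol=1e-8):
    """Illustrative k-means loop (hypothetical helper, not part of SciPy)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Squared distance from every observation to every centroid: shape (M, k).
        sq_dists = ((obs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Index of the closest (dominant) centroid for each observation.
        codes = sq_dists.argmin(axis=1)
        # Recompute each centroid as the mean of the observations assigned to it.
        new_centroids = centroids.copy()
        for i in range(len(centroids)):
            members = obs[codes == i]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[i] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    # Final assignment and distortion (sum of squared distances) for the result.
    sq_dists = ((obs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    codes = sq_dists.argmin(axis=1)
    distortion = sq_dists[np.arange(len(obs)), codes].sum()
    return centroids, codes, distortion


obs = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centroids, codes, distortion = simple_kmeans(obs, init_centroids=obs[[0, 2]])
print(centroids)
print(codes)
print(distortion)
```

With these four points and one starting centroid taken from each pair, the loop settles after two passes: each centroid ends up at the mean of its pair.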

Because vector quantization is a natural application of k-means, information theory terminology is frequently used. The centroid index or cluster index is also known as a “code,” and the table that maps codes to centroids and vice versa is frequently referred to as a “code book.”

The set of centroids produced by k-means can be used to quantize vectors. The goal of quantization is to find an encoding of vectors that minimizes the expected distortion.

Every routine assumes that obs is an M by N array with the observation vectors in the rows. The codebook is a k by N array whose ith row is the centroid of code word i. The observation vectors and the centroids have the same feature dimension.

## Python Scipy Cluster Vq

Python Scipy has a method `vq()` in the module `scipy.cluster.vq` that assigns each observation a code from a code book. Each observation vector in the “M” by “N” obs array is compared with the centroids in the code book and is assigned the code of the nearest centroid.

The features in obs should have unit variance, which can be achieved by passing them through the whiten function. The code book can be built with the k-means algorithm or with another encoding algorithm.

The syntax is given below.

``scipy.cluster.vq.vq(obs, code_book, check_finite=True)``

Where parameters are:

• obs(ndarray): Each row of the “M” x “N” array is an observation. The columns are the “features” seen during each observation. The features must first be whitened using the whiten function or a similar tool.
• code_book(ndarray): The code book, typically produced by the k-means algorithm. Each row of the array holds a different code, and the columns are the features of the code.
• check_finite(boolean): Whether to check that the input matrices contain only finite numbers. Disabling the check may improve performance, but it may cause problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True

The method `vq()` returns two values of type ndarray: `code` (a length M array holding the code book index for each observation) and `dist` (the distortion, that is, the distance between each observation and its nearest code).

Let’s take an example by following the below steps:

Import the required libraries or methods using the below python code.

``````from scipy import cluster
from numpy import array``````

Create features and codebook using the NumPy array.

``````code_book_ = array([[2.,2.,2.],
[1.,1.,1.]])
features_  = array([[  1.6,2.4,1.8],
[  1.6,2.3,2.3],
[  0.9,0.7,1.6]])``````

Pass the above-created array to the method `vq()` using the below code.

``cluster.vq.vq(features_,code_book_)``
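As a rough check of what the call above returns, the sketch below repeats it and recomputes the same quantities by hand with NumPy. The variable names `codes_`, `dist_`, and `manual_` are introduced here for illustration only.

```python
import numpy as np
from scipy import cluster

code_book_ = np.array([[2., 2., 2.],
                       [1., 1., 1.]])
features_ = np.array([[1.6, 2.4, 1.8],
                      [1.6, 2.3, 2.3],
                      [0.9, 0.7, 1.6]])

# vq() returns the index of the nearest code and the distance to it.
codes_, dist_ = cluster.vq.vq(features_, code_book_)
print(codes_)  # → [0 0 1]
print(dist_)

# Manual check: Euclidean distance from every observation to every code, shape (3, 2).
manual_ = np.linalg.norm(features_[:, None, :] - code_book_[None, :, :], axis=2)
print(manual_.argmin(axis=1))
print(manual_.min(axis=1))
```

The first two observations are closest to the code `[2., 2., 2.]` (index 0) and the third is closest to `[1., 1., 1.]` (index 1), with distances of roughly 0.6, 0.583, and 0.678.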

## Python Scipy Cluster Vq Kmeans

The method `kmeans()` in the module `scipy.cluster.vq` applies k-means to a collection of observation vectors to form k clusters.

The k-means algorithm adjusts the assignment of the observations to clusters and updates the cluster centroids until the positions of the centroids are stable over successive iterations.

In this implementation, the stability of the centroids is determined by comparing the absolute value of the change in the average Euclidean distance between the observations and their corresponding centroids against a threshold.

The result is a code book that maps centroids to codes and vice versa.

The syntax is given below.

``scipy.cluster.vq.kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)``

Where parameters are:

• obs(ndarray): Each row of the M by N array is an observation vector. The columns are the features seen during each observation. The features must first be whitened using the whiten function.
• k_or_guess(int or ndarray): The number of centroids to generate. Each centroid is assigned a code, which is also the row index of the centroid in the generated code book matrix. The initial k centroids are chosen by randomly selecting observations from the observation matrix. Alternatively, passing a k by N array specifies the initial k centroids.
• iter(int): The number of times to run k-means, returning the codebook with the lowest distortion. This argument is ignored if the initial centroids are specified with an array for the `k_or_guess` parameter. This parameter does not represent the number of iterations of the k-means algorithm.
• thresh(float): The k-means algorithm terminates if the change in distortion since the last iteration is less than or equal to this threshold.
• check_finite(boolean): Whether to check that the input matrices contain only finite numbers. Disabling the check may improve performance, but it may cause problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True

The method `kmeans()` returns two values:

• `codebook`: A k by N array of k centroids. The ith centroid, codebook[i], is represented by the code i. The centroids and codes generated represent the lowest distortion seen, which is not necessarily the globally minimal distortion. Note that the number of centroids is not always the same as the `k_or_guess` parameter, because centroids assigned to no observations are removed during iterations.
• `distortion`: The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note that this differs from the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances.
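To make the returned `distortion` more concrete, the following sketch compares it with the mean of the distances that `vq()` reports for the returned code book. It assumes two well-separated groups of synthetic points and passes explicit starting centroids as `k_or_guess` so that the run is repeatable; the seed value and the variable names are arbitrary choices for this illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)  # fixed seed, chosen only to make the sketch repeatable
a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
obs = whiten(np.concatenate((a, b)))

# Passing a k by N array as k_or_guess fixes the starting centroids
# (here: one observation taken from each group).
init = obs[[0, 50]]
codebook, distortion = kmeans(obs, init)

# Distance from each observation to its nearest centroid in the returned code book.
codes, dists = vq(obs, codebook)
print(codebook.shape)
print(distortion, dists.mean())  # mean of non-squared distances
print((dists ** 2).sum())        # sum of squared distances, a different quantity
```

For data this cleanly separated, the two printed means should agree closely, while the sum of squared distances is a noticeably different number.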

Let’s see an example by following the below steps:

Import the required libraries or methods using the below code.

``````from scipy import cluster
import numpy as np
import matplotlib.pyplot as plt``````

Create two groups of 50 data points each, combine them into one feature array, and whiten it using the below code.

``````pts_ = 50
rng_ = np.random.default_rng()
a_ = rng_.multivariate_normal([0, 0], [[3, 1], [1, 3]], size=pts_)
b_ = rng_.multivariate_normal([30, 10],
[[10, 2], [2, 1]],
size=pts_)
features_ = np.concatenate((a_, b_))
# whiten the data
whitened_ = cluster.vq.whiten(features_)``````

Look for two clusters in the data using the below code.

``codebook_, distortion_ = cluster.vq.kmeans(whitened_, 2)``

Plot the whitened data with the centroids in red using the below code.

``````plt.scatter(whitened_[:, 0], whitened_[:, 1])
plt.scatter(codebook_[:, 0], codebook_[:, 1], c='r')
plt.show()``````

## Python Scipy Cluster Vq Whiten

Python Scipy has a method `whiten()` in the module `scipy.cluster.vq` that normalizes a collection of observations on a per-feature basis.

Before running k-means, it is advantageous to rescale each feature dimension of the observation set by its standard deviation (i.e., to “whiten” it, as in “white noise,” where each frequency has equal power). Each feature is divided by its standard deviation across all observations to give it unit variance.

The syntax is given below.

``scipy.cluster.vq.whiten(obs, check_finite=True)``

Where parameters are:

• obs(ndarray): Each row of the array is an observation. The columns are the features seen during each observation.
• check_finite(boolean): Whether to check that the input matrices contain only finite numbers. Disabling the check may improve performance, but it may cause problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True

The method `whiten()` returns `result` (the values in obs scaled by the standard deviation of each column).
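The scaling performed by `whiten()` can be reproduced directly with NumPy. The sketch below uses a small made-up array (the values are illustrative, not real data) and checks that `whiten()` matches dividing each column by its standard deviation.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Illustrative observations: 4 rows (observations) by 3 columns (features).
obs = np.array([[1.0, 10.0, 100.0],
                [2.0, 30.0, 500.0],
                [4.0, 20.0, 300.0],
                [3.0, 40.0, 900.0]])

whitened = whiten(obs)
manual = obs / obs.std(axis=0)  # divide each column by its standard deviation

print(np.allclose(whitened, manual))
print(whitened.std(axis=0))  # each column should now have unit standard deviation
```

Note that only the scale of each feature changes; the column means are not shifted to zero.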

Let’s take an example by following the below steps:

Import the required libraries or methods using the below python code.

``````from scipy import cluster
import numpy as np``````

Create features that represent the literacy rate of the top 9 states in the USA such as `New Hampshire = 94.20%, Minnesota = 94.00%, North Dakota = 93.70%, Vermont = 93.40%, South Dakota = 93.00%, Nebraska = 92.70%, Wisconsin = 92.70%, Maine = 92.60%, Iowa = 92.50%` using the below code.

``````usa_lit_features_  = np.array([[94.2, 94.0, 93.70],
[93.40, 93.0, 92.7],
[92.7, 92.60, 92.50,]])``````

Now whiten the data using the below code.

``cluster.vq.whiten(usa_lit_features_)``

This is how to normalize the collection of observations using the method `whiten()` of Python Scipy.

We have learned how to cluster collections of observations using methods such as `vq()`, `kmeans()`, and `whiten()`, covering the following topics.

• What is cluster Vq in Scipy?
• How to assign codes to observations from the codebook using the method vq()
• How to cluster the given data using the method kmeans()
• Python Scipy Cluster Vq Whiten
