Scikit learn Feature Selection

In this Python tutorial, we will learn about Scikit learn Feature Selection in Python, and we will also cover different examples related to feature selection. We will cover these topics:

  • Scikit learn Feature Selection
  • Scikit learn Feature Selection Pipeline
  • Scikit learn Feature Selection chi2
  • Scikit learn Feature Selection rfe
  • Scikit learn Feature Selection selectkbest
  • Scikit learn Feature Selection Tree-based estimator
  • Scikit learn Feature Selection Classification
  • Scikit learn Feature Selection mutual information
  • Scikit learn Feature Selection PCA

Scikit learn Feature Selection

In this section, we will learn about how Scikit learn Feature Selection works in Python.

  • Feature selection is used when we develop a predictive model; it reduces the number of input variables.
  • It also involves evaluating the relationship between each input variable and the target variable.

Code:

In the following code, we will import VarianceThreshold from sklearn.feature_selection, with which we can select features.

Moreover, select = VarianceThreshold(threshold=(.8 * (1 - .8))) is used to set the variance threshold for feature selection.

from sklearn.feature_selection import VarianceThreshold

# boolean features; the first column is 1 in only one of six samples
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# drop features whose variance is below .8 * (1 - .8) = 0.16
select = VarianceThreshold(threshold=(.8 * (1 - .8)))
select.fit_transform(X)

Output:

After running the above code, we get the following output, in which we can see that the variance threshold removed the first feature, whose variance falls below the threshold. (With the default threshold of 0, VarianceThreshold removes only zero-variance features that have the same value in all samples.)

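As a quick illustration of that default behaviour, here is a minimal sketch (the toy matrix is made up for this example) in which VarianceThreshold with no arguments drops only a constant column:

from sklearn.feature_selection import VarianceThreshold

# toy data: the second column is constant (zero variance)
X = [[0, 1, 0], [0, 1, 1], [1, 1, 0]]
# the default threshold=0 drops only the constant column
print(VarianceThreshold().fit_transform(X))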

Also, read: Scikit-learn Vs Tensorflow

Scikit learn Feature Selection Pipeline

In this section, we will learn about how the Scikit learn Feature Selection Pipeline works in Python.

A pipeline applies a series of steps sequentially. In feature selection it can be used to remove some of the less important features in the training set and keep the best features, which improves the accuracy of the model.

Code:

In the following code, we will import Pipeline from sklearn.pipeline. The pipeline is used to remove some less important features and select the best features, which improves the accuracy.

  • pipeline.fit(x_train, y_train) fits the pipeline on the training data only, which avoids leaking the test set into the training set.
  • pipeline.score(x_test, y_test) is used to calculate the score of the model.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
x, y = make_classification(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

pipeline.fit(x_train, y_train)
pipeline.score(x_test, y_test)

Output:

After running the above code, we get the following output, in which we can see the accuracy score of the model computed through the pipeline.

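Note that the pipeline above only scales and classifies. To make it actually perform feature selection, a selection step such as SelectKBest can be inserted between the scaler and the classifier; here is a minimal sketch (the step names and k=5 are illustrative choices):

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

x, y = make_classification(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# scale -> keep the 5 best features by ANOVA F-score -> classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=5)),  # k=5 is an illustrative choice
    ('svc', SVC()),
])
pipeline.fit(x_train, y_train)
pipeline.score(x_test, y_test)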

Read: Scikit-learn logistic regression

Scikit learn Feature Selection chi2

In this section, we will learn about how Scikit learn Feature Selection chi2 works in Python.

The chi2 (chi-squared) test measures the dependence between each feature and the target. It works only with non-negative feature values, such as booleans or frequency counts.

Code:

In the following code, we will import chi2 from sklearn.feature_selection, which measures the dependence between each feature and the target.

  • X, y = load_iris(return_X_y=True) is used to load the iris data.
  • X.shape is used to inspect the shape of the data.

from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)
X.shape

Output:

After running the above code, we get the following output, in which we can see that the shape of the iris data is printed on the screen.

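The chi2 function imported above returns one score and one p-value per feature; a short sketch of computing them on the iris data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
# higher scores indicate a stronger dependence on the target
scores, pvalues = chi2(X, y)
print(scores)
print(pvalues)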

Read: Scikit learn Decision Tree

Scikit learn Feature Selection rfe

In this section, we will learn about how Scikit learn Feature Selection RFE works in Python.

RFE stands for recursive feature elimination. It relies on an estimator that assigns weights to features and recursively prunes the least important ones; the goal of recursive feature elimination is to choose a smaller set of features.

Code:

In the following code, we will import RFE from sklearn.feature_selection, with which we choose a smaller set of features.

  • digits = load_digits() is used to load the digits dataset.
  • recursivefeatureelimination = RFE(estimator=svc, n_features_to_select=1, step=1) is used to create the RFE object and rank each pixel.
  • plot.title("Ranking of pixels with RFE") is used to give the title to the window.
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plot


digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target


svc = SVC(kernel="linear", C=1)
recursivefeatureelimination = RFE(estimator=svc, n_features_to_select=1, step=1)
recursivefeatureelimination.fit(X, y)
ranking = recursivefeatureelimination.ranking_.reshape(digits.images[0].shape)


plot.matshow(ranking, cmap=plot.cm.Blues)
plot.colorbar()
plot.title("Ranking of pixels with RFE")
plot.show()

Output:

After running the above code, we get the following output, in which we can see that the RFE ranking of each pixel is plotted on the screen.

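With n_features_to_select=1, every pixel ends up with its own rank. To keep an actual subset of features instead, a larger value can be used and the selection read back from support_; a minimal sketch on the same digits data (n_features_to_select=10 is an illustrative choice):

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# keep the 10 most useful pixels (illustrative choice)
rfe = RFE(estimator=SVC(kernel="linear", C=1), n_features_to_select=10, step=1)
X_reduced = rfe.fit_transform(X, y)
print(X_reduced.shape)     # (1797, 10)
print(rfe.support_.sum())  # 10 pixels marked as selected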

Read: Scikit learn accuracy_score

Scikit learn Feature Selection selectkbest

In the following section, we will learn about how Scikit learn Feature Selection SelectKBest works in Python.

  • Before moving forward, we should have some knowledge of SelectKBest.
  • SelectKBest extracts the best features of a given dataset. It selects the features with the k highest scores.

Code:

In the following code, we will import SelectKBest from sklearn.feature_selection, with which we can extract the best features of the dataset.

  • from sklearn.datasets import load_iris is used to load the iris dataset from which we collect the data.
  • X_new = SelectKBest(chi2, k=2).fit_transform(X, y) is used to extract the best features of the dataset.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# load the data and keep the two features with the highest chi2 scores
X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape

Output:

After running the above code, we get the following output, in which we can see that SelectKBest reduced the iris data from four features to the two best ones.

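To see which of the four iris features were kept, get_support() on a fitted selector returns a boolean mask; a short sketch:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())  # True for the two highest-scoring features
print(selector.scores_)        # the chi2 score of every feature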

Read: Scikit learn Hierarchical Clustering

Scikit learn Feature Selection Tree-based estimator

In this section, we will learn about how Scikit learn Feature Selection with a tree-based estimator works in Python.

A tree-based estimator computes impurity-based feature importances, which in turn can be used to discard irrelevant features.

Code:

In the following code, we will import ExtraTreesClassifier from sklearn.ensemble, with which we can compute impurity-based feature importances.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape

After running the above code, we get the following output, in which we can see that the shape of the iris data is printed on the screen.


Here we fit the extra-trees classifier and estimate from the fitted trees how important each feature is.

clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_  

Here SelectFromModel uses the impurity-based importances of the fitted estimator to discard the unimportant features.

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape 
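If needed, the mask of retained features can also be inspected on the selector; a self-contained sketch of the same steps (assuming a recent scikit-learn, where prefit=True accepts an already fitted estimator):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
clf = ExtraTreesClassifier(n_estimators=50).fit(X, y)

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)          # features kept above the mean importance
print(model.get_support())  # boolean mask of the retained features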

Read: Scikit learn Hidden Markov Model

Scikit learn Feature Selection Classification

In this section, we will learn about Scikit learn Feature Selection Classification in Python.

Classification is supervised learning; it is used for sorting different things into different categories.

Code:

In the following code, we will import different libraries, with which we can select features for a classifier.

  • x, y = load_iris(return_X_y=True) is used to load the iris dataset.
  • sequentialfeatureselector = SequentialFeatureSelector(knn, n_features_to_select=3) is used to select the features for the classifier.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
x, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
sequentialfeatureselector = SequentialFeatureSelector(knn, n_features_to_select=3)
sequentialfeatureselector.fit(x, y)

Output:

After running the above code, we get the following output, in which we can see the fitted SequentialFeatureSelector shown on the screen.

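After fitting, the selector knows which three iris features it kept; a short sketch of reading the result back:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3).fit(x, y)
print(sfs.get_support())       # boolean mask of the three retained features
print(sfs.transform(x).shape)  # (150, 3)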

Read: Scikit learn Ridge Regression

Scikit learn Feature Selection mutual information

In this section, we will learn about how Scikit learn Feature Selection with mutual information works in Python.

Mutual information measures the dependency between variables. It is equal to zero if and only if two random variables are independent.

Code:

In the following code, we will import f_regression and mutual_info_regression from sklearn.feature_selection, with which we can score features by mutual information.

  • mutualinfo = mutual_info_regression(x, y) is used to get the mutual information.
  • plot.figure(figsize=(15, 5)) is used to set up the figure.
  • plot.scatter(x[:, i], y, edgecolor="black", s=20) is used to plot the scatter graph.
  • plot.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mutualinfo[i]), fontsize=16) is used to give the title to each subplot.
import numpy as np
import matplotlib.pyplot as plot
from sklearn.feature_selection import f_regression, mutual_info_regression

np.random.seed(0)
x = np.random.rand(1000, 3)
y = x[:, 0] + np.sin(6 * np.pi * x[:, 1]) + 0.1 * np.random.randn(1000)

f_test, _ = f_regression(x, y)
f_test /= np.max(f_test)

mutualinfo = mutual_info_regression(x, y)
mutualinfo /= np.max(mutualinfo)

plot.figure(figsize=(15, 5))
for i in range(3):
    plot.subplot(1, 3, i + 1)
    plot.scatter(x[:, i], y, edgecolor="black", s=20)
    plot.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plot.ylabel("$y$", fontsize=14)
    plot.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mutualinfo[i]), fontsize=16)
plot.show()

Output:

After running the above code, we get the following output, in which we can see the scatter plots, each annotated with its normalized F-test and mutual information scores.

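Mutual information scores can also drive an actual selection step, for example through SelectKBest; a minimal sketch on the same synthetic data (k=2 is an illustrative choice):

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

np.random.seed(0)
x = np.random.rand(1000, 3)
y = x[:, 0] + np.sin(6 * np.pi * x[:, 1]) + 0.1 * np.random.randn(1000)

# keep the two features with the highest mutual information
selector = SelectKBest(mutual_info_regression, k=2).fit(x, y)
print(selector.get_support())       # x_3 is independent of y, so we expect it to be dropped
print(selector.transform(x).shape)  # (1000, 2)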

Read: Scikit learn Random Forest

Scikit learn Feature Selection PCA

In this section, we will learn about how Scikit learn Feature Selection with PCA works in Python.

  • Before moving forward, we should have some knowledge about Scikit learn PCA.
  • PCA stands for principal component analysis; it performs linear dimensionality reduction using singular value decomposition of the data.

Code:

In the following code, we will import PCA from sklearn.decomposition, with which we can reduce the feature space.

  • dataseturl = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv" is used to load the dataset.
  • datanames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] gives the column names for the dataset.
  • PCA(n_components=3) is used for extracting the features.
  • print("Explained Variance: %s" % fit.explained_variance_ratio_) is used to print the explained variance ratio on the screen.

from pandas import read_csv
from sklearn.decomposition import PCA

dataseturl = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
datanames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(dataseturl, names=datanames)
array = dataframe.values
x = array[:, 0:8]  # the 8 input features
y = array[:, 8]    # the class column

pca = PCA(n_components=3)
fit = pca.fit(x)

print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Output:

After running the above code, we get the following output, in which we can see that the explained variance ratio and the principal components are printed on the screen.

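To actually obtain the reduced feature set, the fitted PCA object projects the data onto the retained components with transform; a short self-contained sketch:

from pandas import read_csv
from sklearn.decomposition import PCA

dataseturl = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
datanames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(dataseturl, names=datanames)
x = dataframe.values[:, 0:8]

# fit and project onto 3 principal components in one step
x_reduced = PCA(n_components=3).fit_transform(x)
print(x_reduced.shape)  # (768, 3)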

Also, take a look at some more scikit learn tutorials.

So, in this tutorial, we discussed Scikit learn Feature Selection, and we have also covered different examples related to its implementation. Here is the list of examples that we have covered:

  • Scikit learn Feature Selection
  • Scikit learn Feature Selection Pipeline
  • Scikit learn Feature Selection chi2
  • Scikit learn Feature Selection rfe
  • Scikit learn Feature Selection selectkbest
  • Scikit learn Feature Selection Tree-based estimator
  • Scikit learn Feature Selection Classification
  • Scikit learn Feature Selection mutual information
  • Scikit learn Feature Selection PCA