Scikit learn Cross-Validation [Helpful Guide]

In this Python tutorial, we will learn how Scikit learn cross-validation works in Python, and we will also cover different examples related to Scikit learn cross-validation. These are the topics we will cover.

  • Scikit learn cross-validation
  • Scikit learn cross-validation score
  • Scikit learn Cross-validation lasso
  • Scikit learn cross-validation predict
  • Scikit learn cross-validation time series
  • Scikit learn cross-validation split
  • Scikit learn cross-validation confusion matrix
  • Scikit learn cross-validation hyperparameter
  • Scikit learn cross-validation shuffle
  • Scikit learn cross-validation grid search

Scikit learn Cross-validation

In this section, we will learn about how Scikit learn cross-validation works in Python.

Cross-validation is defined as a process in which we train our model on one subset of a dataset and then evaluate it on the complementary, held-out subset.

Code:

In the following code, we will import some libraries from which we train our model and also evaluate that.

  • x, y = datasets.load_iris(return_X_y=True) is used to load the dataset.
  • x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0) is used to split the dataset into train data and test data.
  • x_train.shape, y_train.shape is used to evaluate the shape of the train model.
  • classifier = svm.SVC(kernel='linear', C=1).fit(x_train, y_train) is used to fit the model.
  • scores = cross_val_score(classifier, x, y, cv=7) is used to calculate the cross value score.
import numpy as num
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

x, y = datasets.load_iris(return_X_y=True)
x.shape, y.shape
x_train, x_test, y_train, y_test = train_test_split(
       x, y, test_size=0.4, random_state=0)

x_train.shape, y_train.shape
x_test.shape, y_test.shape
classifier = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
classifier.score(x_test, y_test)
from sklearn.model_selection import cross_val_score
classifier = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(classifier, x, y, cv=7)
print(scores)

Output:

After running the above code, we get the following output, in which we can see that the cross-validation scores are printed on the screen as an array.

[Image: Scikit learn cross-validation]
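A single array of fold scores can be hard to compare across models, so it is common to summarize it. The sketch below uses the same iris/SVC setup as above and reports the mean accuracy and its standard deviation:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

x, y = datasets.load_iris(return_X_y=True)
classifier = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(classifier, x, y, cv=7)

# Report the average accuracy and its spread across the 7 folds
print("%.3f accuracy with a standard deviation of %.3f"
      % (scores.mean(), scores.std()))
```

The standard deviation gives a rough idea of how stable the estimate is across folds.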

Read: Scikit-learn Vs Tensorflow

Scikit learn cross-validation score

In this section, we will learn about how the Scikit learn cross-validation score works in Python.

The cross-validation score is defined as an estimate of how well the model will perform on new data; it is obtained by scoring the model on each held-out fold.

Code:

In the following code, we will import some libraries from which we can calculate the cross-validation score.

  • diabetes = datasets.load_diabetes() is used to load the data.
  • x = diabetes.data[:170] is used to calculate the diabetes data.
  • print(cross_val_score(lasso, x, y, cv=5)) is used to print the score on the screen.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
x = diabetes.data[:170]
y = diabetes.target[:170]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, x, y, cv=5))

Output:

After running the above code, we get the following output in which we can see that the cross-validation score is printed on the screen.

[Image: Scikit learn cross-validation score]
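By default, cross_val_score uses the estimator's own score method (R² for Lasso). The scoring parameter swaps in another metric from sklearn.metrics; here is a short sketch of the same diabetes setup scored with mean absolute error:

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
x = diabetes.data[:170]
y = diabetes.target[:170]
lasso = linear_model.Lasso()

# scikit-learn reports errors as negative scores so that
# "greater is better" holds for every metric
mae_scores = cross_val_score(lasso, x, y, cv=5,
                             scoring='neg_mean_absolute_error')
print(-mae_scores)  # negate to get the actual per-fold MAE
```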

Read: Scikit learn Decision Tree

Scikit learn Cross-validation lasso

In this section, we will learn about how Scikit learn cross-validation lasso works in Python.

Lasso stands for least absolute shrinkage and selection operator. LassoCV uses cross-validation to choose the weight of the penalty term (the regularization strength).

Code:

In the following code, we will import some libraries from which we can calculate the cross-validation lasso score.

  • x, y = make_regression(noise=5, random_state=0) is used to make or generate the regression.
  • regression = LassoCV(cv=7, random_state=0).fit(x, y) is used to fit the lasso model.
  • regression.score(x, y) is used to calculate lasso score.

from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
x, y = make_regression(noise=5, random_state=0)
regression = LassoCV(cv=7, random_state=0).fit(x, y)
regression.score(x, y)

Output:

In the following output, we can see that the lasso score is calculated and the result is printed on the screen.

[Image: Scikit learn cross-validation lasso score]
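Beyond the score, the fitted LassoCV object exposes the penalty weight it selected. A short sketch of the same setup, inspecting the chosen alpha_ attribute:

```python
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

x, y = make_regression(noise=5, random_state=0)
regression = LassoCV(cv=7, random_state=0).fit(x, y)

# alpha_ is the penalty strength LassoCV selected by cross-validation
print("chosen alpha:", regression.alpha_)
print("R^2 on the training data:", regression.score(x, y))
```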

Read: Scikit learn Hidden Markov Model

Scikit learn cross-validation predict

In this section, we will learn about how Scikit learn cross-validation predict works in Python.

  • The Scikit learn cross-validation predict method returns, for each sample, the prediction that was made while that sample was in the test set, which makes it easy to visualize prediction errors.
  • Cross-validation evaluates the model by using different parts of the data to train and test it.

Code:

In the following code, we will import some libraries from which we can evaluate the prediction through cross-validation.

  • x, y = datasets.load_diabetes(return_X_y=True) is used to load the dataset.
  • predict = cross_val_predict(linearmodel,x, y, cv=10) is used to predict the model and return an array of the same size.
  • fig, axis = plot.subplots() is used to plot the figure on the screen.
  • axis.scatter(y, predict, edgecolors=(0, 0, 0)) is used to plot the scatter plot on the graph.
  • axis.plot([y.min(), y.max()], [y.min(), y.max()], "b--", lw=6) is used to plot the diagonal reference line on the graph.
  • axis.set_xlabel("Measured") is used to plot the x label on the graph.
  • axis.set_ylabel("Predicted") is used to plot the y label on the graph.
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plot

linearmodel = linear_model.LinearRegression()
x, y = datasets.load_diabetes(return_X_y=True)


predict = cross_val_predict(linearmodel,x, y, cv=10)

fig, axis = plot.subplots()
axis.scatter(y, predict, edgecolors=(0, 0, 0))
axis.plot([y.min(), y.max()], [y.min(), y.max()], "b--", lw=6)
axis.set_xlabel("Measured")
axis.set_ylabel("Predicted")
plot.show()

Output:

After running the above code, we get the following output in which we can see that the graph is plotted on the screen with cross-validation prediction.

[Image: Scikit learn cross-validation predict]
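Because cross_val_predict returns one out-of-fold prediction per sample, standard regression metrics can be computed on the full target vector. A minimal sketch using r2_score and mean_absolute_error:

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, mean_absolute_error

x, y = datasets.load_diabetes(return_X_y=True)
linearmodel = linear_model.LinearRegression()

# One out-of-fold prediction per sample, aligned with y
predict = cross_val_predict(linearmodel, x, y, cv=10)
print("R^2:", r2_score(y, predict))
print("MAE:", mean_absolute_error(y, predict))
```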

Read: Scikit learn Hierarchical Clustering

Scikit learn cross-validation time series

In this section, we will learn about how Scikit learn cross-validation time series works in Python.

  • Scikit learn cross-validation time series is defined as a series of test sets, each consisting of a single observation (or a small block of observations).
  • The training set consists only of observations that occur earlier in time than the observations that form the test set.
  • In time series cross-validation, no future observations are used in constructing the forecast.

Code:

In the following code, we will import some libraries from which we can see how the data can be split through time series.

  • x = num.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]]) is used to give the value to the x.
  • y = num.array([1, 2, 3, 4, 5, 6]) is used to give the value to the y.
  • print(timeseriescv) is used to print the time series cross-validation object.
  • x = num.random.randn(12, 2) is used to generate 12 samples, with the test size then fixed to 2.
  • print("TRAIN:", train_index, "TEST:", test_index) is used to print the train and test indices.
import numpy as num
from sklearn.model_selection import TimeSeriesSplit
x = num.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = num.array([1, 2, 3, 4, 5, 6])
timeseriescv = TimeSeriesSplit()
print(timeseriescv)
for train_index, test_index in timeseriescv.split(x):
     print("TRAIN:", train_index, "TEST:", test_index)
     x_train, x_test = x[train_index], x[test_index]
     y_train, y_test = y[train_index], y[test_index]

x = num.random.randn(12, 2)
y = num.random.randint(0, 2, 12)
timeseriescv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_index, test_index in timeseriescv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:

In the following output, we can see that the train and test data is split with the time series cross-validation.

[Image: Scikit learn cross-validation time series]
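When nearby observations are correlated, it can help to leave a buffer between the training and test sets. TimeSeriesSplit supports this via the gap parameter (available in scikit-learn 0.24 and later); a small sketch:

```python
import numpy as num
from sklearn.model_selection import TimeSeriesSplit

x = num.random.randn(12, 2)

# gap leaves a buffer of samples between each training set and its test set,
# which helps when observations close in time are correlated
timeseriescv = TimeSeriesSplit(n_splits=3, test_size=2, gap=2)
for train_index, test_index in timeseriescv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
```

With 12 samples, the three test sets are the final consecutive pairs, and each training set stops two samples before its test set begins.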

Read: Scikit learn Ridge Regression

Scikit learn cross-validation split

In this section, we will learn about how Scikit learn cross-validation split works in Python.

  • Cross-validation is defined as a process that is used to evaluate the model on a finite data sample.
  • The cross-validation data can be split into a number of groups with a single parameter called K.

Code:

In the following code, we will import some libraries from which the model can be split into a number of groups.

  • num.random.seed(1338) is used to generate random numbers.
  • n_splits = 6 is used to split the data.
  • percentiles_classes = [0.1, 0.3, 0.6] is used to generate the class data.
  • groups = num.hstack([[ii] * 10 for ii in range(10)]) is used to split the data evenly into groups.
  • fig, axis = plot.subplots() is used to plot the figure.
  • axis.scatter() is used to plot the scatter plot.
  • axis.set_title("{}".format(type(cv).__name__), fontsize=15) is used to give the title to the graph.
from sklearn.model_selection import (
    TimeSeriesSplit,
    KFold,
    ShuffleSplit,
    StratifiedKFold,
    GroupShuffleSplit,
    GroupKFold,
    StratifiedShuffleSplit,
    StratifiedGroupKFold,
)
import numpy as num
import matplotlib.pyplot as plot
from matplotlib.patches import Patch

num.random.seed(1338)
cmapdata = plot.cm.Paired
cmapcv = plot.cm.coolwarm
n_splits = 6
n_points = 100
x = num.random.randn(100, 10)

percentiles_classes = [0.1, 0.3, 0.6]
y = num.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])

groups = num.hstack([[ii] * 10 for ii in range(10)])


def visualize_groups(classes, groups, name):
    # Visualize dataset groups
    fig, axis = plot.subplots()
    axis.scatter(
        range(len(groups)),
        [0.5] * len(groups),
        c=groups,
        marker="_",
        lw=50,
        cmap=cmapdata,
    )
    axis.scatter(
        range(len(groups)),
        [3.5] * len(groups),
        c=classes,
        marker="_",
        lw=50,
        cmap=cmapdata,
    )
    axis.set(
        ylim=[-1, 5],
        yticks=[0.5, 3.5],
        yticklabels=["Data\ngroup", "Data\nclass"],
        xlabel="Sample index",
    )


visualize_groups(y, groups, "nogroups")
def plot_cv_indices(cv, x, y, group, axis, n_splits, lw=10):
    """Create a sample plot for indices of a cross-validation object."""

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=x, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = num.array([num.nan] * len(x))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        axis.scatter(
            range(len(indices)),
            [ii + 0.5] * len(indices),
            c=indices,
            marker="_",
            lw=lw,
            cmap=cmapcv,
            vmin=-0.2,
            vmax=1.2,
        )

    axis.scatter(
        range(len(x)), [ii + 1.5] * len(x), c=y, marker="_", lw=lw, cmap=cmapdata
    )

    axis.scatter(
        range(len(x)), [ii + 2.5] * len(x), c=group, marker="_", lw=lw, cmap=cmapdata
    )

    # Formatting
    yticklabels = list(range(n_splits)) + ["class", "group"]
    axis.set(
        yticks=num.arange(n_splits + 2) + 0.5,
        yticklabels=yticklabels,
        xlabel="Sample index",
        ylabel="CV iteration",
        ylim=[n_splits + 2.2, -0.2],
        xlim=[0, 100],
    )
    axis.set_title("{}".format(type(cv).__name__), fontsize=15)
    return axis
fig, axis = plot.subplots()
cv = KFold(n_splits)
plot_cv_indices(cv, x, y, groups, axis, n_splits)

Output:

After running the above code we get the following output in which we can see that the scikit learn cross-validation split is shown on the screen.

[Image: Scikit learn cross-validation split]
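For classification data with imbalanced classes, the choice of splitter matters: StratifiedKFold keeps the class proportions inside every fold, while plain KFold only cuts the data into contiguous blocks. A small sketch with illustrative, synthetic labels:

```python
import numpy as num
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced labels: 10 samples of class 0, 90 of class 1 (illustrative data)
y = num.array([0] * 10 + [1] * 90)
x = num.random.randn(100, 3)

# StratifiedKFold keeps the 10%/90% class ratio inside every fold,
# while plain KFold only cuts the data into contiguous blocks
for name, cv in [("KFold", KFold(n_splits=5)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    for train_index, test_index in cv.split(x, y):
        print(name, "test class counts:",
              num.bincount(y[test_index], minlength=2))
```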

Read: Scikit learn Feature Selection

Scikit learn cross-validation confusion matrix

In this section, we will learn how the Scikit learn cross-validation confusion matrix works in Python.

A cross-validation confusion matrix is defined as an evaluation matrix from which we can estimate the performance of the model.

Code:

In the following code, we will import some libraries from which we can evaluate the model performance.

  • iris = datasets.load_iris() is used to load the iris data.
  • print(iris.DESCR) is used to print the iris data.
  • predicted_targets = num.array([]) is used to predict the target value model.
  • actual_targets = num.array([]) is used to get the actual target value.
  • classifiers = svm.SVC().fit(train_x, train_y) is used to fit the classifier.
  • predicted_labels = classifiers.predict(test_x) is used to predict the label of the test set.
import itertools
import matplotlib.pyplot as plot
import numpy as num
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
iris = datasets.load_iris()
data = iris.data
target = iris.target
classnames = iris.target_names
classnames
labels, counts = num.unique(target, return_counts=True)
print(iris.DESCR)
def evaluate_model(data_x, data_y):
    k_fold = KFold(10, shuffle=True, random_state=1)

    predicted_targets = num.array([])
    actual_targets = num.array([])

    for train_ix, test_ix in k_fold.split(data_x):
        train_x, train_y, test_x, test_y = data_x[train_ix], data_y[train_ix], data_x[test_ix], data_y[test_ix]

        classifiers = svm.SVC().fit(train_x, train_y)
        predicted_labels = classifiers.predict(test_x)

        predicted_targets = num.append(predicted_targets, predicted_labels)
        actual_targets = num.append(actual_targets, test_y)

    return predicted_targets, actual_targets
  
[Image: Scikit learn cross-validation confusion matrix data]

In this part of the code, we will generate the normalized confusion matrix.

  • plot.imshow(cnf_matrix, interpolation='nearest', cmap=plot.get_cmap('Blues')) is used to plot the matrix.
  • plot.title(title) is used to plot the title on the graph.
  • plot.xticks(tick_marks, classes, rotation=45) is used to plot the x ticks.
  • plot.ylabel('True label') is used to plot the y label on the graph.
  • plot.xlabel('Predicted label') is used to plot the x label on the graph.
  • plot_confusion_matrix(predicted_target, actual_target) is used to plot the confusion matrix on the screen.
def plot_confusion_matrix(predicted_labels_list, y_test_list):
    cnf_matrix = confusion_matrix(y_test_list, predicted_labels_list)
    num.set_printoptions(precision=2)

   
    plot.figure()
    generate_confusion_matrix(cnf_matrix, classes=classnames, normalize=True, title='Normalized confusion matrix')
    plot.show()
def generate_confusion_matrix(cnf_matrix, classes, normalize=False, title='Confusion matrix'):
    if normalize:
        cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, num.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plot.imshow(cnf_matrix, interpolation='nearest', cmap=plot.get_cmap('Blues'))
    plot.title(title)
    plot.colorbar()

    tick_marks = num.arange(len(classes))
    plot.xticks(tick_marks, classes, rotation=45)
    plot.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cnf_matrix.max() / 2.

    for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plot.text(j, i, format(cnf_matrix[i, j], fmt), horizontalalignment="center",
                 color="black" if cnf_matrix[i, j] > thresh else "blue")

    plot.tight_layout()
    plot.ylabel('True label')
    plot.xlabel('Predicted label')

    return cnf_matrix
predicted_target, actual_target = evaluate_model(data, target)
plot_confusion_matrix(predicted_target, actual_target)

After running the above code, we get the following output, in which we can see that the confusion matrix is plotted on the screen.

[Image: Scikit learn cross-validation confusion matrix]
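As a shorter alternative to hand-written plotting code, newer scikit-learn versions (0.22+) ship ConfusionMatrixDisplay; combined with cross_val_predict, a single confusion matrix summarizes all folds at once. A hedged sketch:

```python
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plot

iris = datasets.load_iris()

# cross_val_predict gives one out-of-fold prediction per sample, so a single
# confusion matrix summarizes all 10 folds at once
predicted = cross_val_predict(svm.SVC(), iris.data, iris.target, cv=10)
cnf_matrix = confusion_matrix(iris.target, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cnf_matrix,
                              display_labels=iris.target_names)
disp.plot(cmap=plot.cm.Blues)
plot.show()
```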

Read: Scikit learn Linear Regression + Examples

Scikit learn cross-validation hyperparameter

In this section, we will learn about how Scikit learn cross-validation hyperparameter search works in Python.

Cross-validation hyperparameter tuning is defined as a process used to search for the ideal model architecture, while also evaluating the performance of each candidate model.

Code:

In the following code, we will import some libraries from which we can search the ideal model architecture.

  • paramgrid = {'max_depth': [4, 5, 10], 'min_samples_split': [3, 5, 10]} is used to define the parameter grid.
  • x, y = make_classification(n_samples=1000, random_state=0) is used to generate the classification dataset.
  • base_estimator = SVC(gamma='scale') is used to define the base estimator.
  • sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=5, factor=2, max_resources=40, aggressive_elimination=True).fit(x, y) is used to fit the model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  
from sklearn.model_selection import HalvingGridSearchCV
import pandas as pd
paramgrid = {'max_depth': [4, 5, 10],
               'min_samples_split': [3, 5, 10]}
base_estimator = RandomForestClassifier(random_state=0)
x, y = make_classification(n_samples=1000, random_state=0)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=6,
                          factor=2, resource='n_estimators',
                          max_resources=30).fit(x, y)
sh.best_estimator_
# RandomForestClassifier(max_depth=5, n_estimators=24, random_state=0)
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.experimental import enable_halving_search_cv  
from sklearn.model_selection import HalvingGridSearchCV
import pandas as pds
paramgrid= {'kernel': ('linear', 'rbf'),
             'C': [2, 10, 100]}
base_estimator = SVC(gamma='scale')
x, y = make_classification(n_samples=1000)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=6,
                          factor=2, min_resources=20).fit(x, y)
sh.n_resources_
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=5,
                         factor=2, min_resources='exhaust').fit(x, y)
sh.n_resources_
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.experimental import enable_halving_search_cv  
from sklearn.model_selection import HalvingGridSearchCV
import pandas as pds
paramgrid = {'kernel': ('linear', 'rbf'),
              'C': [2, 10, 100]}
base_estimator = SVC(gamma='scale')
x, y = make_classification(n_samples=1000)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=6,
                         factor=2, max_resources=40,
                         aggressive_elimination=False).fit(x, y)
sh.n_resources_
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=5,
                           factor=2,
                           max_resources=40,
                           aggressive_elimination=True,
                           ).fit(x, y)
sh.n_resources_

Output:

In the following output, we can see the Scikit learn cross-validation hyperparameter search, which selects the ideal model, shown on the screen.

[Image: Scikit learn cross-validation hyperparameter]

Read: Scikit learn Hyperparameter Tuning

Scikit learn cross-validation shuffle

In this section, we will learn about how Scikit learn cross-validation shuffle works in Python.

In cross-validation with shuffling, the data samples are first shuffled and only then split into the train and test sets.

Code:

In the following code, we will learn to import some libraries from which we can shuffle the data and after that split it into train and test.

  • x = num.array([[1, 2], [3, 4], [1, 2], [3, 4]]) is used to generate an array.
  • kf = KFold(n_splits=2, shuffle=True, random_state=1) is used to shuffle the samples and split the data.
  • print("TRAIN:", train_index, "TEST:", test_index) is used to print the train and test indices.
import numpy as num
from sklearn.model_selection import KFold
x = num.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = num.array([1, 2, 3, 4])
kf = KFold(n_splits=2, shuffle=True, random_state=1)
kf.get_n_splits(x)
print(kf)
for train_index, test_index in kf.split(x):
     print("TRAIN:", train_index, "TEST:", test_index)
     x_train, x_test = x[train_index], x[test_index]
     y_train, y_test = y[train_index], y[test_index]

Output:

After running the above code, we get the following output, in which we can see that the data is shuffled and then split into train and test sets.

[Image: Scikit learn cross-validation shuffle]
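Scikit-learn also provides a dedicated shuffling splitter: ShuffleSplit draws a fresh random train/test partition on each iteration, so a sample may appear in several test sets. A minimal sketch on the same toy array:

```python
import numpy as num
from sklearn.model_selection import ShuffleSplit

x = num.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = num.array([1, 2, 3, 4])

# ShuffleSplit draws a fresh random train/test partition on every iteration,
# unlike KFold, where each sample appears in the test set exactly once
shufflecv = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_index, test_index in shufflecv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
```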

Read: Scikit learn hidden_layer_sizes

Scikit learn cross-validation grid search

In this section, we will learn about how Scikit learn cross-validation grid search works in Python.

Cross-validation grid search is defined as a process that selects the best combination of parameters out of all the combinations in a parameter grid.

Code:

In the following code, we will import some libraries from which we can select the best parameter from the grid.

  • iris = datasets.load_iris() is used to load the iris dataset.
  • parameters = {'kernel':('linear', 'rbf'), 'C':[1, 12]} is used to define the parameters.
  • classifier.fit(iris.data, iris.target) is used to fit the model.
  • sorted(classifier.cv_results_.keys()) is used to sort the keys of the classifier's results.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 12]}
svc = svm.SVC()
classifier = GridSearchCV(svc, parameters)
classifier.fit(iris.data, iris.target)
sorted(classifier.cv_results_.keys())

Output:

In the following output, we can see the best parameters, searched from the parameter grid, shown on the screen.

[Image: Scikit learn cross-validation grid search]
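After fitting, the best parameter combination and its mean cross-validation score are stored on the GridSearchCV object itself. A small sketch of the same iris setup:

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 12]}
classifier = GridSearchCV(svm.SVC(), parameters)
classifier.fit(iris.data, iris.target)

# The winning combination and its mean CV accuracy are stored
# on the fitted search object
print("best parameters:", classifier.best_params_)
print("best cross-validation score:", classifier.best_score_)
```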

Also, take a look at some more Scikit learn tutorials.

So, in this tutorial, we discussed Scikit learn cross-validation, and we have also covered different examples related to its implementation. Here is the list of examples that we have covered.

  • Scikit learn cross-validation
  • Scikit learn cross-validation score
  • Scikit learn Cross-validation lasso
  • Scikit learn cross-validation predict
  • Scikit learn cross-validation time series
  • Scikit learn cross-validation split
  • Scikit learn cross-validation confusion matrix
  • Scikit learn cross-validation hyperparameter
  • Scikit learn cross-validation shuffle
  • Scikit learn cross-validation grid search