In this Python tutorial, we will learn how Scikit learn cross-validation works in Python, and we will also cover different examples related to Scikit learn cross-validation. We will cover these topics:
- Scikit learn cross-validation
- Scikit learn cross-validation score
- Scikit learn Cross-validation lasso
- Scikit learn cross-validation predict
- Scikit learn cross-validation time series
- Scikit learn cross-validation split
- Scikit learn cross-validation confusion matrix
- Scikit learn cross-validation hyperparameter
- Scikit learn cross-validation shuffle
- Scikit learn cross-validation grid search
Scikit learn Cross-validation
In this section, we will learn how Scikit learn cross-validation works in Python.
Cross-validation is a process in which we train a model on one part of a dataset and evaluate it on a held-out part. In k-fold cross-validation this is repeated over k splits (folds), so that every sample is used for evaluation exactly once.
Code:
In the following code, we will import some libraries, train a model on a train/test split, and then evaluate it with cross-validation.
- x, y = datasets.load_iris(return_X_y=True) is used to load the dataset.
- x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0) is used to split the dataset into train data and test data.
- x_train.shape, y_train.shape is used to check the shape of the training data.
- classifier = svm.SVC(kernel='linear', C=1).fit(x_train, y_train) is used to fit the model.
- scores = cross_val_score(classifier, x, y, cv=7) is used to calculate the cross-validation score for each of the 7 folds.
import numpy as num
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score, train_test_split

# Load the iris data as a feature matrix x and a target vector y
x, y = datasets.load_iris(return_X_y=True)
x.shape, y.shape
# Hold out 40% of the samples as a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0)
x_train.shape, y_train.shape
x_test.shape, y_test.shape
# Fit on the training split and score on the held-out split
classifier = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
classifier.score(x_test, y_test)
# Evaluate an unfitted classifier with 7-fold cross-validation on all the data
classifier = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(classifier, x, y, cv=7)
print(scores)
Output:
After running the above code, we get the following output, in which the cross-validation scores are printed on the screen as an array with one value per fold.
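The array of per-fold scores is usually summarized by its mean and standard deviation. The following is a minimal sketch of that step, reusing the classifier settings from the example above. The way the summary is formatted is our own choice for illustration, not something prescribed by the library.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

x, y = datasets.load_iris(return_X_y=True)
classifier = svm.SVC(kernel="linear", C=1, random_state=42)

# One accuracy value per fold (7 folds, as in the example above)
scores = cross_val_score(classifier, x, y, cv=7)

# Summarize the folds with their mean and spread
print("Scores per fold:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

A small standard deviation suggests that the score does not depend much on which samples ended up in which fold.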
Scikit learn cross-validation score
In this section, we will learn how the Scikit learn cross-validation score works in Python.
The cross-validation score estimates how well the model performs on new data. The cross_val_score function fits the model on each training split and returns one score per held-out fold.
Code:
In the following code, we will import some libraries from which we can calculate the cross-validation score.
- diabetes = datasets.load_diabetes() is used to load the data.
- x = diabetes.data[:170] is used to take the first 170 samples of the diabetes data.
- print(cross_val_score(lasso, x, y, cv=5)) is used to print the score of each of the 5 folds on the screen.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
x = diabetes.data[:170]
y = diabetes.target[:170]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, x, y, cv=5))
Output:
After running the above code, we get the following output in which we can see that the cross-validation score is printed on the screen.
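For a regressor such as Lasso, cross_val_score reports the R² of each fold unless told otherwise. The sketch below shows how the scoring parameter can switch the metric. The choice of "neg_mean_squared_error" is only an illustrative assumption; any scorer name accepted by scikit-learn would work the same way.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
x = diabetes.data[:170]
y = diabetes.target[:170]
lasso = linear_model.Lasso()

# Default scoring for a regressor is its own .score method, i.e. R^2
r2_scores = cross_val_score(lasso, x, y, cv=5)

# Error metrics are negated by scikit-learn so that higher is always better
neg_mse_scores = cross_val_score(
    lasso, x, y, cv=5, scoring="neg_mean_squared_error")
mse_scores = -neg_mse_scores

print("R^2 per fold:", r2_scores)
print("MSE per fold:", mse_scores)
```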
Scikit learn Cross-validation lasso
In this section, we will learn how Scikit learn cross-validation lasso works in Python.
Lasso stands for least absolute shrinkage and selection operator. LassoCV uses cross-validation to determine the weight of the penalty term (alpha).
Code:
In the following code, we will import some libraries from which we can calculate the cross-validation lasso score.
- x, y = make_regression(noise=5, random_state=0) is used to generate a random regression problem.
- regression = LassoCV(cv=7, random_state=0).fit(x, y) is used to fit the lasso model, choosing alpha with 7-fold cross-validation.
- regression.score(x, y) is used to calculate the R² score of the fitted lasso model on the data it was trained on.
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
x, y = make_regression(noise=5, random_state=0)
regression = LassoCV(cv=7, random_state=0).fit(x, y)
regression.score(x, y)
Output:
In the following output, we can see that the lasso score is calculated and the result is printed on the screen.
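To see what the cross-validation inside LassoCV actually selected, we can inspect the fitted attributes alpha_, alphas_ and mse_path_. The following is a small sketch that reuses the data from the example above; recomputing the best alpha by hand is only done here to illustrate the selection rule.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

x, y = make_regression(noise=5, random_state=0)
regression = LassoCV(cv=7, random_state=0).fit(x, y)

# The penalty weight chosen by cross-validation
print("Chosen alpha:", regression.alpha_)

# mse_path_ has one row per candidate alpha and one column per fold
print("MSE path shape:", regression.mse_path_.shape)

# Average the error over the folds and pick the alpha with the lowest mean
mean_mse = regression.mse_path_.mean(axis=1)
best_index = mean_mse.argmin()
print("Alpha with lowest mean CV error:", regression.alphas_[best_index])
```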
Scikit learn cross-validation predict
In this section, we will learn how Scikit learn cross-validation predict works in Python.
- The Scikit learn cross_val_predict method returns, for every sample, the prediction made while that sample was in the test fold, so the prediction errors can be visualized.
- Cross-validation is used to evaluate the model, and it uses different parts of the data to train and test the model.
Code:
In the following code, we will import some libraries from which we can evaluate the prediction through cross-validation.
- x, y = datasets.load_diabetes(return_X_y=True) is used to load the dataset.
- predict = cross_val_predict(linearmodel, x, y, cv=10) is used to get the cross-validated predictions and return an array of the same size as y.
- fig, axis = plot.subplots() is used to create the figure and its axes.
- axis.scatter(y, predict, edgecolors=(0, 0, 0)) is used to plot the scatter plot on the graph.
- axis.plot([y.min(), y.max()], [y.min(), y.max()], "b--", lw=6) is used to plot the diagonal reference line on which perfect predictions would fall.
- axis.set_xlabel("Measured") is used to set the x label on the graph.
- axis.set_ylabel("Predicted") is used to set the y label on the graph.
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plot
linearmodel = linear_model.LinearRegression()
x, y = datasets.load_diabetes(return_X_y=True)
predict = cross_val_predict(linearmodel,x, y, cv=10)
fig, axis = plot.subplots()
axis.scatter(y, predict, edgecolors=(0, 0, 0))
axis.plot([y.min(), y.max()], [y.min(), y.max()], "b--", lw=6)
axis.set_xlabel("Measured")
axis.set_ylabel("Predicted")
plot.show()
Output:
After running the above code, we get the following output in which we can see that the graph is plotted on the screen with cross-validation prediction.
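Besides plotting them, the cross-validated predictions can also be summarized numerically. The following sketch, which assumes the same model and data as above, computes R² and the mean squared error over all out-of-fold predictions. Note that a metric computed this way on the pooled predictions can differ slightly from the average of per-fold scores returned by cross_val_score.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict

x, y = datasets.load_diabetes(return_X_y=True)
linearmodel = linear_model.LinearRegression()

# Each entry is predicted by a model that never saw that sample in training
predict = cross_val_predict(linearmodel, x, y, cv=10)

r2 = r2_score(y, predict)
mse = mean_squared_error(y, predict)
print("R^2 over all out-of-fold predictions:", r2)
print("MSE over all out-of-fold predictions:", mse)
```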
Scikit learn cross-validation time series
In this section, we will learn how Scikit learn cross-validation for time series works in Python.
- Scikit learn time series cross-validation uses a series of test sets, each consisting of one or more consecutive observations.
- The training set consists only of observations that occurred before the observations that form the test set.
- In time series cross-validation, no future observations are used in constructing the forecast.
Code:
In the following code, we will import some libraries from which we can see how the data can be split through time series.
- x = num.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]]) is used to give the value to the x.
- y = num.array([1, 2, 3, 4, 5, 6]) is used to give the value to the y.
- print(timeseriescv) is used to print the configuration of the time series cross-validation splitter.
- x = num.random.randn(12, 2) is used to generate 12 random samples, which are then split with the test size fixed to 2.
- print("TRAIN:", train_index, "TEST:", test_index) is used to print the train and test indices.
import numpy as num
from sklearn.model_selection import TimeSeriesSplit

x = num.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = num.array([1, 2, 3, 4, 5, 6])
# Default splitter: every test fold comes after its training data
timeseriescv = TimeSeriesSplit()
print(timeseriescv)
for train_index, test_index in timeseriescv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Fix the test size to 2 with 12 samples
x = num.random.randn(12, 2)
y = num.random.randint(0, 2, 12)
timeseriescv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_index, test_index in timeseriescv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output:
In the following output, we can see that the train and test data are split with time series cross-validation.
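The rule that no future observation leaks into the training set can be checked directly on the indices. The following is a minimal sketch using the same splitter settings as the second example above; the integer data is an arbitrary stand-in, since only the sample positions matter for the split.

```python
import numpy as num
from sklearn.model_selection import TimeSeriesSplit

# 12 samples in time order; the values themselves are placeholders
x = num.arange(24).reshape(12, 2)

timeseriescv = TimeSeriesSplit(n_splits=3, test_size=2)
splits = list(timeseriescv.split(x))

for train_index, test_index in splits:
    # Every training index must come before every test index
    assert train_index.max() < test_index.min()
    print("TRAIN:", train_index, "TEST:", test_index)
```

With these settings the training window grows by two samples at each split while the test window slides forward to the next two samples.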
Scikit learn cross-validation split
In this section, we will learn how Scikit learn cross-validation splits work in Python.
- Cross-validation is a process used to evaluate a model on a finite data sample.
- In k-fold cross-validation, the data is split into a number of groups (folds) that is controlled by a single parameter called k.
Code:
In the following code, we will import some libraries from which the data can be split into a number of groups.
- num.random.seed(1338) is used to seed the random number generator so that the results are reproducible.
- n_splits = 6 is used to set the number of splits.
- percentiles_classes = [0.1, 0.3, 0.6] is used to set the fraction of samples in each class.
- groups = num.hstack([[ii] * 10 for ii in range(10)]) is used to evenly split the samples into 10 groups.
- fig, axis = plot.subplots() is used to create the figure.
- axis.scatter() is used to plot the scatter plot.
- axis.set_title("{}".format(type(cv).__name__), fontsize=15) is used to give the title to the graph.
from sklearn.model_selection import (
TimeSeriesSplit,
KFold,
ShuffleSplit,
StratifiedKFold,
GroupShuffleSplit,
GroupKFold,
StratifiedShuffleSplit,
StratifiedGroupKFold,
)
import numpy as num
import matplotlib.pyplot as plot
from matplotlib.patches import Patch
num.random.seed(1338)
cmapdata = plot.cm.Paired
cmapcv = plot.cm.coolwarm
n_splits = 6
n_points = 100
x = num.random.randn(100, 10)
percentiles_classes = [0.1, 0.3, 0.6]
y = num.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])
groups = num.hstack([[ii] * 10 for ii in range(10)])
def visualize_groups(classes, groups, name):
    # Visualize dataset groups
    fig, axis = plot.subplots()
    axis.scatter(
        range(len(groups)),
        [0.5] * len(groups),
        c=groups,
        marker="_",
        lw=50,
        cmap=cmapdata,
    )
    axis.scatter(
        range(len(groups)),
        [3.5] * len(groups),
        c=classes,
        marker="_",
        lw=50,
        cmap=cmapdata,
    )
    axis.set(
        ylim=[-1, 5],
        yticks=[0.5, 3.5],
        yticklabels=["Data\ngroup", "Data\nclass"],
        xlabel="Sample index",
    )

visualize_groups(y, groups, "nogroups")
def plot_cv_indices(cv, x, y, group, axis, n_splits, lw=10):
    """Create a sample plot for indices of a cross-validation object."""
    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=x, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = num.array([num.nan] * len(x))
        indices[tt] = 1
        indices[tr] = 0
        # Visualize the results
        axis.scatter(
            range(len(indices)),
            [ii + 0.5] * len(indices),
            c=indices,
            marker="_",
            lw=lw,
            cmap=cmapcv,
            vmin=-0.2,
            vmax=1.2,
        )
    # Plot the data classes and groups below the last split
    axis.scatter(
        range(len(x)), [ii + 1.5] * len(x), c=y, marker="_", lw=lw, cmap=cmapdata
    )
    axis.scatter(
        range(len(x)), [ii + 2.5] * len(x), c=group, marker="_", lw=lw, cmap=cmapdata
    )
    # Formatting
    yticklabels = list(range(n_splits)) + ["class", "group"]
    axis.set(
        yticks=num.arange(n_splits + 2) + 0.5,
        yticklabels=yticklabels,
        xlabel="Sample index",
        ylabel="CV iteration",
        ylim=[n_splits + 2.2, -0.2],
        xlim=[0, 100],
    )
    axis.set_title("{}".format(type(cv).__name__), fontsize=15)
    return axis

fig, axis = plot.subplots()
cv = KFold(n_splits)
plot_cv_indices(cv, x, y, groups, axis, n_splits)
plot.show()
Output:
After running the above code we get the following output in which we can see that the scikit learn cross-validation split is shown on the screen.
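The effect of the splitter choice can also be seen without a plot by counting the classes that land in each test fold. The sketch below compares KFold with StratifiedKFold on the same imbalanced labels as above. The helper class_counts_per_test_fold is a hypothetical name introduced for this illustration, and n_splits=5 is our own assumption, chosen so that the class sizes divide evenly.

```python
import numpy as num
from sklearn.model_selection import KFold, StratifiedKFold

# Same class proportions as above: 10, 30 and 60 samples of classes 0, 1, 2
percentiles_classes = [0.1, 0.3, 0.6]
y = num.hstack([[ii] * int(100 * perc)
                for ii, perc in enumerate(percentiles_classes)])
x = num.zeros((len(y), 1))  # feature values do not influence the split


def class_counts_per_test_fold(cv):
    """Return the number of samples of each class in every test fold."""
    return [num.bincount(y[test], minlength=3) for _, test in cv.split(x, y)]


kfold_counts = class_counts_per_test_fold(KFold(n_splits=5))
strat_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold test folds:          ", [c.tolist() for c in kfold_counts])
print("StratifiedKFold test folds:", [c.tolist() for c in strat_counts])
```

Because the labels are sorted, plain KFold produces test folds that miss some classes entirely, while StratifiedKFold keeps the class proportions of the full dataset in every fold.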
Scikit learn cross-validation confusion matrix
In this section, we will learn how the Scikit learn cross-validation confusion matrix works in Python.
A cross-validation confusion matrix is an evaluation matrix, built from the predictions made on the held-out folds, from which we can estimate the performance of the model.
Code:
In the following code, we will import some libraries from which we can evaluate the model performance.
- iris = datasets.load_iris() is used to load the iris data.
- print(iris.DESCR) is used to print the description of the iris dataset.
- predicted_targets = num.array([]) is used to create an empty array that collects the predicted labels from every fold.
- actual_targets = num.array([]) is used to create an empty array that collects the actual labels from every fold.
- classifiers = svm.SVC().fit(train_x, train_y) is used to fit the classifier.
- predicted_labels = classifiers.predict(test_x) is used to predict the label of the test set.
import matplotlib.pyplot as plot
import numpy as num
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

iris = datasets.load_iris()
data = iris.data
target = iris.target
classnames = iris.target_names
classnames
labels, counts = num.unique(target, return_counts=True)
print(iris.DESCR)

def evaluate_model(data_x, data_y):
    k_fold = KFold(10, shuffle=True, random_state=1)
    predicted_targets = num.array([])
    actual_targets = num.array([])
    for train_ix, test_ix in k_fold.split(data_x):
        train_x, train_y, test_x, test_y = data_x[train_ix], data_y[train_ix], data_x[test_ix], data_y[test_ix]
        # Fit on the training folds and predict the held-out fold
        classifiers = svm.SVC().fit(train_x, train_y)
        predicted_labels = classifiers.predict(test_x)
        # Collect predictions and true labels across all folds
        predicted_targets = num.append(predicted_targets, predicted_labels)
        actual_targets = num.append(actual_targets, test_y)
    return predicted_targets, actual_targets
In this part of the code, we will generate the normalized confusion matrix.
- plot.imshow(cnf_matrix, interpolation='nearest', cmap=plot.get_cmap('Blues')) is used to plot the matrix.
- plot.title(title) is used to set the title on the graph.
- plot.xticks(tick_marks, classes, rotation=45) is used to set the x ticks.
- plot.ylabel('True label') is used to set the y label on the graph.
- plot.xlabel('Predicted label') is used to set the x label on the graph.
- plot_confusion_matrix(predicted_target, actual_target) is used to plot the confusion matrix on the screen.
import itertools

def plot_confusion_matrix(predicted_labels_list, y_test_list):
    cnf_matrix = confusion_matrix(y_test_list, predicted_labels_list)
    num.set_printoptions(precision=2)
    plot.figure()
    generate_confusion_matrix(cnf_matrix, classes=classnames, normalize=True, title='Normalized confusion matrix')
    plot.show()

def generate_confusion_matrix(cnf_matrix, classes, normalize=False, title='Confusion matrix'):
    if normalize:
        # Divide each row by its total so every true class sums to 1
        cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, num.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plot.imshow(cnf_matrix, interpolation='nearest', cmap=plot.get_cmap('Blues'))
    plot.title(title)
    plot.colorbar()
    tick_marks = num.arange(len(classes))
    plot.xticks(tick_marks, classes, rotation=45)
    plot.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cnf_matrix.max() / 2.
    # Write the value of every cell on top of the image
    for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plot.text(j, i, format(cnf_matrix[i, j], fmt), horizontalalignment="center",
                  color="black" if cnf_matrix[i, j] > thresh else "blue")
    plot.tight_layout()
    plot.ylabel('True label')
    plot.xlabel('Predicted label')
    return cnf_matrix

predicted_target, actual_target = evaluate_model(data, target)
plot_confusion_matrix(predicted_target, actual_target)
After running the above code, we get the following output, in which we can see that the confusion matrix is plotted on the screen.
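The manual loop over the folds can also be written more compactly with cross_val_predict, which collects the out-of-fold predictions for us. The following is a sketch under the same assumptions as above (SVC with default settings, 10 shuffled folds); it leaves out the plotting and only computes the raw and the row-normalized matrix.

```python
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

iris = datasets.load_iris()
k_fold = KFold(10, shuffle=True, random_state=1)

# Out-of-fold prediction for every sample, returned in the original order
predicted = cross_val_predict(svm.SVC(), iris.data, iris.target, cv=k_fold)

# Rows are true classes, columns are predicted classes
cnf_matrix = confusion_matrix(iris.target, predicted)

# Divide each row by its total so every true class sums to 1
normalized = cnf_matrix / cnf_matrix.sum(axis=1, keepdims=True)

print(cnf_matrix)
print(normalized.round(2))
```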
Scikit learn cross-validation hyperparameter
In this section, we will learn how Scikit learn cross-validation hyperparameter search works in Python.
A hyperparameter search with cross-validation is a process used to find the best model configuration, where cross-validation is used to evaluate the performance of each candidate.
Code:
In the following code, we will import some libraries from which we can search for the best model configuration.
- paramgrid = {'max_depth': [4, 5, 10], 'min_samples_split': [3, 5, 10]} is used to define the parameter grid.
- x, y = make_classification(n_samples=1000, random_state=0) is used to generate a random classification problem.
- base_estimator = SVC(gamma='scale') is used to define the base estimator.
- sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=5, factor=2, max_resources=40, aggressive_elimination=True).fit(x, y) is used to fit the model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# This import enables the experimental HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

# Example 1: use the number of trees as the resource that grows each iteration
paramgrid = {'max_depth': [4, 5, 10],
             'min_samples_split': [3, 5, 10]}
base_estimator = RandomForestClassifier(random_state=0)
x, y = make_classification(n_samples=1000, random_state=0)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=6,
                         factor=2, resource='n_estimators',
                         max_resources=30).fit(x, y)
print(sh.best_estimator_)
# Reported output:
# RandomForestClassifier(max_depth=5, n_estimators=24, random_state=0)

# Example 2: control the number of samples used in the first iteration
paramgrid = {'kernel': ('linear', 'rbf'),
             'C': [2, 10, 100]}
base_estimator = SVC(gamma='scale')
x, y = make_classification(n_samples=1000)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=6,
                         factor=2, min_resources=20).fit(x, y)
print(sh.n_resources_)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=5,
                         factor=2, min_resources='exhaust').fit(x, y)
print(sh.n_resources_)

# Example 3: limit the samples and compare aggressive elimination off and on
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=6,
                         factor=2, max_resources=40,
                         aggressive_elimination=False).fit(x, y)
print(sh.n_resources_)
sh = HalvingGridSearchCV(base_estimator, paramgrid, cv=5,
                         factor=2,
                         max_resources=40,
                         aggressive_elimination=True,
                         ).fit(x, y)
print(sh.n_resources_)
Output:
In the following output, we can see the resources used at each iteration and the model selected by the Scikit learn cross-validation hyperparameter search.
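The idea behind every such search can be sketched by hand: score each candidate value with cross-validation and keep the one with the best mean score. The following example does this for the C parameter of an SVC. It is a simplified illustration, not how HalvingGridSearchCV works internally, and the smaller sample size and fixed rbf kernel are our own assumptions to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

x, y = make_classification(n_samples=200, random_state=0)

# Mean 5-fold accuracy for every candidate value of C
mean_scores = {}
for c_value in [2, 10, 100]:
    model = SVC(kernel="rbf", C=c_value, gamma="scale")
    mean_scores[c_value] = cross_val_score(model, x, y, cv=5).mean()

# Keep the candidate with the highest mean cross-validation score
best_c = max(mean_scores, key=mean_scores.get)
print("Mean score per C:", mean_scores)
print("Best C:", best_c)
```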
Scikit learn cross-validation shuffle
In this section, we will learn how Scikit learn cross-validation shuffle works in Python.
Cross-validation shuffle means that the samples of the data are first shuffled and then split into the train and test sets.
Code:
In the following code, we will import some libraries from which we can shuffle the data and after that split it into train and test.
- x = num.array([[1, 2], [3, 4], [1, 2], [3, 4]]) is used to generate an array.
- kf = KFold(n_splits=2, shuffle=True, random_state=0) is used to shuffle the data and split it into 2 folds. Without shuffle=True, KFold keeps the original order of the samples.
- print("TRAIN:", train_index, "TEST:", test_index) is used to print the train and test indices.
import numpy as num
from sklearn.model_selection import KFold

x = num.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = num.array([1, 2, 3, 4])
# shuffle=True shuffles the sample order before the folds are formed
kf = KFold(n_splits=2, shuffle=True, random_state=0)
kf.get_n_splits(x)
print(kf)
for train_index, test_index in kf.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output:
After running the above code, we get the following output, in which we can see that the data is shuffled and then split into train and test data.
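The difference that shuffling makes is easiest to see by comparing the test folds directly. The sketch below is only an illustration: held_out_folds is a hypothetical helper introduced here, and random_state=0 is an arbitrary seed used to make the shuffle repeatable.

```python
import numpy as num
from sklearn.model_selection import KFold

x = num.arange(8).reshape(4, 2)


def held_out_folds(kf):
    """Return the test indices of every fold as plain lists."""
    return [test_index.tolist() for _, test_index in kf.split(x)]


# Without shuffling, the folds follow the original sample order
ordered = held_out_folds(KFold(n_splits=2))

# With shuffling, a fixed random_state gives the same folds on every run
shuffled_a = held_out_folds(KFold(n_splits=2, shuffle=True, random_state=0))
shuffled_b = held_out_folds(KFold(n_splits=2, shuffle=True, random_state=0))

print("Ordered: ", ordered)
print("Shuffled:", shuffled_a)
```

In both cases every sample still appears in exactly one test fold; shuffling only changes which samples are grouped together.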
Scikit learn cross-validation grid search
In this section, we will learn how Scikit learn cross-validation grid search works in Python.
Cross-validation grid search is a process that evaluates every parameter combination in a grid with cross-validation and selects the best one.
Code:
In the following code, we will import some libraries from which we can select the best parameter from the grid.
- iris = datasets.load_iris() is used to load the iris dataset.
- parameters = {'kernel':('linear', 'rbf'), 'C':[1, 12]} is used to define the parameters.
- classifier.fit(iris.data, iris.target) is used to fit the model.
- sorted(classifier.cv_results_.keys()) is used to list the keys of the cross-validation results in sorted order.
- print(classifier.best_params_) is used to print the best parameter combination found by the search.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 12]}
svc = svm.SVC()
# Try every combination in the grid with cross-validation
classifier = GridSearchCV(svc, parameters)
classifier.fit(iris.data, iris.target)
sorted(classifier.cv_results_.keys())
print(classifier.best_params_)
Output:
In the following output, we can see that the best parameter is shown on the screen which is searched from the parameter grid.
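To see how each candidate in the grid performed, and not only the winner, we can read the per-candidate results from cv_results_. The following is a small sketch using the same grid as above; setting cv=5 explicitly is our own choice for clarity.

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 12]}
classifier = GridSearchCV(svm.SVC(), parameters, cv=5)
classifier.fit(iris.data, iris.target)

# One entry per parameter combination: 2 kernels x 2 values of C
results = classifier.cv_results_
for params, mean_score, rank in zip(
        results["params"], results["mean_test_score"], results["rank_test_score"]):
    print(rank, params, round(mean_score, 3))

# The best score is the highest mean cross-validation score in the grid
print("Best mean CV score:", classifier.best_score_)
```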
Also, take a look at some more Scikit learn tutorials.
- Scikit learn Genetic algorithm
- Scikit learn Classification Tutorial
- Scikit learn Gradient Descent
- Scikit learn Confusion Matrix
- Scikit learn Sentiment Analysis
- Scikit learn Pipeline + Examples
So, in this tutorial, we discussed Scikit learn cross-validation, and we have also covered different examples related to its implementation. Here is the list of examples that we have covered.
- Scikit learn cross-validation
- Scikit learn cross-validation score
- Scikit learn Cross-validation lasso
- Scikit learn cross-validation predict
- Scikit learn cross-validation time series
- Scikit learn cross-validation split
- Scikit learn cross-validation confusion matrix
- Scikit learn cross-validation hyperparameter
- Scikit learn cross-validation shuffle
- Scikit learn cross-validation grid search
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, Scikit-Learn, etc. for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc. Check out my profile.