Scikit learn Split Data

In this Python tutorial, we will learn how Scikit learn Split data works in Python, and we will also cover different examples related to Scikit learn Split data. Moreover, we will cover these topics.

  • Scikit learn Split data
  • Scikit learn Split train test index
  • Scikit learn Split by group
  • Scikit learn Split K fold
  • Scikit learn Split data strategy
  • Scikit learn Split time series
  • Scikit learn Split train test Val

If you are new to Scikit learn, we recommend you read What is Scikit Learn in Python.

Scikit learn Split data

In this section, we will learn about how Scikit learn Split data works in Python.

Scikit learn split data is used to split a data frame into train and test datasets. The train_test_split() function takes the input data and returns the train and test subsets (a small DataFrame example is also shown after the output below).

Code:

In the following code, we import some libraries from which we can split the data frame into train and test datasets.

  • x, y = num.arange(10).reshape((5, 2)), range(5) is used to create the feature array x and the target sequence y.
  • print(x) is used to show the contents of x (a 5 x 2 array of the values 0 to 9).
  • print(list(y)) is used to print the target values as a list on the screen.
  • x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42) is used to split the data into train and test datasets.
  • print(train_test_split(y, shuffle=False)) is used to split the targets without shuffling and print the result.
import numpy as num
from sklearn.model_selection import train_test_split

x, y = num.arange(10).reshape((5, 2)), range(5)
print(x)
# Expected output:
# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]
#  [8 9]]
print(list(y))
# Expected output: [0, 1, 2, 3, 4]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)

print(x_train)
print(train_test_split(y, shuffle=False))

Output:

After running the above code, we get the following output in which we can see that the data is split into train and test datasets.

Scikit learn split data
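Since the explanation above mentions splitting a data frame, here is a minimal sketch of using train_test_split() directly on a pandas DataFrame. The DataFrame and its column names are made up purely for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

# A small, made-up DataFrame used only for illustration
df = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6],
    "feature2": [10, 20, 30, 40, 50, 60],
    "label": [0, 1, 0, 1, 0, 1],
})

# train_test_split accepts DataFrames and preserves the column structure
train_df, test_df = train_test_split(df, test_size=0.33, random_state=42)
print(train_df.shape, test_df.shape)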

Read: Scikit learn Image Processing

Scikit learn Split train test index

In this section, we will learn about how Scikit learn Split train test index works in Python.

Scikit learn split train test index is used to generate the train and test index arrays that define each train test split. The ShuffleSplit.split() function returns these index arrays, which can then be used to slice the data (see the short follow-up after the output below).

Code:

In the following code, we will import some libraries from which we can generate the train and test index split.

  • x = num.array([[2, 3], [4, 5], [6, 7], [8, 9], [4, 5], [6, 7]]) is used to create the array.
  • randomshuffle = ShuffleSplit(n_splits=5, test_size=.25, random_state=0) is used to create the shuffle split object that generates the splits.
  • for train_index, test_index in randomshuffle.split(x): is used to loop over the train and test indices of each split.
  • print("TRAIN:", train_index, "TEST:", test_index) is used to print the train and test indices.
import numpy as num
from sklearn.model_selection import ShuffleSplit
x = num.array([[2, 3], [4, 5], [6, 7], [8, 9], [4, 5], [6, 7]])
y = num.array([1, 2, 1, 2, 1, 2])
randomshuffle = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
randomshuffle.get_n_splits(x)
print(randomshuffle)
for train_index, test_index in randomshuffle.split(x):
       print("TRAIN:", train_index, "TEST:", test_index)

Output:

After running the above code, we get the following output in which we can see that the data is split into train and test indices.

scikit learn split train test index
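As a quick follow-up, the index arrays returned by ShuffleSplit.split() can be used to slice the data. A minimal sketch, reusing the x, y, and randomshuffle objects from the code above, might look like this:

# Each iteration yields one train/test index pair; after the loop the
# variables hold the subsets of the last split
for train_index, test_index in randomshuffle.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
print(x_train.shape, x_test.shape)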

Read: Scikit learn non-linear

Scikit learn Split by group

In this section, we will learn about how Scikit learn split by group works in Python.

  • Scikit learn split by group is used to split the data into train and test sets while keeping track of how the samples are divided between the groups, here the class labels (a GroupShuffleSplit sketch is also shown after the output below).
  • We can use the train_test_split() function to split the data into train and test sets and then count the labels in each set.

Code:

In the following code, we import some libraries from which we can split the data by group.

  • iris = load_iris() is used to load the iris data.
  • x = iris.data is used to get the feature data x.
  • x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.6) is used to split the data into train and test dataset.
  • print(Counter(y_train)) is used to print the class counts of the training set.
  • print(Counter(y_test)) is used to print the class counts of the test set.
from collections import Counter
import numpy as num

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

iris = load_iris()

x = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.6)

print(Counter(y_train))

print(Counter(y_test))

Output:

In the following output, we can see that the class counts of the train set and the test set are printed on the screen.

Scikit learn split by group
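Note that the example above splits by class label rather than by an explicit group column. When samples belong to groups that must not be spread across the train and test sets, scikit-learn provides GroupShuffleSplit. A minimal sketch with made-up data and group labels is shown below:

import numpy as num
from sklearn.model_selection import GroupShuffleSplit

x = num.arange(16).reshape(8, 2)
y = num.array([0, 0, 1, 1, 0, 0, 1, 1])
# Made-up group labels: samples with the same group id always stay together
groups = num.array([1, 1, 2, 2, 3, 3, 4, 4])

gss = GroupShuffleSplit(n_splits=2, test_size=0.25, random_state=0)
for train_index, test_index in gss.split(x, y, groups):
    print("TRAIN groups:", set(groups[train_index]), "TEST groups:", set(groups[test_index]))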

Read: Scikit learn KNN Tutorial

Scikit learn Split K fold

In this section, we will learn about how Scikit learn split K-fold works in Python.

  • Scikit learn split K-fold is used to split the data into K consecutive folds; by default the data is not shuffled.
  • In each fold, the dataset is split into two parts, train data and test data (a short KFold example is shown after the output below).

Code:

In the following code, we will import some libraries from which we can split the dataset into K consecutive folds.

  • num.random.seed(1338) is used to generate the random numbers.
  • n_splits = 6 is used to split the data into six parts.
  • percentiles_classes = [0.1, 0.3, 0.6] is used to define the class proportions used to generate the labels.
  • group = num.hstack([[ii] * 10 for ii in range(10)]) is used to create evenly sized groups.
  • figure, axis = plot.subplots() is used to plot the figure.
  • axis.set() is used to set the axis on the screen.
from sklearn.model_selection import (
    TimeSeriesSplit,
    KFold,
    ShuffleSplit,
    StratifiedKFold,
    GroupShuffleSplit,
    GroupKFold,
    StratifiedShuffleSplit,
    StratifiedGroupKFold,
)
import numpy as num
import matplotlib.pyplot as plot
from matplotlib.patches import Patch

num.random.seed(1338)
cmap_data = plot.cm.Paired
cmap_cv = plot.cm.coolwarm
n_splits = 6

n_points = 100
x = num.random.randn(100, 10)

percentiles_classes = [0.1, 0.3, 0.6]
y = num.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])


group = num.hstack([[ii] * 10 for ii in range(10)])


def visualizegroup(classes, group, name):

    figure, axis = plot.subplots()
    axis.scatter(
        range(len(group)),
        [0.5] * len(group),
        c=group,
        marker="_",
        lw=60,
        cmap=cmap_data,
    )
    axis.scatter(
        range(len(group)),
        [3.5] * len(group),
        c=classes,
        marker="_",
        lw=60,
        cmap=cmap_data,
    )
    axis.set(
        ylim=[-1, 5],
        yticks=[0.5, 3.5],
        yticklabels=["Data\ngroup", "Data\nclass"],
        xlabel="Sample index",
    )


visualizegroup(y, group, "no groups")
plot.show()

Output:

In the following output, we can see the classes and groups of the dataset that will be split into K consecutive folds, by default without any shuffling of the data.

Scikit learn split K fold
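The code above only visualizes the classes and groups. A minimal sketch of the actual K-fold split, reusing the x, y, and n_splits names defined above (KFold is already imported there), could look like this:

# Split the 100 samples into 6 consecutive folds without shuffling
kfold = KFold(n_splits=n_splits)
for fold, (train_index, test_index) in enumerate(kfold.split(x, y)):
    print("Fold", fold, "train size:", len(train_index), "test size:", len(test_index))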

Read: Scikit learn Sentiment Analysis

Scikit learn Split data strategy

In this section, we will learn about how Scikit learn split data strategy works in Python.

  • Scikit learn split data strategy is used to split the dataset into train data and test data.
  • The training data is used to fit the model and the test data is used to evaluate the fitted model (a short comparison of splitting strategies is also sketched after the output below).
  • We can split the train and test data with the help of the train_test_split() method.

Code:

In the following code, we will import some libraries with which we can demonstrate the data splitting strategy.

  • rng = num.random.RandomState(0) is used to generate the random numbers.
  • y = rng.poisson(lam=num.exp(x[:, 5]) / 2) is used to create a positive-integer target, correlated with x[:, 5], that contains many zeros.
  • x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=rng) is used to split the data into train and test data.
  • glm.fit(x_train, y_train) is used to fit the data.
  • print(glm.score(x_test, y_test)) is used to print the score.
  • numproc = make_pipeline(SimpleImputer(strategy="median"), StandardScaler()) is used to make the numeric preprocessing pipeline.
  • gbdt_no_cst = HistGradientBoostingRegressor().fit(x, y) is used to fit the histgradient boosting regression model.
  • display = plot_partial_dependence() is used to plot the data on the graph.
  • display.axes_[0, 0].plot() is used to display the axes on the screen.
import numpy as num
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

nsamples, nfeatures = 1000, 20
rng = num.random.RandomState(0)
x = rng.randn(nsamples, nfeatures)

# Positive-integer target, correlated with x[:, 5], with many zeros
y = rng.poisson(lam=num.exp(x[:, 5]) / 2)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=rng)
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.01)
glm.fit(x_train, y_train)
gbdt.fit(x_train, y_train)
print(glm.score(x_test, y_test))
print(gbdt.score(x_test, y_test))
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

set_config(display="diagram")

numproc = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

catproc = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (numproc, ("feat1", "feat3")), (catproc, ("feat0", "feat2"))
)

classifier = make_pipeline(preprocessor, LogisticRegression())
classifier
import scipy
import numpy as num
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score

rng = num.random.RandomState(0)
x, y = make_blobs(random_state=rng)
x = scipy.sparse.csr_matrix(x)
x_train, x_test, _, y_test = train_test_split(x, y, random_state=rng)
kmeans = KMeans(algorithm="elkan").fit(x_train)
print(completeness_score(kmeans.predict(x_test), y_test))
import numpy as num
from matplotlib import pyplot as plot
from sklearn.model_selection import train_test_split
# Note: plot_partial_dependence was removed in newer scikit-learn releases,
# where PartialDependenceDisplay.from_estimator is used instead
from sklearn.inspection import plot_partial_dependence
from sklearn.ensemble import HistGradientBoostingRegressor

nsamples = 500
range = num.random.RandomState(0)
x = range.randn(nsamples, 2)
noise = range.normal(loc=0.0, scale=0.01, size=nsamples)
y = 5 * x[:, 0] + num.sin(10 * num.pi * x[:, 0]) - noise

gbdt_no_cst = HistGradientBoostingRegressor().fit(x, y)
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(x, y)

display = plot_partial_dependence(
    gbdt_no_cst,
    x,
    features=[0],
    feature_names=["feature 0"],
    line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:red"},
)
plot_partial_dependence(
    gbdt_cst,
    x,
    features=[0],
    line_kw={"linewidth": 4, "label": "constrained", "color": "tab:cyan"},
    ax=display.axes_,
)
display.axes_[0, 0].plot(
    x[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:orange"
)
display.axes_[0, 0].set_ylim(-3, 3)
display.axes_[0, 0].set_xlim(-1, 1)
plot.legend()
plot.show()

Output:

After running the above code, we get the following output in which we can see that the dataset is split according to the data splitting strategy.

Scikit learn split data strategy
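As a compact illustration of choosing a splitting strategy, the sketch below compares a single train/test split with 5-fold cross-validation on made-up blob data; the data and the exact scores are purely illustrative.

import numpy as num
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

x, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Strategy 1: a single train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
single_split_score = LogisticRegression().fit(x_train, y_train).score(x_test, y_test)

# Strategy 2: 5-fold cross-validation on the full dataset
cv_scores = cross_val_score(LogisticRegression(), x, y, cv=5)

print("Single split score:", single_split_score)
print("Mean cross-validation score:", cv_scores.mean())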

Read: Scikit learn Gradient Descent

Scikit learn Split time series

In this section, we will learn about how Scikit learn split time series works in Python.

Scikit learn split time series is used to split the data into train and test sets in time order, so that each test set only contains samples that come after the corresponding training set (a minimal self-contained example is also shown after the output below).

Code:

In the following code, we will import some libraries from which we can split time series data.

  • figure, axis = plot.subplots(figsize=(14, 6)) is used to plot the figure.
  • averageweek_demand = dataframe.groupby(["weekday", "hour"])["count"].mean() is used to compute the average weekly demand.
  • averageweek_demand.plot(ax=axis) is used to plot the average demand, and axis.set() is used to set the title and labels on the graph.
  • timeseries_cv = TimeSeriesSplit() is used to create the time series splitter.
  • print(x.iloc[test_0]) is used to select and print the test rows by position.
from sklearn.datasets import fetch_openml

bikesharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
dataframe = bikesharing.frame
import matplotlib.pyplot as plot


figure, axis = plot.subplots(figsize=(14, 6))
averageweek_demand = dataframe.groupby(["weekday", "hour"])["count"].mean()
averageweek_demand.plot(ax=axis)
_ = axis.set(
    title="Average Bike Demand During the week",
    xticks=[i * 24 for i in range(7)],
    xticklabels=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    xlabel="Time of the week",
    ylabel="Number of bike rentals",
)
y = dataframe["count"] / dataframe["count"].max()
figure, axis = plot.subplots(figsize=(14, 6))
y.hist(bins=30, ax=axis)
_ = axis.set(
    xlabel="Fraction of rented fleet demand",
    ylabel="Number of hours",
)
x = dataframe.drop("count", axis="columns")
print(x.head())
from sklearn.model_selection import TimeSeriesSplit

timeseries_cv = TimeSeriesSplit(
    n_splits=7,
    gap=48,
    max_train_size=10000,
    test_size=1000,
)
allsplits = list(timeseries_cv.split(x, y))
train_0, test_0 = allsplits[0]
print(x.iloc[test_0])

Output:

After running the above code, we get the following output in which we can see that the time series split has been performed and the selected test rows are shown on the screen.

Scikit learn split time series
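Because the example above depends on downloading the Bike_Sharing_Demand dataset, here is a minimal self-contained sketch of TimeSeriesSplit on a small synthetic series (the array and split sizes are made up), showing that each training window only contains samples that come before the test window:

import numpy as num
from sklearn.model_selection import TimeSeriesSplit

x = num.arange(12).reshape(-1, 1)  # 12 time-ordered samples
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
for train_index, test_index in tscv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)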

Read: Scikit learn Genetic algorithm

Scikit learn Split train test Val

In this section, we will learn how Scikit learn split train test val works in Python.

  • Scikit learn split train test val is used to split the dataset into train, validation, and test subsets (a sketch of the three-way split is shown after the output below).
  • The training data is used to fit the model, the validation data is used to tune it, and the test data is used to evaluate the final fitted model.

Code:

In the following code, we will import some libraries from which we can split the train test val.

  • x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0) is used to split the data set into train test data.
  • figure, axis = plot.subplots() is used to plot the figure or axis on the graph.
  • axis.set_xlabel("Effective Alpha") is used to plot the x label on the graph.
  • axis.set_title("Total Impurity vs effective alpha for training set") is used to plot the title on the screen.
import matplotlib.pyplot as plot
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
path = classifier.cost_complexity_pruning_path(x_train, y_train)
ccpalphas, impurities = path.ccp_alphas, path.impurities
figure, axis = plot.subplots()
axis.plot(ccpalphas[:-2], impurities[:-2], marker="o", drawstyle="steps-post")
axis.set_xlabel("Effective Alpha")
axis.set_ylabel("Total Impurity Of Leaves")
axis.set_title("Total Impurity vs effective alpha for training set")
plot.show()

Output:

After running the above code, we get the following output in which we can see that the graph is plotted on the screen and the data has been split into train and test sets.

Scikit learn split train test val
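The code above produces only a train set and a test set. To actually obtain a separate validation set, a common approach (sketched here with made-up split proportions) is to call train_test_split() twice:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# First split off the test set, then split the remainder into train and validation
x_trainval, x_test, y_trainval, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 of 0.8 = 0.2 of the data

print(len(x_train), len(x_val), len(x_test))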


So, in this tutorial we discussed Scikit learn Split data and we have also covered different examples related to its implementation. Here is the list of examples that we have covered.

  • Scikit learn Split data
  • Scikit learn Split train test index
  • Scikit learn Split by group
  • Scikit learn Split K fold
  • Scikit learn Split data strategy
  • Scikit learn Split time series
  • Scikit learn Split train test Val