Scikit learn Decision Tree

In this Python tutorial, we will learn how to create a scikit-learn decision tree in Python, and we will work through different examples related to decision trees. We will cover these topics.

  • Scikit learn decision tree
  • Scikit learn decision tree classifier
  • Scikit learn decision tree classifier example
  • Scikit learn decision tree regressor
  • Scikit learn decision tree visualization
  • Scikit learn decision tree pruning
  • Scikit learn decision tree categorical
  • Scikit learn decision tree accuracy

Scikit learn decision tree

In this section, we will learn how to make a scikit-learn decision tree in Python.

  • A decision tree is a flowchart-like tree structure made of nodes and branches. Each internal node tests a feature, and each branch represents the outcome of that test (a decision rule).
  • Splitting is the process of dividing a node into subnodes. The topmost node of the decision tree is known as the root node.
  • A subnode that is itself split into further subnodes is called a decision node.
  • A node that does not split further is called a leaf or terminal node. A subsection of the entire tree is known as a branch or subtree.
  • Nodes can also be described as parent and child nodes. A node that is divided into subnodes is the parent node, and each of its subnodes is a child of that parent.
Scikit learn decision tree
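
The terms above can be checked on a fitted tree. The following is a minimal sketch, assuming the iris dataset and a shallow tree (max_depth=2 and random_state=0 are illustrative choices, not part of the original tutorial). It reads the fitted tree_ attribute, where a node with children_left equal to -1 is a leaf and node 0 is the root.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A shallow tree keeps the node counts small and easy to read
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

t = clf.tree_
# In scikit-learn's tree arrays, children_left[i] == -1 marks node i as a leaf
is_leaf = t.children_left == -1
n_leaves = int(is_leaf.sum())
n_decision = int(t.node_count - n_leaves)  # root plus other internal nodes

print("total nodes:", t.node_count)
print("decision nodes (including the root):", n_decision)
print("leaf nodes:", n_leaves)
```

Because every split in scikit-learn is binary, the number of leaves is always one more than the number of decision nodes.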

As we see in the above picture, the node is split into sub-nodes. The best split point for each node is selected as follows.

  • The decision tree evaluates candidate splits on all the available variables.
  • It selects the split that results in the most homogeneous sub-nodes.
  • The time complexity of building a decision tree is a function of the number of records and the number of attributes in the given data.
  • The decision tree is a non-parametric method, which means it does not assume a particular probability distribution for the data.
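
To make "most homogeneous sub-nodes" concrete, here is a simplified sketch of choosing a split threshold by Gini impurity, which is the default criterion of DecisionTreeClassifier. The helper names gini and weighted_gini and the toy data are hypothetical, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of labels: 0.0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(feature, labels, threshold):
    """Impurity of a split, weighted by the size of each child node."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data (hypothetical): one feature, two classes
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Candidate thresholds midway between consecutive feature values
candidates = [1.5, 2.5, 3.5, 4.5, 5.5]
scores = {thr: weighted_gini(feature, labels, thr) for thr in candidates}
best = min(scores, key=scores.get)

print(scores)
print("best threshold:", best)
```

The threshold 3.5 separates the two classes perfectly, so both child nodes are pure and the weighted impurity is 0.0.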


Scikit learn decision tree classifier

In this section, we will learn how to create a scikit-learn decision tree classifier in Python.

  • A decision tree is a non-parametric supervised learning method used for classification and regression. It predicts a value by learning simple decision rules from the data.
  • DecisionTreeClassifier is a class that can perform multi-class classification on a dataset.
  • The decision tree classifier takes two arrays as input: an array X that holds the training samples, and an array Y that holds the class labels for those samples.
  • A decision tree classifier supports binary classification as well as multiclass classification.

Code:

In the following code, we will load the iris data from the sklearn library and also import the tree module from sklearn.

  • load_iris() is used to load the dataset.
  • X, Y = iris.data, iris.target assigns the feature matrix to X and the class labels to Y.
  • tree.DecisionTreeClassifier() is used for making the decision tree classifier.
  • clasifier.fit(X, Y) is used to fit the tree to the data.
  • tree.plot_tree(clasifier) is used to plot the decision tree on the screen.
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
X, Y = iris.data, iris.target
clasifier = tree.DecisionTreeClassifier()
clasifier = clasifier.fit(X, Y)
tree.plot_tree(clasifier)

Output:

After running the above code we get the following output in which we can see that the decision tree is plotted on the screen.

scikit learn decision tree classifier
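
As a text-based alternative to the plot, the learned rules can be printed with export_text from sklearn.tree. This is a minimal sketch. The max_depth=2 and random_state=0 settings are assumptions made here to keep the output short, and are not part of the code above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each indented line is one decision rule; "class:" lines are leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

This can be handy in a terminal or a log file, where a Matplotlib figure is not available.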


Scikit learn decision tree classifier example

In this section, we will work through a scikit-learn decision tree classifier example in Python.

  • As we know, a decision tree is a non-parametric supervised learning method used for predicting a value.
  • The decision tree classifier takes two arrays as input: an array X that holds the training samples, and an array Y that holds the class labels for those samples.

Example:

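In the following example, we will import the graphviz library. Graphviz is open-source graph visualization software, and the graphviz Python package is an interface to it.

  • tree.export_graphviz(clf, out_file=None) returns the tree as a string in Graphviz DOT format.
  • The extra arguments to tree.export_graphviz() (feature_names, class_names, filled, rounded, special_characters) add feature and class labels to the nodes and control their styling.
  • graphviz.Source(dotdata) creates a graph object from the DOT string, which can then be rendered or displayed.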
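import graphviz
from sklearn import tree
from sklearn.datasets import load_iris

# Fit a classifier on the iris data so there is a tree to export
iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# Export the tree as DOT source and render it to a file (iris.pdf by default)
dotdata = tree.export_graphviz(clf, out_file=None)
graphs = graphviz.Source(dotdata)
graphs.render("iris")

# Export again with feature names, class names, and colored nodes
dotdata = tree.export_graphviz(clf, out_file=None,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True, rounded=True,
                     special_characters=True)
graphs = graphviz.Source(dotdata)
graphs  # displays the graph inline when run in a Jupyter notebook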

Output:

After running the above code we get the following output in which we can see that a decision tree graph is drawn with the help of Graphviz.

Scikit learn decision tree classifier example
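
If the graphviz Python package is not installed, the DOT source itself can still be produced and inspected, because export_graphviz only needs scikit-learn. The sketch below assumes a shallow tree (max_depth=2, random_state=0) so the output stays short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# With out_file=None the DOT description is returned as a string
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
)

# Print the first few lines of the DOT source
for line in dot_source.splitlines()[:5]:
    print(line)
```

The string can be written to a .dot file and rendered later on a machine where Graphviz is available.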


Scikit learn decision tree regressor

Before moving forward, we should have some knowledge about regression.

  • A regressor predicts continuous values. Regression is a modeling technique that investigates the relationship between dependent and independent variables.
  • In regression, the dependent variable is the response (target) and the independent variables are the features.
  • A regressor helps us understand how the value of the dependent variable changes as an independent variable changes.

Code:

In the following code, we will import some libraries: import numpy as np, from sklearn.tree import DecisionTreeRegressor, and import matplotlib.pyplot as plot.

  • np.random.RandomState(1) creates a seeded random number generator, which is used to build a random dataset.
  • regression_1.fit(X, y) is used to fit the regression model.
  • regression_1.predict(X_test) is used to predict the data.
  • plot.figure() is used to create a new figure.
  • plot.xlabel("data") is used to set the x label on the graph.
  • plot.ylabel("target") is used to set the y label on the graph.
  • plot.title("Decision Tree Regression") is used to give the title to the graph.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plot


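# Seeded generator (named rng to avoid shadowing the built-in range)
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# Add noise to every fifth target value
y[::5] += 3 * (0.5 - rng.rand(16))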


regression_1 = DecisionTreeRegressor(max_depth=2)
regression_2 = DecisionTreeRegressor(max_depth=5)
regression_1.fit(X, y)
regression_2.fit(X, y)


X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
Y1 = regression_1.predict(X_test)
Y2 = regression_2.predict(X_test)


plot.figure()
plot.scatter(X, y, s=20, edgecolor="black", c="pink", label="data")
plot.plot(X_test, Y1, color="blue", label="max_depth=2", linewidth=2)
plot.plot(X_test, Y2, color="green", label="max_depth=5", linewidth=2)
plot.xlabel("data")
plot.ylabel("target")
plot.title("Decision Tree Regression")
plot.legend()
plot.show()

Output:

After running the above code we get the following output, in which the decision tree regression is plotted. The pink points are the noisy training data, the blue line is the prediction of the shallower tree (regression_1), and the green line is the prediction of the deeper tree (regression_2).

scikit learn decision tree regressor
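
The step-like shape of the predictions comes from the fact that a regression tree predicts one constant value per leaf. The sketch below illustrates this under the same kind of synthetic sine data as above (without the added noise, which is a simplification made here). The results dictionary is a hypothetical helper for collecting the numbers.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]

results = {}
for depth in (2, 5):
    reg = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    # Count how many different values the tree ever predicts
    n_levels = len(np.unique(reg.predict(X_test)))
    results[depth] = (reg.get_n_leaves(), n_levels)
    print(f"max_depth={depth}: leaves={results[depth][0]}, "
          f"distinct predictions={n_levels}")
```

A tree of depth d has at most 2**d leaves, so a deeper tree can follow the data more closely, at the cost of also fitting noise.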


Scikit learn decision tree visualization

In this section, we will learn how to visualize a scikit-learn decision tree in Python.

Before moving forward, we should have some knowledge about visualization.

  • Visualization is the process of presenting a dataset in the form of graphs, charts, or trees.
  • A decision tree visualization presents the learned decision rules in a tree format that the user can understand more easily.
  • Here the decision tree visualization is done with the plot_tree function from sklearn.tree, using the sklearn iris dataset.

Code:

In the following code, we will import some libraries: import matplotlib.pyplot as plot, from sklearn import datasets, from sklearn.model_selection import train_test_split, and from sklearn.tree import DecisionTreeClassifier.

  • iris = datasets.load_iris() is used for loading the iris dataset.
  • X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=1, stratify=Y) is used for creating the train and test datasets.
  • classifier_tree = DecisionTreeClassifier(criterion='gini', max_depth=6, random_state=1) creates the classifier, which is then trained with classifier_tree.fit(X_train, Y_train).
  • tree.plot_tree(classifier_tree, fontsize=12) is used for plotting the decision tree.
import matplotlib.pyplot as plot
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

iris = datasets.load_iris()
X = iris.data[:, 2:]
Y = iris.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=1, stratify=Y)

classifier_tree = DecisionTreeClassifier(criterion='gini', max_depth=6, random_state=1)
classifier_tree.fit(X_train, Y_train)

figure, axis = plot.subplots(figsize=(12, 12))
tree.plot_tree(classifier_tree, fontsize=12)
plot.show()

Output:

After running the above code we get the following output in which we can see that scikit learn decision tree visualization is drawn on the screen.

Scikit learn decision tree visualization
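
The plot is often easier to read when the nodes carry feature and class names. The following is a sketch of that variation, assuming the same two petal features as above. The Agg backend, max_depth=3, and figure size are choices made here so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width only
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, iris.target)

fig, ax = plt.subplots(figsize=(10, 8))
# plot_tree returns the list of text annotations it drew, one per node box
annotations = plot_tree(
    clf,
    feature_names=list(iris.feature_names[2:]),
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
print("drawn node boxes:", len(annotations))
```

With filled=True, each node is colored by its majority class. The figure can then be written to disk with fig.savefig.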


Scikit learn decision tree pruning

In this section, we will learn how to do scikit-learn decision tree pruning in Python.

Pruning is a technique that reduces the size of a decision tree by removing sections of the tree that contribute little to its predictive power, which helps to reduce overfitting. Scikit-learn provides minimal cost-complexity pruning, which is controlled by the ccp_alpha parameter.

Code:

In the following code, we import some libraries: import matplotlib.pyplot as plt, from sklearn.model_selection import train_test_split, from sklearn.datasets import load_breast_cancer, and from sklearn.tree import DecisionTreeClassifier.

  • load_breast_cancer(return_X_y=True) is used to load the breast cancer data.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) is used to split the data into train and test sets.
  • DecisionTreeClassifier(random_state=0) creates the decision tree classifier with a fixed random seed so the results are reproducible.
  • clasifier.cost_complexity_pruning_path(X_train, y_train) computes the effective alphas and the corresponding total leaf impurities at each step of the pruning process.
  • axis.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post") plots the total leaf impurity against the effective alpha. The last alpha, which prunes the tree down to a single node, is left out.
  • axis.set_xlabel("Effective alpha") is used to give the x label to the graph.
  • axis.set_ylabel("Total impurity of leaves") is used to give the y label to the graph.
  • axis.set_title("Total Impurity vs effective alpha for training set") is used to give the title to the graph.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clasifier = DecisionTreeClassifier(random_state=0)
path = clasifier.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, axis = plt.subplots()
axis.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
axis.set_xlabel("Effective alpha")
axis.set_ylabel("Total impurity of leaves")
axis.set_title("Total Impurity vs effective alpha for training set")

Output:

After running the above code we get the following output in which we can see the total impurities of the leaves.

scikit learn decision tree pruning
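
The pruning path only lists candidate alphas. To actually prune, a tree is refit with a chosen ccp_alpha. The sketch below, which uses the same dataset and split as above, refits one tree per alpha and shows how the tree shrinks as alpha grows. The node_counts list is a hypothetical helper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

node_counts = []
for alpha in path.ccp_alphas:
    # A larger ccp_alpha prunes more aggressively
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    node_counts.append(clf.tree_.node_count)
    print(f"alpha={alpha:.5f}  nodes={clf.tree_.node_count:3d}  "
          f"test accuracy={clf.score(X_test, y_test):.3f}")
```

In practice, an alpha is usually picked by comparing scores on held-out data or with cross-validation rather than by looking at the training set alone.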


Scikit learn decision tree categorical

In this section, we will learn how a scikit-learn decision tree works with categorical data in Python.

  • A categorical variable takes one of a fixed number of possible values, called categories.
  • Categorical variables divide the data into different groups, such as gender, type, or class.

Code:

In the following code, we will import some libraries: from matplotlib import pyplot as plt, from sklearn import datasets, and from sklearn.tree import DecisionTreeClassifier.

  • We load the iris dataset, whose target is a categorical variable with three classes (species).
  • tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True) is used to plot the tree on the screen.
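from matplotlib import pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree

# Load the features and the categorical target (three species)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit the classifier so that there is a tree to plot
clf = DecisionTreeClassifier().fit(X, y)

fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf, 
                   feature_names=iris.feature_names,  
                   class_names=iris.target_names,
                   filled=True)
plt.show()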

Output:

After running the above code we get the following output, in which we can see that the tree divides the data into the different categories (classes) of the target.

scikit-learn decision tree categorical
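
The iris features are all numeric. When the input features themselves are categorical strings, scikit-learn's tree estimators need them converted to numbers first. The following is a minimal sketch using OrdinalEncoder. The toy DataFrame, its column names, and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with two categorical features
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size": ["S", "M", "L", "L", "S", "M"],
    "label": [0, 1, 1, 1, 0, 1],
})

# Map each category to an integer code, one column at a time
encoder = OrdinalEncoder()
X = encoder.fit_transform(df[["color", "size"]])
y = df["label"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# New samples must go through the same fitted encoder
new = pd.DataFrame({"color": ["red"], "size": ["S"]})
pred = clf.predict(encoder.transform(new))
print("predicted label:", pred[0])
```

OrdinalEncoder imposes an arbitrary order on the categories. For features with no natural order, OneHotEncoder is a common alternative, at the cost of more columns.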


Scikit learn decision tree accuracy

In this section, we will learn how to measure the accuracy of a scikit-learn decision tree in Python.

  • Accuracy measures the performance of a classification model. For binary classification, it is the sum of true positives and true negatives divided by the total number of cases.
  • More generally, accuracy is the ratio of correctly classified cases to the total number of cases under evaluation.
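
The definition can be checked by hand on a small example. In the sketch below, the two label arrays are made-up values for illustration. Six of the eight predictions match the true labels, so the accuracy is 6 / 8 = 0.75.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where the prediction equals the true label
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)

print("manual accuracy:", manual)    # 0.75
print("accuracy_score:", library)    # 0.75
```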

Code:

In the following code, we will import the libraries needed to find the accuracy of the decision tree.

data = pd.read_csv("diabetes.csv", header=None, names=col_names) is used to read the data from the dataset.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split  
from sklearn import metrics 
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

data = pd.read_csv("diabetes.csv", header=None, names=col_names)

With the data.head() function we get the first five rows of the data set.

data.head()

As shown in this picture we get the first five rows from the data set from data.head() function.

scikit learn decision tree accuracy dataset

In the following code, we separate the features from the target variable, split the data into training and test sets, and compute the accuracy.

print("Accuracy:", metrics.accuracy_score(y_test, y_pred)) is used to print the accuracy on the held-out test data on the screen.

# Select the feature columns and the target column
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = data[feature_cols] 
y = data.label 

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 

# Create and train the classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

# Predict on the test set, which the model has not seen during training
y_pred = clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

After running the above code we get the following output in which we can see the accuracy of the model.

scikit learn decision tree accuracy
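
Accuracy measured on the training data is usually higher than accuracy on unseen data, because an unpruned tree can memorize its training set. The sketch below compares the two. It uses the built-in breast cancer dataset rather than diabetes.csv, which is an assumption made here so that the example runs without a local file.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Accuracy on data the tree was trained on vs. data it has never seen
train_acc = metrics.accuracy_score(y_train, clf.predict(X_train))
test_acc = metrics.accuracy_score(y_test, clf.predict(X_test))

print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

The test accuracy is the more honest estimate of how the model would perform on new data.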


So, in this tutorial we discussed Scikit learn decision tree and we have also covered different examples related to its implementation. Here is the list of examples that we have covered.

  • Scikit learn decision tree
  • Scikit learn decision tree classifier
  • Scikit learn decision tree classifier example
  • Scikit learn decision tree regressor
  • Scikit learn decision tree visualization
  • Scikit learn decision tree pruning
  • Scikit learn decision tree categorical
  • Scikit learn decision tree accuracy