In this Python tutorial, we will learn about logistic regression in scikit-learn and work through several examples. We will cover the following topics.
- Scikit-learn logistic regression
- Scikit-learn logistic regression standard errors
- Scikit-learn logistic regression coefficients
- Scikit-learn logistic regression p value
- Scikit-learn logistic regression feature importance
- Scikit-learn logistic regression categorical variables
- Scikit-learn logistic regression cross-validation
- Scikit-learn logistic regression threshold
Scikit-learn logistic regression
In this section, we will learn how to work with logistic regression in scikit-learn.
- Logistic regression is a statistical method for predicting binary classes; it is used when the dependent variable is dichotomous.
- Dichotomous means there are two possible classes, such as the binary classes 0 and 1.
- Despite its name, logistic regression is used for classification. It computes the probability that an event occurs, and scikit-learn's LogisticRegression also handles more than two classes, as in the digits example that follows.
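To make the idea of computing a probability concrete, here is a minimal sketch of the logistic (sigmoid) function that turns a linear score into a probability. The intercept and slope values are hypothetical, chosen only for illustration; they do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model (illustrative values only)
intercept = -3.0
slope = 1.5

x_values = np.array([0.0, 2.0, 4.0])
# Linear score -> probability of the positive class
probabilities = sigmoid(intercept + slope * x_values)
# Assign class 1 when the probability reaches 0.5
predicted_classes = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_classes)
```

The larger the linear score, the closer the probability is to 1; a score of exactly zero gives a probability of 0.5.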
Code:
Here in this code, we load the digits dataset with load_digits from sklearn.datasets. The dataset ships with scikit-learn, so we do not need to download or upload anything.
from sklearn.datasets import load_digits
digits = load_digits()
The data is now loaded. With the commands below we can see that there are 1797 images and 1797 labels in the dataset.
print('Image Data Shape', digits.data.shape)
print("Label Data Shape", digits.target.shape)
The output shows the image data shape, (1797, 64), and the label data shape, (1797,).
Next, we look at a few images and their labels; viewing the images helps you get a feel for the data.
- plot.figure(figsize=(30,4)) creates the figure and sets its size.
- for index, (image, label) in enumerate(zip(digits.data[5:10], digits.target[5:10])): iterates over five images (indices 5 to 9) together with their labels.
- plot.subplot(1, 5, index + 1) selects the subplot for the current image in a 1×5 grid.
- plot.imshow(np.reshape(image, (8,8)), cmap=plot.cm.gray) reshapes the flat 64-value vector into an 8×8 image and displays it in grayscale.
- plot.title('Set: %i\n' % label, fontsize = 30) sets the title of each subplot to the image's label.
import numpy as np
import matplotlib.pyplot as plot
plot.figure(figsize=(30,4))
for index, (image, label) in enumerate(zip(digits.data[5:10], digits.target[5:10])):
    plot.subplot(1, 5, index + 1)
    # Each image is stored as a flat vector of 64 pixels; reshape it to 8x8
    plot.imshow(np.reshape(image, (8,8)), cmap=plot.cm.gray)
    plot.title('Set: %i\n' % label, fontsize = 30)
plot.show()
After running the above code, we can see that five images are plotted on the screen with the titles Set: 5, Set: 6, Set: 7, Set: 8, and Set: 9.
In the following code, we split the data into two parts: training data and testing data.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
Here we import LogisticRegression from sklearn. Scikit-learn focuses on modeling the data rather than on loading or manipulating it.
from sklearn.linear_model import LogisticRegression
In the code below we create an instance of the model. All parameters that are not specified are set to their defaults.
logisticRegression = LogisticRegression()
Above we split the data into training and testing sets. Now we train the model on the training data; afterwards we will evaluate it on the testing data.
logisticRegression.fit(x_train, y_train)
During training, the model learns from the training data. It can then predict the label for a single observation and returns the prediction as an array.
logisticRegression.predict(x_test[0].reshape(1,-1))
In the following output, we see the NumPy array is returned after predicting for one observation.
With the code below we can predict multiple observations at once.
logisticRegression.predict(x_test[0:10])
With this code, we can predict the entire test set.
logisticRegression.predict(x_test)
To find out how well the trained model performs, we measure its accuracy on the test data with the score method.
predictions = logisticRegression.predict(x_test)
score = logisticRegression.score(x_test, y_test)
print(score)
The output is the accuracy of the model, that is, the fraction of test images that were classified correctly.
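As a rough check of what the score method reports, the sketch below recomputes the accuracy by hand and also builds a confusion matrix. It repeats the loading and splitting steps so that it runs on its own. The max_iter=5000 setting and the name clf are assumptions added here (the tutorial uses the defaults); the higher iteration limit only gives the solver more room to converge on unscaled pixel values.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# max_iter is raised here as an assumption; it is not part of the tutorial code
clf = LogisticRegression(max_iter=5000)
clf.fit(x_train, y_train)

predictions = clf.predict(x_test)
# Accuracy is the share of predictions that match the true labels
manual_accuracy = np.mean(predictions == y_test)
# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_test, predictions)

print(manual_accuracy, clf.score(x_test, y_test))
print(cm)
```

The two printed accuracy values should agree, and the off-diagonal entries of the confusion matrix show which digits get confused with each other.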
Scikit-learn logistic regression standard errors
As we know, logistic regression is a statistical method for predicting binary classes, and it is used when the dependent variable is dichotomous.
Here we look at standard errors. The standard errors of the model's coefficients are the square roots of the diagonal entries of the covariance matrix of the coefficient estimates.
Code:
Scikit-learn's LogisticRegression does not report coefficient standard errors itself. The following code uses mean_squared_error from sklearn.metrics, which measures the error of predictions rather than the standard error of coefficients. Passing squared=False returns the root mean squared error; newer scikit-learn releases provide a separate root_mean_squared_error function for this instead.
from sklearn.metrics import mean_squared_error
# Single-output example
y_true = [4, -0.6, 3, 8]
y_pred = [3.5, 0.1, 3, 9]
mean_squared_error(y_true, y_pred)
0.435
# squared=False returns the root mean squared error
mean_squared_error(y_true, y_pred, squared=False)
0.660
# Multi-output example: each column is a separate target
y_true = [[0.6, 2],[-2, 2],[8, -7]]
y_pred = [[1, 3],[-1, 3],[7, -6]]
mean_squared_error(y_true, y_pred)
0.860
mean_squared_error(y_true, y_pred, squared=False)
0.924
# One error value per output column
mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.72, 1.  ])
# Weighted average of the per-column errors
mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.916
Output:
After running the above code we get the following output in which we can see that the error value is generated and seen on the screen.
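The definition given earlier in this section (square roots of the diagonal entries of the covariance matrix) can be sketched directly with NumPy. This is a minimal illustration under stated assumptions, not scikit-learn functionality: the synthetic data, the variable names, and the large C value (used to make the default penalty negligible so the fit approximates plain maximum likelihood) are all choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration; not from the tutorial)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

# A large C makes the L2 penalty negligible, approximating an unpenalized fit
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# Predicted probability of the positive class for every row
p = model.predict_proba(X)[:, 1]
# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack([np.ones(len(X)), X])
# Fisher information: X^T W X with W = diag(p * (1 - p))
weights = p * (1 - p)
fisher_information = X_design.T @ (X_design * weights[:, None])
# The covariance matrix of the estimates is the inverse of the information
covariance = np.linalg.inv(fisher_information)
standard_errors = np.sqrt(np.diag(covariance))

print(standard_errors)  # intercept first, then one value per feature
```

If the model were fitted with noticeable regularization, these values would no longer be valid standard errors, which is why the sketch sets C so high.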
Scikit-learn logistic regression coefficients
In this section, we will learn about how to work with logistic regression coefficients in scikit-learn.
A coefficient is the number by which the value of a feature is multiplied in the model. In logistic regression, each coefficient expresses the size and direction of a feature's effect on the log-odds of the positive class.
Code:
In the following code, we import the libraries pandas as pd, numpy as np, and sklearn as sl.
- The pandas library is used for data manipulation and numpy is used for working with arrays.
- The sklearn library focuses on modeling the data rather than on manipulating it.
- x = np.random.randint(0, 7, size=n) generates n random integers from 0 to 6.
- res_sd = sd.Logit(y, x).fit(method="ncg", maxiter=max_iter) fits a logistic regression with statsmodels for comparison.
- print(res_sl.coef_) prints the coefficients fitted by scikit-learn.
import pandas as pd
import numpy as np
import sklearn as sl
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sd
n = 250
x = np.random.randint(0, 7, size=n)
y = (x > (0.10 + np.random.normal(0, 0.10, n))).astype(int)
# Cross-tabulate the labels against the feature values
print(pd.crosstab(y, x))
max_iter = 150
# statsmodels fit (note that no intercept column is added here)
res_sd = sd.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sd.params)
# scikit-learn fit; a large C makes the regularization negligible
res_sl = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, C=1e8)
res_sl.fit(x.reshape(n, 1), y)
print(res_sl.coef_)
Output:
After running the above code we get the following output in which we can see that the scikit learn logistic regression coefficient is printed on the screen.
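One common way to read a logistic regression coefficient is as an odds ratio: exponentiating the coefficient gives the factor by which the odds of the positive class change when the feature increases by one unit. The sketch below shows this on simulated data. The feature name hours, the outcome passed, and the true coefficients used for the simulation are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=300)  # hypothetical feature

# Simulate labels from a known model: log-odds = -2 + 0.5 * hours
prob = 1 / (1 + np.exp(-(-2 + 0.5 * hours)))
passed = rng.binomial(1, prob)

# Large C so that regularization barely shrinks the coefficient
model = LogisticRegression(C=1e6, max_iter=1000).fit(hours.reshape(-1, 1), passed)

coef = model.coef_[0, 0]
# exp(coef) is the multiplicative change in the odds per one-unit increase
odds_ratio = np.exp(coef)

print(coef, odds_ratio)
```

A positive coefficient corresponds to an odds ratio above 1 (the odds increase with the feature), and a negative coefficient to an odds ratio below 1.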
Scikit-learn logistic regression p value
In this section, we will learn about how to calculate the p-value of logistic regression in scikit learn.
The p-value of a logistic regression coefficient is used to test the null hypothesis that the coefficient is equal to zero. A p-value below 0.05 is commonly taken as evidence that you can reject the null hypothesis.
Code:
In the following code, we import numpy as np, which is used for working with arrays.
- First we calculate the z-score of each coefficient of the scikit-learn logistic regression.
- def logit_p1value(model, x): takes two parameters, model and x.
- model: a fitted sklearn.linear_model.LogisticRegression with an intercept and a large C.
- x: the matrix on which the model was fit.
- model = LogisticRegression(C=1e30).fit(x, y) fits the model whose p-values we want to compute.
- print(logit_p1value(model, x)) computes the p-values and prints them on the screen.
- sd_model = sd.Logit(y, sd.add_constant(x)).fit(disp=0) fits the same model with statsmodels so that the p-values can be compared.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression
def logit_p1value(model, x):
    """Two-sided Wald-test p-values (intercept first) for a fitted model."""
    p1 = model.predict_proba(x)
    n1 = len(p1)
    m1 = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    # Design matrix with a leading column of ones for the intercept
    x_full = np.insert(np.array(x, dtype=float), 0, 1, axis=1)
    # Accumulate the Fisher information: sum of p * (1 - p) * x_i x_i^T
    answ = np.zeros((m1, m1))
    for i in range(n1):
        answ = answ + np.outer(x_full[i], x_full[i]) * p1[i, 1] * p1[i, 0]
    vcov = np.linalg.inv(answ)
    se = np.sqrt(np.diag(vcov))
    t1 = coefs / se
    return (1 - norm.cdf(abs(t1))) * 2
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_p1value(model, x))
# Compare with the p-values reported by statsmodels
import statsmodels.api as sd
sd_model = sd.Logit(y, sd.add_constant(x)).fit(disp=0)
print(sd_model.pvalues)
print(sd_model.summary())
Output:
After running the above code we get the following output in which we can see that logistic regression p-value is created on the screen.
Scikit-learn logistic regression feature importance
In this section, we will learn about the feature importance of logistic regression in scikit learn.
Feature importance is a technique that assigns a score to each input feature based on how useful the feature is in predicting the target variable. For logistic regression, the fitted coefficients can serve as such scores.
Code:
In the following code, we import LogisticRegression from sklearn.linear_model and pyplot for plotting the graph on the screen.
- x, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1) defines the dataset.
- model = LogisticRegression() defines the model.
- model.fit(x, y) fits the model.
- imptance = model.coef_[0] takes the fitted coefficients as the importance of each feature.
- pyplot.bar([X for X in range(len(imptance))], imptance) plots the feature importance as a bar chart.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot
# Define the dataset
x, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# Define and fit the model
model = LogisticRegression()
model.fit(x, y)
# Use the fitted coefficients as importance scores
imptance = model.coef_[0]
for i, j in enumerate(imptance):
    print('Feature: %0d, Score: %.5f' % (i, j))
# Plot the feature importance
pyplot.bar([X for X in range(len(imptance))], imptance)
pyplot.show()
Output:
After running the above code we get the following output in which we can see that logistic regression feature importance is shown on the screen.
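Coefficients are only comparable as importance scores when the features are on similar scales. One common approach, shown here as a sketch and not as part of the tutorial's code, is to standardize the features in a pipeline before fitting and then rank the features by the absolute value of their coefficients. The use of StandardScaler, make_pipeline, and the max_iter setting are assumptions added for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

# Standardize each feature to zero mean and unit variance, then fit the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(x, y)

# make_pipeline names each step after its lowercased class name
coefs = pipeline.named_steps["logisticregression"].coef_[0]
# Sort feature indices from largest to smallest absolute coefficient
ranking = np.argsort(-np.abs(coefs))

for idx in ranking:
    print("Feature: %d, Score: %.5f" % (idx, coefs[idx]))
```

The sign of each score still indicates the direction of the effect; only the magnitude is used for the ranking.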
Scikit-learn logistic regression categorical variables
In this section, we will learn about the logistic regression categorical variable in scikit learn.
As the name suggests, a categorical variable divides the data into categories: each observation is assigned to a particular group on the basis of some qualitative property.
Code:
In the following code, we import some libraries: pandas as pd, NumPy as np, and copy. Pandas is used for manipulating and analyzing the data, and NumPy is used for working with arrays.
import pandas as pd
import numpy as np
import copy
%matplotlib inline
Here we load the data from a CSV file.
df_data.head() shows the first five rows of the data in the file.
df_data = pd.read_csv('data.csv')
df_data.head()
In the following output, we can see the first five rows of the dataset.
df_data.info() prints information about the data, such as the column types and the number of non-null values in each column.
df_data.info()
A boxplot is produced to summarize the distribution of dep_time for each origin.
df_data.boxplot('dep_time','origin',rot = 30,figsize=(5,6))
Here the .copy() method is used so that any change made to the new data frame does not affect the original data.
cat_df_data = df_data.select_dtypes(include=['object']).copy()
The .head() function shows the first rows of the categorical columns, which helps to check whether any values need to be filled in.
cat_df_data.head()
Here we use these commands to check for null values in the data set. From this, we get the total number of missing values.
print(cat_df_data.isnull().values.sum())
This checks the column-wise distribution of the null value.
print(cat_df_data.isnull().sum())
The .value_counts() method returns the frequency of each category. Here the most frequent tailnum value is used to fill in the missing values.
cat_df_data = cat_df_data.fillna(cat_df_data['tailnum'].value_counts().index[0])
Now we check the null values again; after filling them in, the count is zero.
print(cat_df_data.isnull().values.sum())
The .value_counts() method gives the frequency distribution of the categories of a categorical feature.
print(cat_df_data['carrier'].value_counts())
This counts the number of distinct categories of the feature.
print(cat_df_data['carrier'].value_counts().count())
- sns.barplot(x=carrier_count.index, y=carrier_count.values, alpha=0.9) plots the bar graph.
- plt.title('Frequency Distribution of Carriers') gives the bar plot its title.
- plt.ylabel('Number of Occurrences', fontsize=12) labels the y-axis.
- plt.xlabel('Carrier', fontsize=12) labels the x-axis.
import seaborn as sns
import matplotlib.pyplot as plt
carrier_count = cat_df_data['carrier'].value_counts()
sns.set(style="darkgrid")
sns.barplot(x=carrier_count.index, y=carrier_count.values, alpha=0.9)
plt.title('Frequency Distribution of Carriers')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Carrier', fontsize=12)
plt.show()
In this picture, we can see that the bar chart is plotted on the screen.
- labels = cat_df_data['carrier'].astype('category').cat.categories.tolist() builds the list of category labels for the chart.
- sizes = [counts[var_cat] for var_cat in labels] gives the size of each slice of the pie chart.
- fig1, ax1 = plt.subplots() creates the figure and axes on which the chart is drawn.
labels = cat_df_data['carrier'].astype('category').cat.categories.tolist()
counts = cat_df_data['carrier'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)  # autopct shows the percentage on each slice
ax1.axis('equal')
plt.show()
In the following output, we can see that a pie chart is plotted on the screen in which the values are divided into categories.
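Before categorical columns can be passed to LogisticRegression, they have to be converted to numbers. A common approach is one-hot encoding. The sketch below uses a small made-up data frame: the column names dep_delay and late, the values, and the variable names are hypothetical and are not taken from the CSV file used above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names and values are illustrative only
df = pd.DataFrame({
    "carrier": ["AA", "UA", "DL", "AA", "UA", "DL", "AA", "UA"],
    "origin": ["JFK", "EWR", "LGA", "EWR", "JFK", "LGA", "LGA", "EWR"],
    "dep_delay": [5, 40, -2, 12, 55, 0, 3, 60],
    "late": [0, 1, 0, 0, 1, 0, 0, 1],
})

# One-hot encode the categorical columns; pass numeric columns through as-is
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["carrier", "origin"])],
    remainder="passthrough",
)
clf = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

features = df[["carrier", "origin", "dep_delay"]]
clf.fit(features, df["late"])
predictions = clf.predict(features)

print(predictions)
```

Each category becomes its own 0/1 column, so the three carriers and three origins here expand into six columns next to the numeric one. Setting handle_unknown="ignore" means a category not seen during fitting is encoded as all zeros instead of raising an error.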
Scikit-learn logistic regression cross-validation
In this section, we will learn about logistic regression cross-validation in scikit learn.
- As we know, the scikit-learn library focuses on modeling the data rather than on loading it.
- Here we use scikit-learn to evaluate logistic regression with cross-validation.
- Cross-validation is a method that splits the data into several parts and, on each iteration, trains the model on some of the parts and tests it on the remaining part.
Code:
In the following code, we import the libraries needed to estimate the accuracy of logistic regression with cross-validation.
- x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) is used for creating the dataset.
- CV = KFold(n_splits=10, random_state=1, shuffle=True) is used for preparing the cross validation procedure.
- model = LogisticRegression() is used for creating a model.
- score = cross_val_score(model, x, y, scoring='accuracy', cv=CV, n_jobs=-1) evaluates the model on each fold.
- print('Accuracy: %.3f (%.3f)' % (mean(score), std(score))) reports the mean and standard deviation of the accuracy.
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
CV = KFold(n_splits=10, random_state=1, shuffle=True)
model = LogisticRegression()
score = cross_val_score(model, x, y, scoring='accuracy', cv=CV, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(score), std(score)))
Output:
After running the above code we get the following output in which we can see that the accuracy of cross-validation is shown on the screen.
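To show roughly what cross_val_score does behind the scenes, the following sketch runs the same 10-fold procedure as an explicit loop. It is an illustration only: the names cv, fold_scores, and fold_model are introduced for this example, and max_iter=1000 is an added assumption to give the solver room to converge.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

x, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

fold_scores = []
for train_idx, test_idx in cv.split(x):
    # Fit a fresh copy of the model on the training part of this fold
    fold_model = clone(model)
    fold_model.fit(x[train_idx], y[train_idx])
    # Score it on the held-out part
    fold_scores.append(fold_model.score(x[test_idx], y[test_idx]))

print('Accuracy: %.3f (%.3f)' % (np.mean(fold_scores), np.std(fold_scores)))
```

Every observation is used for testing exactly once, and the reported accuracy is the average over the ten held-out parts.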
Scikit-learn logistic regression threshold
In this section, we will learn how to get the logistic regression threshold value in scikit-learn.
- As we know, logistic regression is a statistical method for predicting binary classes. Binary classes are defined as 0 or 1, or in other words, false or true.
- Logistic regression assigns each row a probability of being true and predicts 0 if that probability is less than 0.5, and 1 otherwise.
- The default value of the threshold is 0.5.
Code:
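In the following code, we import the functions needed to find the best threshold for logistic regression. The default threshold is 0.5: if the predicted probability is less than 0.5, the predicted class is 0. The code picks the threshold on the ROC curve that maximizes the difference between the true positive rate and the false positive rate (Youden's J statistic).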
- X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4) is used to generate the dataset.
- trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y) is used to split the data into train and test.
- model.fit(trainX, trainy) is used to fit the model.
- yhat = model.predict_proba(testX) is used to predict the probability.
- yhat = yhat[:, 1] is used to keep the probability for positive outcome only.
- fpr, tpr, thresholds = roc_curve(testy, yhat) is used to calculate the roc curve.
from numpy import argmax
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
# Generate an imbalanced dataset (about 99% of samples in class 0)
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# Predict probabilities and keep those of the positive class only
yhat = model.predict_proba(testX)
yhat = yhat[:, 1]
fpr, tpr, thresholds = roc_curve(testy, yhat)
# Youden's J statistic for every candidate threshold
Jt = tpr - fpr
ix = argmax(Jt)
best_threshold = thresholds[ix]
print('Best Threshold=%f' % (best_threshold))
Output:
After running the above code we get the following output in which we can see the value of the threshold is printed on the screen.
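Once a threshold has been chosen, predict cannot be told to use it directly, but the threshold can be applied by hand to the predicted probabilities. The sketch below repeats the steps above so that it runs on its own and then compares the default predictions with the thresholded ones. The names proba, default_pred, and custom_pred are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)
trainX, testX, trainy, testy = train_test_split(
    X, y, test_size=0.5, random_state=2, stratify=y
)
model = LogisticRegression().fit(trainX, trainy)

# Probability of the positive class for every test row
proba = model.predict_proba(testX)[:, 1]
fpr, tpr, thresholds = roc_curve(testy, proba)
best_threshold = thresholds[np.argmax(tpr - fpr)]

# predict() uses the default cut-off of 0.5
default_pred = model.predict(testX)
# Apply the chosen threshold manually instead
custom_pred = (proba >= best_threshold).astype(int)

print('Positives with default threshold:', default_pred.sum())
print('Positives with chosen threshold:', custom_pred.sum())
```

On imbalanced data like this, moving the threshold changes how many rows are labelled positive, which trades false negatives against false positives.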
So, in this tutorial, we discussed logistic regression in scikit-learn and covered several examples of its implementation. Here is the list of topics that we have covered.
- Scikit-learn logistic regression
- Scikit-learn logistic regression standard errors
- Scikit-learn logistic regression coefficients
- Scikit-learn logistic regression p value
- Scikit-learn logistic regression feature importance
- Scikit-learn logistic regression categorical variables
- Scikit-learn logistic regression cross-validation
- Scikit-learn logistic regression threshold
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I started working on Python, Machine learning, and artificial intelligence for the last 5 years. During this time I got expertise in various Python libraries also like Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, Scikit-Learn, etc… for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc. Check out my profile.