# Scikit-learn Sentiment Analysis

In this Python tutorial, we will learn how sentiment analysis works with Scikit-learn in Python, and we will work through several examples of it. We will cover the following topics.

• Scikit-learn sentiment analysis
• Scikit-learn sentiment classification
• Scikit-learn sentiment logistic regression

## Scikit-learn sentiment analysis

In this section, we will learn how sentiment analysis works with Scikit-learn in Python.

• Sentiment analysis is the process of determining whether a piece of text expresses a positive or a negative opinion, and it is an important part of natural language processing.
• A learning algorithm cannot process raw text directly, so the text data must first be converted into numerical data.
• Once the text data has been converted into numerical data, such as word counts, it can be passed to the algorithm.
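To make the idea of converting text into numbers concrete, here is a minimal plain-Python sketch of a bag-of-words encoding. The helper name `bag_of_words` and the two sample sentences are made up for illustration, and the whitespace tokenization is a simplification of what CountVectorizer does.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Simplified tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all distinct tokens
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per document, one column per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up sample sentences
docs = ["good movie good plot", "bad movie"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(matrix)      # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each row is now a fixed-length vector of numbers, which is the kind of input a classifier can work with.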

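Code:

In the following code, we will import the libraries that we need for sentiment analysis.

• count=CountVectorizer() is used to create the vectorizer that converts text into word counts.
• data.head() is used to print the first five rows of the dataset.
```python
import pandas as pds
import numpy as num
import matplotlib.pyplot as plot
import seaborn as sb
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer that turns text into a matrix of word counts
count = CountVectorizer()

# The step that loads the dataset into `data` is missing from this listing.
# `data` is expected to be a pandas DataFrame with a 'label' column
# (1 = positive, 0 = negative). Once it is loaded, inspect it with:
# data.head()
```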

After running the above code, we get the following output, in which we can see the first five rows of the dataset.

The data.shape attribute gives the shape of the dataset as a (rows, columns) tuple.

``data.shape``
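As a rough illustration of what `head()` and `shape` return, the sketch below builds a small stand-in DataFrame. The name `sample` and the six reviews are invented for this example and are not the dataset used in this tutorial; only the column layout (a text column and a binary `label` column) follows the code in this section.

```python
import pandas as pds

# Hypothetical stand-in data: made-up reviews with a binary label
# (1 = positive, 0 = negative)
sample = pds.DataFrame({
    "text": [
        "I loved this film",
        "Terrible acting and a dull plot",
        "A wonderful, moving story",
        "Not worth the ticket price",
        "Great soundtrack",
        "I fell asleep halfway through",
    ],
    "label": [1, 0, 1, 0, 1, 0],
})

first_rows = sample.head()                      # first five rows
n_rows, n_cols = sample.shape                   # (rows, columns)
label_counts = sample["label"].value_counts()   # rows per label

print(first_rows)
print(n_rows, n_cols)
print(label_counts)
```

The label counts are the same numbers that a pie chart of positive and negative rows would be drawn from.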
• fig=plot.figure(figsize=(5,5)) is used to create the figure that the chart is drawn on.
• clrs=["green", 'red'] is used to set the colors of the chart.
• positive=data[data['label']==1] is used to select the rows with a positive label.
• negative=data[data['label']==0] is used to select the rows with a negative label.
• piechart=plot.pie(cv,labels=["Positive","Negative"],autopct='%1.1f%%',shadow=True,colors=clrs,startangle=45,explode=(0, 0.1)) is used to plot the pie chart on the screen.
• dataframe=["Hello master, Dont carry the world upon your shoulders for well you know that its a fool who plays it cool by making his world a little colder Na-na-na,a, na"] is used as the sample text. Despite its name, it is a list that contains one string.
• bagview=count.fit_transform(dataframe) is used to transform the text data into numerical data.
• print(count.get_feature_names_out()) is used to print the feature names (the vocabulary) on the screen.
```python
fig = plot.figure(figsize=(5, 5))
clrs = ["green", 'red']
positive = data[data['label'] == 1]
negative = data[data['label'] == 0]
# Number of positive and negative rows
cv = [positive['label'].count(), negative['label'].count()]
piechart = plot.pie(cv, labels=["Positive", "Negative"],
                    autopct='%1.1f%%',
                    shadow=True,
                    colors=clrs,
                    startangle=45,
                    explode=(0, 0.1))
plot.show()

dataframe = ["Hello master, Dont carry the world upon your shoulders for well you know that its a fool who plays it cool by making his world a little colder Na-na-na,a, na"]
bagview = count.fit_transform(dataframe)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count.get_feature_names_out())
print(bagview.toarray())
```

After running the above code, we get the following output, in which we can see the text data converted to numerical data, together with a pie chart that shows the share of positive and negative data.

## Scikit-learn sentiment classification

In this section, we will learn how sentiment classification works with Scikit-learn in Python.

• Sentiment classification is the process of automatically detecting the emotional tone of a text.
• It is an important area of natural language processing.
• It is widely applied to customer reviews, where it shows whether customers are giving positive or negative reviews.

Code:

In the following code, we will import the libraries that we need to build and evaluate a sentiment classifier.

• positive_folder = f'{folder}/pos' is the folder that holds the positive reviews.
• negative_folder = f'{folder}/neg' is the folder that holds the negative reviews.
• get_files(fld) returns a list of all the files in the folder fld, which is either the positive or the negative review folder.
• positive_files = get_files(positive_folder) is used to list the positive review files.
• negative_files = get_files(negative_folder) is used to list the negative review files.
• textfile = list(map(lambda txt: re.sub(r'(<br\s*/?>)+', ' ', txt), text)) is used to replace HTML line breaks with a space.
• imdb_train = create_data_frame('aclImdb/train') is used to create the train dataframe.
• imdb_test = create_data_frame('aclImdb/test') is used to create the test dataframe.
• unigram_vectorizers.fit(imdb_train['text'].values) is used to learn the unigram vocabulary from the training text.
• bigram_vectorizers = CountVectorizer(ngram_range=(1, 2)) is used to count both unigrams and bigrams.
• x_train_bigram_tf_idf = bigram_tf_idf_transformer.transform(x_train_bigram) is used to convert the bigram counts into Tf-Idf weights.
• classifier.fit(x_train, y_train) is used to fit the classifier.
• print(f'{title}\nTrain score: {round(train_score, 2)} ; Validation score: {round(valid_score, 2)}\n') is used to print the title with the train and validation scores.
• train_and_show_scores(x_train_unigrams, y_train, 'Unigram Counts') is used to show the scores for the unigram counts.
• train_and_show_scores(x_train_bigram, y_train, 'Bigram Counts') is used to show the scores for the bigram counts.
```python
import pandas as pds
import re
from os import system, listdir
from os.path import isfile, join
from random import shuffle

# Download and unpack the IMDB review dataset (requires wget and tar)
system('wget "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"')
system('tar -xzf "aclImdb_v1.tar.gz"')

def create_data_frame(folder: str) -> pds.DataFrame:
    positive_folder = f'{folder}/pos'
    negative_folder = f'{folder}/neg'

    def get_files(fld: str) -> list:
        # All files in the positive or negative review folder
        return [join(fld, f) for f in listdir(fld) if isfile(join(fld, f))]

    def append_files_data(data_list: list, files: list, label: int) -> None:
        for file_path in files:
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
                data_list.append((text, label))

    positive_files = get_files(positive_folder)
    negative_files = get_files(negative_folder)

    data_list = []
    append_files_data(data_list, positive_files, 1)
    append_files_data(data_list, negative_files, 0)
    shuffle(data_list)

    text, label = tuple(zip(*data_list))
    # Replace HTML line breaks with a space
    textfile = list(map(lambda txt: re.sub(r'(<br\s*/?>)+', ' ', txt), text))

    return pds.DataFrame({'text': textfile, 'label': label})

imdb_train = create_data_frame('aclImdb/train')
imdb_test = create_data_frame('aclImdb/test')

system("mkdir 'csv'")
imdb_train.to_csv('csv/imdb_train.csv', index=False)
imdb_test.to_csv('csv/imdb_test.csv', index=False)

from joblib import dump
from scipy.sparse import save_npz
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

system("mkdir 'data_preprocessors'")
system("mkdir 'vectorized_data'")

# Unigram Counts

# The source line is cut off after "unigram_v"; a CountVectorizer that
# counts single words (the default n-gram range) is assumed here.
unigram_vectorizers = CountVectorizer(ngram_range=(1, 1))
unigram_vectorizers.fit(imdb_train['text'].values)

dump(unigram_vectorizers, 'data_preprocessors/unigram_vectorizer.joblib')

x_train_unigrams = unigram_vectorizers.transform(imdb_train['text'].values)

save_npz('vectorized_data/X_train_unigram.npz', x_train_unigrams)

# Unigram Tf-Idf

unigram_tf_idf_transformer = TfidfTransformer()
unigram_tf_idf_transformer.fit(x_train_unigrams)

dump(unigram_tf_idf_transformer, 'data_preprocessors/unigram_tf_idf_transformer.joblib')
x_train_unigram_tf_idf = unigram_tf_idf_transformer.transform(x_train_unigrams)

# Bigram Counts

bigram_vectorizers = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizers.fit(imdb_train['text'].values)

dump(bigram_vectorizers, 'data_preprocessors/bigram_vectorizers.joblib')

x_train_bigram = bigram_vectorizers.transform(imdb_train['text'].values)

save_npz('vectorized_data/x_train_bigram.npz', x_train_bigram)

# Bigram Tf-Idf

bigram_tf_idf_transformer = TfidfTransformer()
bigram_tf_idf_transformer.fit(x_train_bigram)

dump(bigram_tf_idf_transformer, 'data_preprocessors/bigram_tf_idf_transformer.joblib')

x_train_bigram_tf_idf = bigram_tf_idf_transformer.transform(x_train_bigram)

save_npz('vectorized_data/x_train_bigram_tf_idf.npz', x_train_bigram_tf_idf)

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from scipy.sparse import csr_matrix
import numpy as num

def train_and_show_scores(x: csr_matrix, y: num.ndarray, title: str) -> None:
    # Hold out 25% of the training data for validation, keeping the label balance
    x_train, x_valid, y_train, y_valid = train_test_split(
        x, y, train_size=0.75, stratify=y
    )

    classifier = SGDClassifier()
    classifier.fit(x_train, y_train)
    train_score = classifier.score(x_train, y_train)
    valid_score = classifier.score(x_valid, y_valid)
    print(f'{title}\nTrain score: {round(train_score, 2)} ; Validation score: {round(valid_score, 2)}\n')

y_train = imdb_train['label'].values

train_and_show_scores(x_train_unigrams, y_train, 'Unigram Counts')
train_and_show_scores(x_train_unigram_tf_idf, y_train, 'Unigram Tf-Idf')
train_and_show_scores(x_train_bigram, y_train, 'Bigram Counts')
train_and_show_scores(x_train_bigram_tf_idf, y_train, 'Bigram Tf-Idf')
```

Output:

After running the above code, we get the following output, in which the train and validation scores for Unigram Counts, Unigram Tf-Idf, Bigram Counts, and Bigram Tf-Idf are printed on the screen.
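The four feature sets compared here differ in two ways: whether bigrams are included, and whether raw counts are reweighted by Tf-Idf. The following plain-Python sketch illustrates both ideas. The helper names `ngrams` and `smooth_idf` are hypothetical, the tokenization is simplified (CountVectorizer's default drops one-character tokens such as "a"), and the idf formula is the smoothed variant that TfidfTransformer documents as its default; the row normalization that the library also applies is left out.

```python
import math

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max, joined by spaces."""
    result = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Made-up example sentence
tokens = "not a good movie".split()
unigrams = ngrams(tokens, 1, 1)
uni_and_bigrams = ngrams(tokens, 1, 2)
print(unigrams)         # ['not', 'a', 'good', 'movie']
print(uni_and_bigrams)  # the unigrams plus 'not a', 'a good', 'good movie'

# A term found in every document gets the minimum weight of 1.0,
# while a term found in few documents gets a larger weight.
idf_common = smooth_idf(n_docs=4, doc_freq=4)
idf_rare = smooth_idf(n_docs=4, doc_freq=1)
print(idf_common, idf_rare)
```

Bigrams let the model see short phrases such as "not a" that single words cannot capture, and Tf-Idf reduces the influence of words that occur in almost every review.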

## Scikit-learn sentiment logistic regression

In this section, we will learn how sentiment analysis with logistic regression works with Scikit-learn in Python.

• Sentiment analysis with logistic regression means predicting the feeling expressed about something from data such as text.
• It helps a company make decisions: if the public reviews of a product are poor, the company can modify the product.
• The company can also stop producing a product that receives bad reviews in order to avoid losses.
• It is widely applied to people's tweets, where it classifies each tweet as positive or negative.
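As a rough sketch of how a logistic regression model turns word counts into a sentiment prediction, the plain-Python example below computes a weighted sum of the counts and passes it through the sigmoid function. The function names and the weights are hypothetical; a real model learns its weights from labeled data during fitting.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_positive_probability(counts, weights, bias):
    """Probability of the positive class for one document's word counts."""
    # Words without a weight contribute nothing to the score
    z = bias + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return sigmoid(z)

# Hypothetical weights chosen by hand for illustration
weights = {"love": 1.5, "great": 1.0, "hate": -2.0, "boring": -1.2}
bias = 0.0

p_pos = predict_positive_probability({"love": 1, "great": 1}, weights, bias)
p_neg = predict_positive_probability({"hate": 1, "boring": 2}, weights, bias)
p_unknown = predict_positive_probability({"table": 3}, weights, bias)

print(round(p_pos, 3))  # above 0.5, so classified as positive
print(round(p_neg, 3))  # below 0.5, so classified as negative
print(p_unknown)        # 0.5, no evidence either way
```

Words with positive weights push the probability toward the positive class and words with negative weights push it toward the negative class.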

Code:

In the following code, we will import a count vectorizer to convert the text data into numerical data.

• data.append(i) is used to add each tweet to the data.
• datalabels.append('positive') is used to add the label for a positive tweet.
• datalabels.append('negative') is used to add the label for a negative tweet.
• features = vectorizers.fit_transform(data) is used to fit the vectorizer and convert the tweets into word counts.
• x_train,x_test,y_train,y_test=train_test_split(features_nd,datalabels,train_size=0.80,random_state=1234) is used to split the dataset into train data and test data.
• logreg_model = logreg_model.fit(X=x_train, y=y_train) is used to fit the logistic regression model to the train data.
• j = rand.randint(0,len(x_test)-7) is used to pick a random starting position in the test data.
• print(y_pred[i]) is used to print the prediction for one test tweet.
• print(data[index].strip()) is used to print the matching tweet on the screen.
```python
from sklearn.feature_extraction.text import CountVectorizer

data = []
datalabels = []
with open("positive_tweets.txt") as f:
    for i in f:
        data.append(i)
        datalabels.append('positive')

with open("negative_tweets.txt") as f:
    for i in f:
        data.append(i)
        datalabels.append('negative')

vectorizers = CountVectorizer(
    analyzer='word',
    lowercase=False,
)
features = vectorizers.fit_transform(data)
features_nd = features.toarray()

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    features_nd,
    datalabels,
    train_size=0.80,
    random_state=1234)

from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression()
logreg_model = logreg_model.fit(X=x_train, y=y_train)
y_pred = logreg_model.predict(x_test)

import random as rand

# Show seven consecutive test tweets, starting at a random position
features_list = features_nd.tolist()
j = rand.randint(0, len(x_test) - 7)
for i in range(j, j + 7):
    print(y_pred[i])
    # Find the original tweet whose count vector matches this test row
    index = features_list.index(x_test[i].tolist())
    print(data[index].strip())
```

After running the above code, we get the following output, in which the predicted label, negative or positive, is printed on the screen together with each sampled tweet.

In this code, we will import accuracy_score from sklearn.metrics, with which we can measure the accuracy of the model.

print(accuracy_score(y_test, y_pred)) is used to print the accuracy of the model, that is, the fraction of test tweets whose predicted label matches the true label.

```python
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
```

After running the above code, we get the following output in which we can see that the accuracy of the model is printed on the screen.
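To show what the accuracy figure means, here is a small plain-Python sketch that computes the same quantity by hand. The function name `accuracy` and the demo labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels for illustration
y_true_demo = ["positive", "negative", "positive", "negative", "positive"]
y_pred_demo = ["positive", "negative", "negative", "negative", "positive"]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)  # 4 of 5 predictions are correct -> 0.8
```

An accuracy of 0.8 therefore means that four out of every five test tweets were labeled correctly.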