In this Python tutorial, we will learn how Scikit learn KNN works in Python and walk through several examples related to it. We will cover the following topics.
- Scikit learn KNN
- Scikit learn KNN classification
- Scikit learn KNN advantages and disadvantages
- Scikit learn KNN Regression
- Scikit learn KNN Regression Example
- Scikit learn KNN Example
- Scikit learn KNN Imputation
- Scikit learn KNN Distance
Scikit learn KNN
In this section, we will learn how Scikit learn KNN works in Python.
- KNN stands for K Nearest Neighbours. It is one of the simplest machine learning algorithms.
- KNN is a supervised learning technique. It can be used for both classification and regression, but it is mainly used for classification.
- KNN assumes that similar data points lie close to each other, so it places a new data point in the category that is most common among the stored points nearest to it.
- KNN is a lazy learner: it does not learn a model from the training set. Instead, it stores the dataset and does its work at prediction time.
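The points above can be sketched without any library support. The following is a minimal illustration, not scikit-learn's implementation: the helper name knn_predict, the Euclidean distance, and the tiny dataset are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)
    # "Training" is just storing the data; all the work happens at prediction time.
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional dataset with two classes.
train_x = [[1], [2], [3], [4]]
train_y = [0, 0, 1, 1]
prediction = knn_predict(train_x, train_y, [1.1], k=3)
print(prediction)
```

For the query 1.1 the three nearest points are 1, 2 and 3, whose labels are 0, 0 and 1, so the vote goes to class 0.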
Code:
In the following code, we import KNeighborsClassifier from sklearn.neighbors, which assigns a new data point to the category that is most common among its nearest stored points.
- neighbour = KNeighborsClassifier(n_neighbors=3) creates a classifier that uses the 3 nearest neighbors.
- neighbour.fit(x, y) fits the classifier to the data.
x = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neighbour = KNeighborsClassifier(n_neighbors=3)
neighbour.fit(x, y)
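Fitting alone produces no prediction. As a short follow-up sketch, the fitted classifier can be queried with predict and predict_proba, both standard scikit-learn classifier methods; the query value 1.1 is chosen only for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

x = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]

neighbour = KNeighborsClassifier(n_neighbors=3)
neighbour.fit(x, y)

# The three nearest neighbors of 1.1 are 1, 2 and 3, with labels 0, 0 and 1.
predicted = neighbour.predict([[1.1]])
# Fraction of the three neighbors that belong to each class.
probabilities = neighbour.predict_proba([[1.1]])
print(predicted)
print(probabilities)
```

Two of the three neighbors carry label 0, so class 0 wins the vote with probability 2/3.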
In the following code, we create a NearestNeighbors object from an array that represents our dataset and ask which sample is the closest point to [2, 2, 2].
samples = [[0., 0., 0.], [0., .6, 0.], [2., 1., .6]]
from sklearn.neighbors import NearestNeighbors
neighbour = NearestNeighbors(n_neighbors=1)
neighbour.fit(samples)
print(neighbour.kneighbors([[2., 2., 2.]]))
After running the above code, we get the following output, which returns [1.72046] and [2]. This means the closest sample is the one at index 2, at a distance of about 1.72046.
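To see where that distance comes from, here is a small check that recomputes it by hand. It assumes the default metric, which is Euclidean distance; the names query and manual are introduced only for this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

samples = np.array([[0.0, 0.0, 0.0], [0.0, 0.6, 0.0], [2.0, 1.0, 0.6]])
query = np.array([2.0, 2.0, 2.0])

neighbour = NearestNeighbors(n_neighbors=1)
neighbour.fit(samples)
# kneighbors returns a (distances, indices) pair, one row per query point.
distances, indices = neighbour.kneighbors([query])

# Recompute the Euclidean distance from the query to every sample by hand.
manual = np.sqrt(((samples - query) ** 2).sum(axis=1))
print(indices[0][0], distances[0][0])
print(manual.argmin(), manual.min())
```

The smallest manual distance is sqrt(0 + 1 + 1.96) = sqrt(2.96), which belongs to the sample at index 2 and matches the value returned by kneighbors.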
Scikit learn KNN classification
In this section, we will learn how Scikit learn KNN classification works in Python.
- Scikit learn KNN is a non-parametric classification method. It can be used for both classification and regression but is mainly used for classification.
- In KNN classification, the output is a class membership. An object is classified by a majority vote of its neighbors: it is assigned to the class that is most common among its k nearest neighbors.
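The majority vote can be made visible by looking up the neighbors directly. This is a sketch under assumptions: the query point and k = 5 are arbitrary choices, and the vote is counted with Counter rather than by scikit-learn's internal code.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)

query = [[5.0, 3.5]]  # hypothetical sepal length and width, in cm
# Look up the indices of the 5 training points nearest to the query.
_, indices = classifier.kneighbors(query)
neighbor_labels = y[indices[0]]
# Count how many neighbors fall in each class.
votes = Counter(neighbor_labels.tolist())
print(votes)
print(classifier.predict(query))
```

The class with the most votes among the five neighbors is the class that predict returns.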
Code:
In the following code, we import neighbors and datasets from sklearn, fit a KNN classifier on the iris data, and plot the class that each point of a grid receives from the vote of its k nearest neighbors.
- X = iris.data[:, :2] takes the first two features.
- h = 0.04 sets the step size of the mesh.
- colormap_light = ListedColormap(["yellow", "orange", "lightblue"]) creates the colormap for the decision regions.
- classifier = neighbors.KNeighborsClassifier(n_neighbors, weights=weights) creates an instance of the KNN classifier.
- classifier.fit(X, y) fits the classifier to the data.
- xx, yy = num.meshgrid(num.arange(x_min, x_max, h), num.arange(y_min, y_max, h)) builds the mesh over which the decision boundaries are plotted.
- plot.contourf(xx, yy, Z, cmap=colormap_light) puts the result into a color plot.
import numpy as num
import matplotlib.pyplot as plot
import seaborn as sb
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 17
iris = datasets.load_iris()
# Take only the first two features so the result can be plotted in 2D
X = iris.data[:, :2]
y = iris.target
h = 0.04  # step size of the mesh
colormap_light = ListedColormap(["yellow", "orange", "lightblue"])
colormap_bold = ["pink", "c", "darkblue"]

for weights in ["uniform", "distance"]:
    # Create an instance of the KNN classifier and fit the data
    classifier = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    classifier.fit(X, y)
    # Predict the class of every point on a mesh covering the data
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = num.meshgrid(num.arange(x_min, x_max, h), num.arange(y_min, y_max, h))
    Z = classifier.predict(num.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Put the result into a color plot
    plot.figure(figsize=(8, 6))
    plot.contourf(xx, yy, Z, cmap=colormap_light)
    # Plot the training points
    sb.scatterplot(
        x=X[:, 0],
        y=X[:, 1],
        hue=iris.target_names[y],
        palette=colormap_bold,
        alpha=1.0,
        edgecolor="brown",
    )
    plot.xlim(xx.min(), xx.max())
    plot.ylim(yy.min(), yy.max())
    plot.title(
        "3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights)
    )
    plot.xlabel(iris.feature_names[0])
    plot.ylabel(iris.feature_names[1])
plot.show()
Output:
After running the above code, we get the following output: two plots of the KNN decision regions for the iris data, one for each weighting scheme.
Scikit learn KNN advantages and disadvantages
In this section, we will learn about the advantages and disadvantages of Scikit learn KNN in Python.
Advantages:
- The Scikit learn KNN algorithm is simple and easy to implement.
- Scikit learn KNN requires no training phase before making a prediction.
- In Scikit learn KNN, new data can be added easily, because there is no model to retrain.
- KNN is also called a lazy learner because it defers all computation to prediction time.
- Because it is a lazy learner, KNN has almost no training cost, so fitting is much faster than for algorithms that must build a model first.
- With a sufficiently large K, KNN can be robust to noisy training data.
Disadvantages:
- In the KNN algorithm, we always have to choose the value of K, which can be difficult.
- In KNN, the cost of prediction for a large dataset is high, because the distance between the new data point and every existing data point must be calculated.
- With a small K, KNN is very sensitive to noisy data.
- In KNN, we need to do feature scaling before applying the algorithm to a dataset, because features with large ranges otherwise dominate the distance.
- KNN does not work well with categorical features, because it is difficult to define a distance between categorical values.
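Two of these disadvantages, choosing K and feature scaling, are commonly handled together. The following is one possible sketch, not the only approach: it assumes the iris data, a StandardScaler in front of the classifier, and an arbitrary grid of candidate K values searched with GridSearchCV.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale every feature to zero mean and unit variance before measuring distances.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]
)

# Candidate values of K; this grid is an arbitrary choice for illustration.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15]}

# 5-fold cross-validation picks the K with the best mean accuracy.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, search.best_score_)
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the held-out fold never influences the scaling.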
Scikit learn KNN Regression
In this section, we will learn how Scikit learn KNN regression works in Python.
- The Scikit learn KNN regression algorithm predicts a value as the average of the values of the K nearest neighbors.
- In KNN regression, the output is the property value for the object, that is, a continuous number rather than a class label.
Code:
In the following code, we import KNeighborsRegressor from sklearn.neighbors, which predicts a value as the average of the values of the K nearest neighbors.
- neighbor = KNeighborsRegressor(n_neighbors=4) creates a regressor that uses the 4 nearest neighbors of a point.
- neighbor.fit(X, y) fits the k-nearest neighbor regressor to the training set.
X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
neighbor = KNeighborsRegressor(n_neighbors=4)
neighbor.fit(X, y)
After running the above code, the fitted KNeighborsRegressor object is returned. In an interactive session, its representation is displayed on the screen.
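To get an actual regression value, the fitted model has to be queried with predict. The sketch below uses the same data; the query 1.5 and the second model with n_neighbors=2 are assumptions added for comparison.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
query = [[1.5]]

# With k=4 and only four samples, every neighbor is used,
# so the prediction is the mean of all targets.
all_neighbors = KNeighborsRegressor(n_neighbors=4).fit(X, y)
prediction_k4 = all_neighbors.predict(query)[0]

# With k=2 only the two closest samples (1 and 2) contribute.
two_neighbors = KNeighborsRegressor(n_neighbors=2).fit(X, y)
prediction_k2 = two_neighbors.predict(query)[0]

# Recompute the k=2 prediction as the plain mean of the neighbor targets.
_, indices = two_neighbors.kneighbors(query)
manual_k2 = np.mean(np.asarray(y)[indices[0]])

print(prediction_k4, prediction_k2, manual_k2)
```

The k=4 model returns 0.5 for any query on this data, while the k=2 model returns 0.0 at 1.5, because both of its neighbors have target 0.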
Scikit learn KNN Regression Example
In this section, we will discuss a Scikit learn KNN regression example in Python.
As we know, the Scikit learn KNN regression algorithm predicts a value as the average of the values of the K nearest neighbors.
Code:
In the following code, we import neighbors from sklearn, fit a KNN regressor to noisy samples of a sine curve, and plot its predictions.
- y[::5] += 1 * (0.5 - num.random.rand(8)) adds noise to every fifth target.
- n_neighbors = 7 sets the number of neighbors used by the regression model.
- knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights) creates the KNN regressor.
- y_ = knn.fit(x, y).predict(T) fits the regressor to the training set and predicts on the points T.
- plot.plot(T, y_, color="blue", label="prediction") plots the predictions on the graph.
- plot.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights)) gives the title to the graph.
import numpy as num
import matplotlib.pyplot as plot
from sklearn import neighbors

num.random.seed(0)
# 40 sorted sample points in [0, 5) and 500 evenly spaced points to predict on
x = num.sort(5 * num.random.rand(40, 1), axis=0)
T = num.linspace(0, 5, 500)[:, num.newaxis]
y = num.sin(x).ravel()
# Add noise to every fifth target
y[::5] += 1 * (0.5 - num.random.rand(8))
n_neighbors = 7

for i, weights in enumerate(["uniform", "distance"]):
    # Fit the regressor and predict on the dense grid T
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(x, y).predict(T)
    plot.subplot(2, 1, i + 1)
    plot.scatter(x, y, color="pink", label="data")
    plot.plot(T, y_, color="blue", label="prediction")
    plot.axis("tight")
    plot.legend()
    plot.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights))
plot.tight_layout()
plot.show()
Output:
After running the above code, we get the following output, in which the KNN regression predictions are plotted against the data for both weighting schemes.
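The difference between the two weighting schemes can be checked by hand on a tiny dataset. This sketch assumes hypothetical data chosen so the arithmetic is easy to follow; it recomputes both predictions from the distances returned by kneighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny hypothetical dataset, chosen so the arithmetic is easy to follow.
X = np.array([[0.0], [1.0], [3.0]])
y = np.array([0.0, 1.0, 5.0])
query = np.array([[0.25]])

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

# Distances and indices of the two nearest neighbors of the query.
distances, indices = weighted.kneighbors(query)
neighbor_targets = y[indices[0]]

# "uniform": plain mean of the neighbor targets.
manual_uniform = neighbor_targets.mean()
# "distance": mean weighted by the inverse of each neighbor's distance.
inverse = 1.0 / distances[0]
manual_weighted = (inverse * neighbor_targets).sum() / inverse.sum()

print(uniform.predict(query)[0], manual_uniform)
print(weighted.predict(query)[0], manual_weighted)
```

With uniform weights the two neighbors (targets 0 and 1) average to 0.5. With distance weights the closer neighbor counts three times as much, which pulls the prediction down to 0.25.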
Scikit learn KNN Example
In this section, we will work through a Scikit learn KNN example in Python.
- KNN stands for K-nearest-neighbor. It is a non-parametric algorithm that can be used for both classification and regression but is mainly used for classification.
- KNN assumes that similar data points lie close to each other, so it places a new data point in the category that is most common among the stored points nearest to it.
Code:
In the following code, we import KNeighborsClassifier from sklearn.neighbors and compare its decision regions with those of a decision tree, a kernel SVM, and a soft voting ensemble of all three.
- iris = datasets.load_iris() loads the data.
- classifier1 = DecisionTreeClassifier(max_depth=4) creates the first classifier to train.
- xx, yy = num.meshgrid(num.arange(x_min, x_max, 0.1), num.arange(y_min, y_max, 0.1)) builds the mesh for plotting the decision regions.
- Z = classifier.predict(num.c_[xx.ravel(), yy.ravel()]) predicts the class of every point of the mesh.
- axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="r") plots the data points on the graph.
from itertools import product

import numpy as num
import matplotlib.pyplot as plot
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# Load the data and keep two features (sepal length and petal length)
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Train the classifiers
classifier1 = DecisionTreeClassifier(max_depth=4)
classifier2 = KNeighborsClassifier(n_neighbors=7)
classifier3 = SVC(gamma=0.1, kernel="rbf", probability=True)
eclassifier = VotingClassifier(
    estimators=[("dt", classifier1), ("knn", classifier2), ("svc", classifier3)],
    voting="soft",
    weights=[2, 1, 2],
)
classifier1.fit(X, y)
classifier2.fit(X, y)
classifier3.fit(X, y)
eclassifier.fit(X, y)

# Plot the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = num.meshgrid(num.arange(x_min, x_max, 0.1), num.arange(y_min, y_max, 0.1))
f, axarr = plot.subplots(2, 2, sharex="col", sharey="row", figsize=(10, 8))
for idx, classifier, tt in zip(
    product([0, 1], [0, 1]),
    [classifier1, classifier2, classifier3, eclassifier],
    ["Decision Tree (depth=4)", "KNN (k=7)", "Kernel SVM", "Soft Voting"],
):
    Z = classifier.predict(num.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="r")
    axarr[idx[0], idx[1]].set_title(tt)
plot.show()
Output:
After running the above code, we get the following output, in which four panels show the decision regions of the decision tree, KNN, kernel SVM, and soft voting classifiers.
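The soft voting panel combines the class probabilities of its members. As a rough sketch of that idea, the following uses a smaller ensemble than the example above (only the decision tree and KNN, an assumption made to keep it short) and recomputes the ensemble's probabilities as a weighted average.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

weights = [2, 1]
ensemble = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ],
    voting="soft",
    weights=weights,
)
ensemble.fit(X, y)

sample = X[:5]
# Class probabilities from each fitted member of the ensemble.
member_probabilities = [est.predict_proba(sample) for est in ensemble.estimators_]
# Soft voting: weighted average of the members' probabilities.
manual = np.average(member_probabilities, axis=0, weights=weights)

print(np.allclose(manual, ensemble.predict_proba(sample)))
```

The ensemble predicts the class with the highest averaged probability, so KNN contributes its neighbor fractions in proportion to its weight.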
Scikit learn KNN Imputation
In this section, we will learn how Scikit learn KNN imputation works in Python.
- KNN imputation uses the k-nearest-neighbor idea to identify the K samples that are closest to a sample with missing data.
- We use those K samples to estimate the missing values: each missing value is imputed with the mean value of that feature over the K neighbors.
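Before the larger benchmark, a minimal sketch on a four-row array shows the idea. The array and n_neighbors=2 are illustrative choices; KNNImputer is the scikit-learn class used later in this section.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array(
    [
        [1.0, 2.0, np.nan],
        [3.0, 4.0, 3.0],
        [np.nan, 6.0, 5.0],
        [8.0, 8.0, 7.0],
    ]
)

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows, where nearness is measured on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The missing value in the first row becomes 4.0, the mean of 3.0 and 5.0 from its two nearest rows, and the missing value in the third row becomes 5.5, the mean of 3.0 and 8.0.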
Code:
In the following code, we import load_diabetes from sklearn.datasets, remove some of its values, and compare how well different imputers replace the missing values.
- n_missing_samples = int(n_samples * missing_rate) computes the number of rows (75%) that receive a missing value.
- from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer imports the imputers from the sklearn.impute module.
- full_scores = cross_val_score(regressor, X_full, y_full, scoring="neg_mean_squared_error", cv=N_SPLITS) estimates the score on the full data.
- imputer = SimpleImputer(missing_values=num.nan, add_indicator=True, strategy="constant", fill_value=0) replaces the missing values by zero.
- imputer = KNNImputer(missing_values=num.nan, add_indicator=True) imputes the missing values from the nearest neighbors, and knn_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing) scores the result.
- imputer = SimpleImputer(missing_values=num.nan, strategy="mean", add_indicator=True) imputes the missing values with the mean.
- plot.figure(figsize=(12, 6)) creates the figure.
- axis1.set_title("KNN Imputation with Diabetes Data") gives the title to the graph.
import numpy as num
from sklearn.datasets import load_diabetes

rng = num.random.RandomState(42)
X_diabetes, y_diabetes = load_diabetes(return_X_y=True)


def add_missing_values(X_full, y_full):
    n_samples, n_features = X_full.shape
    # Add missing values in 75% of the rows
    missing_rate = 0.75
    n_missing_samples = int(n_samples * missing_rate)
    missing_samples = num.zeros(n_samples, dtype=bool)
    missing_samples[:n_missing_samples] = True
    rng.shuffle(missing_samples)
    missing_features = rng.randint(0, n_features, n_missing_samples)
    X_missing = X_full.copy()
    X_missing[missing_samples, missing_features] = num.nan
    y_missing = y_full.copy()
    return X_missing, y_missing


X_miss_diabetes, y_miss_diabetes = add_missing_values(X_diabetes, y_diabetes)

rng = num.random.RandomState(0)
from sklearn.ensemble import RandomForestRegressor

# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

N_SPLITS = 5
regressor = RandomForestRegressor(random_state=0)


def get_scores_for_imputer(imputer, X_missing, y_missing):
    # Impute first, then fit the regressor, inside each cross-validation fold
    estimator = make_pipeline(imputer, regressor)
    impute_scores = cross_val_score(
        estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
    )
    return impute_scores


x_labels = []
mses_diabetes = num.zeros(5)
stds_diabetes = num.zeros(5)


def get_full_score(X_full, y_full):
    full_scores = cross_val_score(
        regressor, X_full, y_full, scoring="neg_mean_squared_error", cv=N_SPLITS
    )
    return full_scores.mean(), full_scores.std()


mses_diabetes[0], stds_diabetes[0] = get_full_score(X_diabetes, y_diabetes)
x_labels.append("Full data")


def get_impute_zero_score(X_missing, y_missing):
    imputer = SimpleImputer(
        missing_values=num.nan, add_indicator=True, strategy="constant", fill_value=0
    )
    zero_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return zero_impute_scores.mean(), zero_impute_scores.std()


mses_diabetes[1], stds_diabetes[1] = get_impute_zero_score(
    X_miss_diabetes, y_miss_diabetes
)
x_labels.append("Zero imputation")


def get_impute_knn_score(X_missing, y_missing):
    imputer = KNNImputer(missing_values=num.nan, add_indicator=True)
    knn_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return knn_impute_scores.mean(), knn_impute_scores.std()


mses_diabetes[2], stds_diabetes[2] = get_impute_knn_score(
    X_miss_diabetes, y_miss_diabetes
)
x_labels.append("KNN Imputation")


def get_impute_mean(X_missing, y_missing):
    imputer = SimpleImputer(missing_values=num.nan, strategy="mean", add_indicator=True)
    mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return mean_impute_scores.mean(), mean_impute_scores.std()


mses_diabetes[3], stds_diabetes[3] = get_impute_mean(X_miss_diabetes, y_miss_diabetes)
x_labels.append("Mean Imputation")


def get_impute_iterative(X_missing, y_missing):
    imputer = IterativeImputer(
        missing_values=num.nan,
        add_indicator=True,
        random_state=0,
        n_nearest_features=5,
        sample_posterior=True,
    )
    iterative_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
    return iterative_impute_scores.mean(), iterative_impute_scores.std()


mses_diabetes[4], stds_diabetes[4] = get_impute_iterative(
    X_miss_diabetes, y_miss_diabetes
)
x_labels.append("Iterative Imputation")

# The scores are negative MSE values, so flip the sign
mses_diabetes = mses_diabetes * -1

import matplotlib.pyplot as plot

n_bars = len(mses_diabetes)
xval = num.arange(n_bars)
colors = ["b", "pink", "r", "orange", "green"]
plot.figure(figsize=(12, 6))
axis1 = plot.subplot(121)
for j in xval:
    axis1.barh(
        j,
        mses_diabetes[j],
        xerr=stds_diabetes[j],
        color=colors[j],
        alpha=0.6,
        align="center",
    )
axis1.set_title("KNN Imputation with Diabetes Data")
axis1.set_xlim(left=num.min(mses_diabetes) * 0.9, right=num.max(mses_diabetes) * 1.1)
axis1.set_yticks(xval)
axis1.set_xlabel("MSE")
axis1.invert_yaxis()
axis1.set_yticklabels(x_labels)
plot.show()
Output:
After running the above code, we get the following output: a bar chart titled "KNN Imputation with Diabetes Data" that compares the MSE of each imputation strategy.
Scikit learn KNN Distance
In this section, we will learn how Scikit learn KNN distance works in Python.
- Scikit learn KNN distance is the distance measured between a point and its nearest neighbors in the dataset.
- KNN assumes that similar data points lie close to each other, so it places a new data point in the category that is most common among the stored points nearest to it.
Code:
In the following code, we import NearestNeighbors from sklearn.neighbors, with which we can measure the distance from each point to its nearest neighbors in the dataset.
- Input_data = num.array([[-2, 2], [-3, 3], [-4, 4], [1, 2], [2, 3], [3, 4],[4, 5]]) defines the set of data.
- nearest_neighbor.fit(Input_data) fits the model to the input dataset.
- distances, indices = nearest_neighbor.kneighbors(Input_data) finds the K nearest neighbors of every point in the dataset.
from sklearn.neighbors import NearestNeighbors
import numpy as num
Input_data = num.array([[-2, 2], [-3, 3], [-4, 4], [1, 2], [2, 3], [3, 4],[4, 5]])
nearest_neighbor = NearestNeighbors(n_neighbors = 4, algorithm='ball_tree')
nearest_neighbor.fit(Input_data)
distances, indices = nearest_neighbor.kneighbors(Input_data)
print(indices)
print(distances)
print(nearest_neighbor.kneighbors_graph(Input_data).toarray())
Output:
After running the above code, we get the following output, in which the indices of the nearest neighbors, their distances, and the neighbor graph are printed on the screen as arrays.
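Which point counts as nearest depends on how distance is measured. The sketch below compares three values of the metric parameter of NearestNeighbors on two hypothetical points; the data and the choice of metrics are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.array([[3.0, 4.0], [6.0, 0.5]])  # hypothetical data
query = np.array([[0.0, 0.0]])

results = {}
for metric in ["euclidean", "manhattan", "chebyshev"]:
    model = NearestNeighbors(n_neighbors=1, metric=metric).fit(points)
    distances, indices = model.kneighbors(query)
    # Store (index of the nearest point, distance to it) for each metric.
    results[metric] = (int(indices[0][0]), float(distances[0][0]))
    print(metric, results[metric])
```

Under the Euclidean and Chebyshev metrics the first point is nearest (distances 5 and 4), while under the Manhattan metric the second point is nearer (6.5 against 7).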
So, in this tutorial, we discussed how Scikit learn KNN works and covered different examples related to its implementation. Here is the list of topics that we have covered.
- Scikit learn KNN
- Scikit learn KNN classification
- Scikit learn KNN advantages and disadvantages
- Scikit learn KNN Regression
- Scikit learn KNN Regression Example
- Scikit learn KNN Example
- Scikit learn KNN Imputation
- Scikit learn KNN Distance
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, Tensorflow, Scipy, and Scikit-Learn, working for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc.