💡K-Nearest Neighbors (KNN)
KNN is a nonparametric method: it makes no assumptions about the shape of the probability distribution underlying the population. KNN is a supervised learner for both classification and regression. The idea of KNN is as follows:
Compute the distance (e.g. Euclidean distance) between a new observation and all the training data to find its k nearest neighbors, and predict the label from them.
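The neighbor-finding step can be sketched with NumPy as follows. The training points and query point here are hypothetical values chosen for illustration:

```python
import numpy as np

# Toy training set: 5 points in 2-D (illustrative values, not from the text)
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])

# A new observation whose neighbors we want
p = np.array([0.5, 0.5])

# Euclidean distance from p to every training point
distances = np.sqrt(np.sum((X_train - p) ** 2, axis=1))

# Indices of the k nearest neighbors (smallest distances first)
k = 3
nearest = np.argsort(distances)[:k]
print(nearest)
```

With these values, the three points near the origin are selected as neighbors, while the two distant points are ignored.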
Regression: calculate a numerical value (e.g. the average value of k neighbors) as the prediction value
Classification: take a majority vote of k neighbors' label
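Both prediction rules above are one-liners once the neighbors' labels are in hand. The neighbor values below are hypothetical:

```python
import numpy as np
from collections import Counter

# Labels of the k = 3 nearest neighbors (hypothetical values for illustration)
neighbor_values = np.array([2.0, 3.0, 4.0])  # regression targets
neighbor_labels = np.array([1, 0, 1])        # classification labels

# Regression: predict the average of the neighbors' values
y_reg = neighbor_values.mean()  # 3.0

# Classification: predict the most common label among the neighbors
y_clf = Counter(neighbor_labels).most_common(1)[0][0]  # 1
```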
Nearest neighbors can handle a large number of classification and regression problems, and is often successful in classification settings where the decision boundary is very irregular.
import numpy as np
from scipy.stats import mode

class KNNClassifier():
    def __init__(self, K):
        self.K = K

    def fit(self, X_train, y_train):
        # store the training data
        self.X_train = X_train
        self.y_train = y_train
        # number of training examples
        self.n_train = X_train.shape[0]
        return self

    def predict(self, X_test):
        # number of test examples
        m_test = X_test.shape[0]
        # initialize y_pred
        y_pred = np.zeros(m_test)
        for i in range(m_test):
            # find the k nearest neighbors of the i-th test point
            neighbors = self.find_neighbors(X_test[i])
            # take a majority vote of the k neighbors' labels
            # (np.ravel keeps this working across scipy versions)
            y_pred[i] = np.ravel(mode(neighbors).mode)[0]
        return y_pred

    def find_neighbors(self, p):
        euclidean_dis = np.zeros(self.n_train)
        for i in range(self.n_train):
            # distance from p to the i-th training point
            euclidean_dis[i] = self.euclidean(p, self.X_train[i])
        # sort training labels in ascending order of distance
        index = euclidean_dis.argsort()
        y_train_sorted = self.y_train[index]
        return y_train_sorted[:self.K]

    def euclidean(self, p, p_train):
        return np.sqrt(np.sum(np.square(p - p_train)))
Code snippets are adapted from GeeksforGeeks.
The hyperparameter K can strongly affect the result. A larger K smooths the decision boundary and may lead to underfitting, while a smaller K makes the model sensitive to noise and may lead to overfitting. Thus, K needs to be chosen carefully, typically by evaluating several candidate values on held-out data. A common rule of thumb is to start near the square root of N, where N is the total number of samples.
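One way to choose K is a simple train/validation split: evaluate accuracy for several candidate values and keep the best. This sketch uses a self-contained minimal KNN on a hypothetical two-blob dataset (the data, split sizes, and candidate K values are all illustrative assumptions):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Hypothetical dataset: two Gaussian blobs, 50 points each
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then split into 70 training and 30 validation points
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
X_tr, y_tr, X_val, y_val = X[:70], y[:70], X[70:], y[70:]

def knn_predict(X_tr, y_tr, X_new, k):
    """Predict labels for X_new by majority vote of the k nearest training points."""
    preds = []
    for p in X_new:
        d = np.sqrt(np.sum((X_tr - p) ** 2, axis=1))
        nearest_labels = y_tr[np.argsort(d)[:k]]
        preds.append(Counter(nearest_labels).most_common(1)[0][0])
    return np.array(preds)

# Evaluate several candidate K values on the validation set
for k in (1, 3, 5, 9):
    acc = np.mean(knn_predict(X_tr, y_tr, X_val, k) == y_val)
    print(f"K={k}: validation accuracy {acc:.2f}")
```

In practice, k-fold cross-validation gives a more reliable estimate than a single split, especially on small datasets.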