💡K-Nearest Neighbors (KNN)

KNN is a nonparametric method: it makes no assumptions about the shape of the underlying probability distribution of the population. KNN is a supervised learner for both classification and regression. The idea of KNN is as follows:

  • Compute the distance (e.g., Euclidean distance) between a new observation and all the training data to find its k nearest neighbors, and predict the label from these.

  • Regression: compute a numerical value (e.g., the average of the k neighbors' values) as the prediction (see the regression sketch after the classifier code below)

  • Classification: take a majority vote of the k neighbors' labels

Nearest neighbors methods can handle a large number of classification and regression problems, and are often successful in classification settings where the decision boundary is very irregular.

import numpy as np
from scipy.stats import mode

class KNNClassifier():
    def __init__(self, K):
        self.K = K

    def fit(self, X_train, y_train):
        # store the training data (KNN is a lazy learner: nothing is fit here)
        self.X_train = X_train
        self.y_train = y_train
        # number of training examples
        self.n_train = X_train.shape[0]
        return self

    def predict(self, X_test):
        self.X_test = X_test
        # number of test examples, number of features
        self.m_test, self.n = X_test.shape
        # initialize y_pred
        y_pred = np.zeros(self.m_test)

        for i in range(self.m_test):
            # find the labels of the K nearest neighbors
            p = self.X_test[i]
            neighbors = self.find_neighbors(p)
            # take a majority vote of the K neighbors' labels
            # (keepdims=True requires SciPy >= 1.9)
            y_pred[i] = mode(neighbors, keepdims=True).mode[0]
        return y_pred

    def find_neighbors(self, p):
        euclidean_dis = np.zeros(self.n_train)
        for i in range(self.n_train):
            # distance from p to the i-th training example
            dis = self.euclidean(p, self.X_train[i])
            euclidean_dis[i] = dis

        # sort training labels in ascending order of distance to p
        index = euclidean_dis.argsort()
        y_train_sorted = self.y_train[index]
        return y_train_sorted[:self.K]

    def euclidean(self, p, p_train):
        return np.sqrt(np.sum(np.square(p - p_train)))

The classifier above is adapted from GeeksforGeeks.
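
As a quick sanity check, here is a hypothetical usage example on a tiny made-up dataset (the points and labels below are invented purely for illustration):

import numpy as np

# toy 2-D dataset: two loose clusters (values invented for illustration)
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                    [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNNClassifier(K=3).fit(X_train, y_train)
X_test = np.array([[1.1, 1.0], [3.0, 3.1]])
print(clf.predict(X_test))  # expected: [0. 1.]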
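
For the regression case mentioned in the list above, the only change is the aggregation step: instead of a majority vote, average the neighbors' target values. A minimal sketch, written as a standalone function rather than a class to keep it short:

import numpy as np

def knn_regress(X_train, y_train, X_test, K):
    """Predict by averaging the targets of the K nearest training points."""
    y_pred = np.zeros(len(X_test))
    for i, p in enumerate(X_test):
        # Euclidean distance from p to every training point
        dis = np.sqrt(np.sum(np.square(X_train - p), axis=1))
        # average the targets of the K closest points
        y_pred[i] = y_train[dis.argsort()[:K]].mean()
    return y_pred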

The hyperparameter K can strongly affect the result. A smaller K makes the decision boundary more flexible and can lead to overfitting, while a larger K smooths the boundary and can lead to underfitting. Thus, K needs to be chosen carefully. A common rule of thumb is to start with K near the square root of N, where N is the total number of samples.
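
The square-root value is only a starting point; a simple way to pick K in practice is to compare a few candidates on held-out data. A minimal sketch using the classifier defined above (choose_k, X_val, and y_val are hypothetical names, not from the original text):

import numpy as np

def choose_k(X_train, y_train, X_val, y_val, candidates):
    """Return the candidate K with the highest validation accuracy."""
    best_k, best_acc = candidates[0], -1.0
    for k in candidates:
        y_hat = KNNClassifier(K=k).fit(X_train, y_train).predict(X_val)
        acc = np.mean(y_hat == y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# seed the search around the sqrt(N) rule of thumb, e.g.:
# k0 = int(np.sqrt(len(X_train)))
# best = choose_k(X_train, y_train, X_val, y_val, [max(1, k0 - 2), k0, k0 + 2])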
