Guide to K-Nearest Neighbors Algorithm in Machine Learning (2024)

Introduction

In the four years of my data science career, I have built more than 80% of classification models and just 15-20% of regression models. These ratios can be more or less generalized throughout the industry. The reason behind this bias towards classification models is that most analytical problems involve making decisions. In this article, we will talk about one such widely used machine learning classification techniquecalled the k-nearest neighbors (KNN) algorithm. Our focus will primarily be on how the algorithm works on new data and how the input parameter affects the output/prediction.

Note: People who prefer to learn through videos can learn the same through our free course – K-Nearest Neighbors (KNN) Algorithm in Python and R. And if you are a complete beginner to Data Science and Machine Learning, check out our Certified BlackBelt program –

Learning Objectives

  • Understand the working of KNN and how it operates in python and R.
  • Get to know how to choose the right value of k for KNN
  • Understand the difference between training error rate and validation error rate

Table of contents

  • What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?
  • When Do We Use the KNN Algorithm?
  • How Does the KNN Algorithm Work?
  • How Do We Choose the Factor K?
  • Breaking It Down – Pseudo Code of KNN
  • Implementation in Python From Scratch
  • Comparing Our Model With Scikit-learn
  • Implementation of KNN in R
  • Comparing Our KNN Predictor Function With “Class” Library
  • Frequently Asked Questions

What is KNN (K-Nearest Neighbor) Algorithm in Machine Learning?

The K-Nearest Neighbors (KNN) algorithm is a popular machine learning technique used for classification and regression tasks. It relies on the idea that similar data points tend to have similar labels or values.

During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance.

Next, the algorithm identifies the K nearest neighbors to the input data point based on their distances. In the case of classification, the algorithm assigns the most common class label among the K neighbors as the predicted label for the input data point. For regression, it calculates the average or weighted average of the target values of the K neighbors to predict the value for the input data point.

The KNN algorithm is straightforward and easy to understand, making it a popular choice in various domains. However, its performance can be affected by the choice of K and the distance metric, so careful parameter tuning is necessary for optimal results.

When Do We Use the KNN Algorithm?

KNN Algorithm can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique, we generally look at 3 important aspects:

1. Ease of interpreting output

2. Calculation time

3. Predictive Power

Let us take a few examples to place KNN in the scale :

KNN classifier fairs across all parameters of consideration. It is commonly used for its ease of interpretation and low calculation time.

How Does the KNN Algorithm Work?

Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS):

Guide to K-Nearest Neighbors Algorithm in Machine Learning (3)

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The “K” in KNN algorithm is the nearest neighbor we wish to take the vote from. Let’s say K = 3. Hence, we will now make a circle with BS as the center just as big as to enclose only three data points on the plane. Refer to the following diagram for more details:

The three closest points to BS are all RC. Hence, with a good confidence level, we can say that the BS should belong to the class RC. Here, the choice became obvious as all three votes from the closest neighbor went to RC. The choice of the parameter K is very crucial in this algorithm. Next, we will understand the factors to be considered to conclude the best K.

How Do We Choose the Factor K?

First, let us try to understand the influence of the K-nearest neighbors (KNN) in the algorithm. If we consider the last example, keeping all 6 training observations constant, a given K value allows us to establish boundaries for each class. These decision boundaries effectively segregate, for instance, RC from GS. Similarly, let’s examine the impact of the value “K” on these class boundaries. The following illustrates the distinct boundaries that separate the two classes, each corresponding to different values of K.

If you watch carefully, you can see that the boundary becomes smoother with increasing value of K. With K increasing to infinity it finally becomes all blue or all red depending on the total majority. The training error rate and the validation error rate are two parameters we need to access different K-value. Following is the curve for the training error rate with a varying value of K :

Guide to K-Nearest Neighbors Algorithm in Machine Learning (7)

As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself.Hence the prediction is always accurate with K=1. If validation error curve would have been similar, our choice of K would have been 1. Following is the validation error curve with varying value of K:

This makes the story more clear. At K=1, we were overfitting the boundaries. Hence, error rate initially decreases and reaches a minima. After the minima point, it then increase with increasing K. To get the optimal value of K, you can segregate the training and validation from the initial dataset. Now plot the validation error curve to get the optimal value of K. This value of K should be used for all predictions.

The above content can be understood more intuitively using our free course – K-Nearest Neighbors (KNN) Algorithm in Python and R

Breaking It Down – Pseudo Code of KNN

We can implement a KNN model by following the below steps:

  1. Load the data
  2. Initialise the value of k
  3. For getting the predicted class, iterate from 1 to total number of training data points
    • Calculate the distance between test data and each row of training dataset.Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other distance function or metrics that can be used are Manhattan distance, Minkowski distance, Chebyshev, cosine, etc. If there are categorical variables, hamming distance can be used.
    • Sort the calculated distances in ascending order based on distance values
    • Get top k rows from the sorted array
    • Get the most frequent class of these rows
    • Return the predicted class

Implementation in Python From Scratch

We will be using the popular Iris dataset for building our KNN model. You can download it from here.

Comparing Our Model With Scikit-learn

from sklearn.neighbors import KNeighborsClassifierneigh = KNeighborsClassifier(n_neighbors=3)neigh.fit(data.iloc[:,0:4], data['Name'])# Predicted classprint(neigh.predict(test))-> ['Iris-virginica']# 3 nearest neighborsprint(neigh.kneighbors(test)[1])-> [[141 139 120]]

We can see that both the models predicted the same class (‘Iris-virginica’) and the same nearest neighbors ( [141 139 120] ). Hence we can conclude that our model runs as expected.

Implementation of KNN in R

Step 1: Importing the data
Step 2: Checking the data and calculating the data summary

Output

#Top observations present in the dataSepalLength SepalWidth PetalLength PetalWidth Name1 5.1 3.5 1.4 0.2 Iris-setosa2 4.9 3.0 1.4 0.2 Iris-setosa3 4.7 3.2 1.3 0.2 Iris-setosa4 4.6 3.1 1.5 0.2 Iris-setosa5 5.0 3.6 1.4 0.2 Iris-setosa6 5.4 3.9 1.7 0.4 Iris-setosa#Check the dimensions of the data[1] 150 5#Summarise the dataSepalLength SepalWidth PetalLength PetalWidth NameMin. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 Iris-setosa :501st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 Iris-versicolor:50Median :5.800 Median :3.000 Median :4.350 Median :1.300 Iris-virginica :50Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.1993rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

Step 3: Splitting the Data

Step 4: Calculating the Euclidean Distance

Step 5: Writing the function to predict kNN
Step 6: Calculating the label(Name) for K=1

Output

For K=1[1] "Iris-virginica"

In the same way, you can compute for other values of K.

Comparing Our KNN Predictor Function With “Class” Library

Output

For K=1[1] "Iris-virginica"

We can see that both models predicted the same class (‘Iris-virginica’).

Conclusion

The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. KNN algorithm can also be used for regression problems. The only difference from the discussed methodology will be using averages of nearest neighbors rather than voting from k-nearest neighbors. KNN can be coded in a single line on R. I am yet to explore how we can use the KNN algorithm on SAS.

Key Takeaways

  • KNN classifier operates by finding the k nearest neighbors to a given data point, and it takes the majority vote to classify the data point.
  • The value of k is crucial, and one needs to choose it wisely to prevent overfitting or underfitting the model.
  • One can use cross-validation to select the optimal value of k for the k-NN algorithm, which helps improve its performance and prevent overfitting or underfitting. Cross-validation is also used to identify the outliers before applying the KNN algorithm.
  • The above article provides implementations of KNN in Python and R, and it compares the result with scikit-learn and the “Class” library in R.

Frequently Asked Questions

Q1. What is K nearest neighbors algorithm?

A. KNN classifier is a machine learning algorithm used for classification and regression problems. It works by finding the K nearest points in the training dataset and uses their class to predict the class or value of a new data point. It can handle complex data and is also easy to implement, which is why KNN has become a popular tool in the field of artificial intelligence.

Q2. What is KNN algorithm used for?

A. KNN algorithm is most commonly used for:
1. Disease prediction – Predicting the likelihood of diseases based on symptoms.
2. Handwriting recognition – Recognizing handwritten characters.
3. Image classification – Categorizing and recognizing images.

Q3. What is the difference between KNN and Artificial Neural Networks?

A. K-nearest neighbors (KNN) are mainly used for classification and regression problems, while Artificial Neural Networks (ANN) are used for complex function approximation and pattern recognition problems. Moreover, ANN has a higher computational cost than KNN.

K nearestKNNknn from scratchlive codingmachine learningSimplied series

Tavish Srivastava22 May, 2024

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.

AlgorithmBig dataBusiness AnalyticsClassificationIntermediate

Guide to K-Nearest Neighbors Algorithm in Machine Learning (2024)

FAQs

What is the best way to choose k in KNN? ›

The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of the outliers.

Which is the number of nearby neighbors to be used to classify the new record? ›

We'll then assign a value to K which denotes the number of neighbors to consider before classifying the new data entry. Let's assume the value of K is 3. Since the value of K is 3, the algorithm will only consider the 3 nearest neighbors to the green point (new entry).

What is the goal of the K nearest neighbors algorithm? ›

The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. It is one of the popular and simplest classification and regression classifiers used in machine learning today.

What is the rule of thumb for KNN? ›

K-NN algorithm is an ad-hoc classifier used to classify test data based on distance metric. However, the value of K is non-parametric and a general rule of thumb in choosing the value of K is K=√n, where n stands for the number of samples in the training dataset.

Which algorithm is better than KNN? ›

KNN vs SVM are both used for classification and regression, but SVM finds the best line to separate data. As well as while KNN looks at the closest points to make predictions. Approach: KNN remembers all the training data and decides on new data based on how close they are to existing data points.

How can you avoid overfitting in KNN? ›

To prevent overfitting, we can smooth the decision boundary by K nearest neighbors instead of 1. Find the K training samples , r = 1 , … , K closest in distance to , and then classify using majority vote among the k neighbors.

Does KNN need training? ›

Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly, which will not impact the accuracy of the algorithm. KNN is very easy to implement. There are only two parameters required to implement KNN—the value of K and the distance function (e.g. Euclidean, Manhattan, etc.)

What are the disadvantages of KNN? ›

3 Disadvantages of KNN

KNN has some drawbacks and challenges, such as computational expense, slow speed, memory and storage issues for large datasets, sensitivity to the choice of k and the distance metric, and susceptibility to the curse of dimensionality.

What happens if k is too large in KNN? ›

The value of k in the KNN algorithm is related to the error rate of the model. A small value of k could lead to overfitting as well as a big value of k can lead to underfitting. Overfitting imply that the model is well on the training data but has poor performance when new data is coming.

What is the formula for k-nearest neighbor? ›

The k-nearest neighbor classifier fundamentally relies on a distance metric. The better that metric reflects label similarity, the better the classified will be. The most common choice is the Minkowski distance dist(x,z)=(d∑r=1|xr−zr|p)1/p.

Why is KNN a lazy learner? ›

K-NN is a non-parametric algorithm, which means that it does not make any assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it stores the data set and at the time of classification it performs an action on the data set.

How to use KNN to predict? ›

How Does the KNN Algorithm Work?
  1. First, the distance between the new point and each training point is calculated.
  2. The closest k data points are selected (based on the distance). ...
  3. The average of these data points is the final prediction for the new point.

Is KNN a classification or regression? ›

Nov 5, 2023. 76. K-Nearest Neighbors (KNN) is a non-parametric machine learning algorithm that can be used for both classification and regression tasks.

How to improve performance of KNN model? ›

What are the most effective ways to improve k-nearest neighbor search accuracy?
  1. Choose the right k value.
  2. Use a suitable distance metric.
  3. Scale and normalize the data. Be the first to add your personal experience.
  4. Reduce the dimensionality. ...
  5. Use an efficient data structure. ...
  6. Use an ensemble method. ...
  7. Here's what else to consider.
Dec 28, 2023

How do you choose K in K clustering? ›

Average Silhouette Score: Compute the average silhouette score for each K value by taking the mean of all the individual silhouette scores. Identify the Optimal K: Select the K value that yields the highest average silhouette score as the optimal number of clusters.

How would you choose the value of k? ›

The Elbow Method

Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS becomes first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.

Why does K 1 in KNN give the best accuracy? ›

The kNN classifier with k=1 will come out as the best one since we are choosing our test points from the training points, the 1NN classifier will have remembered the correct label and the 1NN classifier will achieve 0 error rate. We call this overfitting.

Top Articles
Latest Posts
Recommended Articles
Article information

Author: Aracelis Kilback

Last Updated:

Views: 5873

Rating: 4.3 / 5 (44 voted)

Reviews: 83% of readers found this page helpful

Author information

Name: Aracelis Kilback

Birthday: 1994-11-22

Address: Apt. 895 30151 Green Plain, Lake Mariela, RI 98141

Phone: +5992291857476

Job: Legal Officer

Hobby: LARPing, role-playing games, Slacklining, Reading, Inline skating, Brazilian jiu-jitsu, Dance

Introduction: My name is Aracelis Kilback, I am a nice, gentle, agreeable, joyous, attractive, combative, gifted person who loves writing and wants to share my knowledge and understanding with you.