K-nearest Neighbors | Brilliant Math & Science Wiki

Akshay Padmanabha and Christopher Williams contributed

k-nearest neighbors (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbors.

For example, suppose a k-NN algorithm is given data points recording the weight and height of specific men and women, as plotted below. To determine the gender of an unknown input (the green point), k-NN can look at its k nearest neighbors (suppose \(k=3\)) and, in this case, would determine that the input's gender is male. This is a very simple and logical way of labeling unknown inputs, with a high rate of success.

[Figure: weight and height data points for men and women, with the unknown input shown as a green point]

k-NN is used in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters, and in gene expression analysis, the algorithm is used to determine which genes contribute to a certain characteristic. Overall, k-nearest neighbors provides a combination of simplicity and effectiveness that makes it an attractive algorithm for many machine learning tasks.

Contents

  • Properties
  • Classification and Regression
  • Parameter Selection
  • Pros and Cons
  • References

Properties

k-nearest neighbors, as described above, has various properties that differentiate it from other machine learning algorithms. First, k-NN is non-parametric, which means that it does not make any assumptions about the probability distribution of the input. This is useful for applications whose input properties are unknown, and it makes k-NN more robust than parametric algorithms in such settings. The trade-off is that parametric machine learning algorithms tend to produce fewer errors than non-parametric ones when their assumptions hold, since taking input probabilities into account can sharpen decision making.

Furthermore, k-NN is a type of lazy learning, which is a learning method that generalizes data in the testing phase, rather than during the training phase. This is contrasted with eager learning, which generalizes data in the training phase rather than the testing phase. A benefit of lazy learning is that it can quickly adapt to changes, since it is not expecting a certain generalized dataset. However, a major downside is that a huge amount of computation occurs during testing (actual use) rather than pre-computation during training.
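To make this cost trade-off concrete, here is a minimal sketch (my own illustration, not from the article) of a lazy learner: the training step only memorizes the data, and every distance computation is deferred to query time.

    import numpy as np

    class LazyKNN:
        """A bare-bones lazy learner: training is memorization, prediction does the work."""

        def fit(self, X, y):
            # Training phase: nothing is generalized; the dataset is simply stored.
            self.X = np.asarray(X, dtype=float)
            self.y = np.asarray(y)
            return self

        def kneighbors(self, query, k=3):
            # Testing phase: a distance to every stored point is computed (O(n) per query).
            dists = np.linalg.norm(self.X - np.asarray(query, dtype=float), axis=1)
            return self.y[np.argsort(dists)[:k]]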

Classification and Regression

k-nearest neighbors can be used in classification or regression machine learning tasks. Classification involves placing input points into appropriate categories, whereas regression involves predicting a continuous value for an input point from nearby data. In either case, determining a neighbor can be performed using many different notions of distance, with the most common being Euclidean and Hamming distance. Euclidean distance is the most popular: the length of a straight line between two points. Hamming distance is the analogous notion for strings, where the distance is the number of positions at which two strings differ. Furthermore, for certain multivariable tasks, distances must be normalized (or weighted) to accurately represent the correlation between variables and their strength of correlation.
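As a quick illustration (not part of the original article), the two distance notions above can be computed in a few lines of Python:

    import math

    def euclidean_distance(a, b):
        # Straight-line distance between two points with numeric coordinates.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def hamming_distance(s, t):
        # Number of positions at which two equal-length strings differ.
        if len(s) != len(t):
            raise ValueError("Hamming distance requires equal-length strings")
        return sum(c1 != c2 for c1, c2 in zip(s, t))

    print(euclidean_distance((1, 2), (4, 6)))      # 5.0
    print(hamming_distance("karolin", "kathrin"))  # 3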

For k-NN classification, an input is classified by a majority vote of its neighbors. That is, the algorithm obtains the class membership of its k neighbors and outputs the class that represents a majority of the k neighbors.

[Figure: an example of k-NN classification [1]]

Suppose we are trying to classify the green circle. Let us begin with \(k=3\) (the solid line). In this case, the algorithm would return a red triangle, since it constitutes a majority of the 3 neighbors. Likewise, with \(k=5\) (the dotted line), the algorithm would return a blue square.

If no majority is reached with the k neighbors, many courses of action can be taken. For example, one could use a plurality system or even use a different algorithm to determine the membership of that data point.
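A minimal majority-vote classifier along these lines might look like the following sketch (the toy points are made up for illustration and are not the figure's actual data):

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        # train: list of (point, label) pairs; query: the point to classify.
        neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Toy data loosely mirroring the figure: red triangles and blue squares.
    train = [((1.0, 1.2), "red triangle"), ((1.1, 0.9), "red triangle"),
             ((1.5, 1.4), "red triangle"), ((2.0, 2.1), "blue square"),
             ((2.2, 1.9), "blue square"), ((2.4, 2.3), "blue square")]
    print(knn_classify(train, (1.3, 1.3), k=3))  # "red triangle"
    print(knn_classify(train, (1.8, 1.8), k=5))  # "blue square" -- the vote can flip as k grows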

k-NN regression works in a similar manner. The value returned is the average value of the input's k neighbors.

[Figure: data points sampled from a sine wave, with some variance]

Suppose we have data points from a sine wave, as above (with some variance, of course), and our task is to produce a y value for a given x value. When given an input x value, k-NN returns the average y value of the input's k nearest neighbors. For example, if k-NN were asked to return the corresponding y value for \(x=0\), the algorithm would find the k nearest points to \(x=0\) and return the average y value corresponding to these k points. This algorithm is simple, but very successful for most x values.
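For instance, here is a short sketch of this averaging idea on synthetic sine data (the dataset below is generated for illustration, not taken from the figure):

    import math
    import random

    random.seed(0)
    # Synthetic training data: y = sin(x) plus a little noise.
    train = [(x, math.sin(x) + random.gauss(0, 0.1))
             for x in [i * 0.1 for i in range(-60, 61)]]

    def knn_regress(train, query_x, k=5):
        # Average the y values of the k training points whose x is closest to query_x.
        neighbors = sorted(train, key=lambda pt: abs(pt[0] - query_x))[:k]
        return sum(y for _, y in neighbors) / k

    print(knn_regress(train, 0.0))          # close to sin(0) = 0
    print(knn_regress(train, math.pi / 2))  # close to sin(pi/2) = 1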

Parameter Selection

Parameter selection is performed for most machine learning algorithms, including k-NN. To determine the number of neighbors to consider when running the algorithm (k), a common method involves choosing the k that minimizes the error rate on a validation set and then using that k for the test set. In general, a higher k reduces the effect of noise and localized anomalies but allows for more error near decision boundaries, and vice versa.
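As an illustration of this validation-set approach, here is a sketch using scikit-learn and its bundled Iris dataset (neither of which is mentioned in the article):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    # Pick the k with the highest validation accuracy (lowest error rate).
    best_k, best_acc = None, 0.0
    for k in range(1, 16):
        acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    print(best_k, best_acc)  # this k would then be used on the held-out test set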

Pros and Cons

k-NN is one of many algorithms used in machine learning tasks, in fields such as computer vision and gene expression analysis. So why use k-NN over other algorithms? The following is a list of k-NN's pros and cons relative to alternatives.

Pros:

  • Very easy to understand and implement. A k-NN implementation does not require much code and can be a quick and simple way to start working with a machine learning dataset.
  • Does not assume any probability distribution on the input data. This comes in handy for inputs whose probability distribution is unknown, so the algorithm remains robust in those cases.
  • Can quickly respond to changes in input. k-NN employs lazy learning, which generalizes during testing--this allows it to change during real-time use.

Cons:

  • Sensitive to localized data. Since k-NN gets all of its information from the input's neighbors, localized anomalies affect outcomes significantly, unlike algorithms that use a generalized view of the data.
  • Computation time. Lazy learning requires that most of k-NN's computation be done during testing, rather than during training. This can be an issue for large datasets.
  • Normalization. If one category occurs much more frequently than another, classifying an input will be biased towards the more common category (since it is more likely to be a neighbor of the input). This can be mitigated by applying a lower weight to more common categories and a higher weight to less common categories; however, this can still cause errors near decision boundaries.
  • Dimensions. In the case of many dimensions, inputs can commonly be "close" to many data points. This reduces the effectiveness of k-NN, since the algorithm relies on a correlation between closeness and similarity. One workaround for this issue is dimension reduction, which reduces the number of working variable dimensions (but can lose variable trends in the process).

References

  1. Ajanki, A. Example of k-nearest neighbour classification. Retrieved May 28, 2016, from https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm#/media/File:KnnClassification.svg

Cite as: K-nearest Neighbors. Brilliant.org. Retrieved from https://brilliant.org/wiki/k-nearest-neighbors/


FAQs

What is the mathematical formula for K nearest neighbor?

The k-NN algorithm

Define the set of the k nearest neighbors of x as \(S_x\). Formally, \(S_x\) is defined as \(S_x \subseteq D\) such that \(|S_x| = k\) and \(\forall (x', y') \in D \setminus S_x,\ \operatorname{dist}(x, x') \ge \max_{(x'', y'') \in S_x} \operatorname{dist}(x, x'')\); that is, every point in D but not in \(S_x\) is at least as far away from x as the furthest point in \(S_x\).
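Read literally, this definition just selects the k stored points closest to x; a direct sketch in Python (the helper name and data layout are my own assumptions, not from the FAQ):

    import math

    def k_nearest_set(D, x, k):
        # D: list of ((coordinates), label) pairs; returns S_x, the k pairs closest to x.
        return sorted(D, key=lambda pair: math.dist(pair[0], x))[:k]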

Why is KNN not used?

The main disadvantage lies in its resource-intensive nature: KNN does not generate a trained model but stores all training examples, leading to high operational costs and time consumption, especially on large datasets.

Is KNN tricky to implement?

Easy to implement and understand: to implement the KNN algorithm, we need only two parameters, i.e. the value of K and the distance metric (e.g. Euclidean or Manhattan). Since both parameters are easily interpretable, they are easy to understand.

Why is KNN a lazy learner?

K-NN is a non-parametric algorithm, which means that it does not make any assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs its computation on the stored data.

What is k-nearest neighbor in simple terms?

What is KNN? K Nearest Neighbour is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified.

Which algorithm is better than KNN?

KNN and SVM are both used for classification and regression, but they differ in approach: SVM finds the best boundary to separate the data, while KNN looks at the closest points to make predictions. KNN remembers all the training data and decides on new data based on how close it is to existing data points.

What are the disadvantages of K nearest neighbor?

KNN has some drawbacks and challenges, such as computational expense, slow speed, memory and storage issues for large datasets, sensitivity to the choice of k and the distance metric, and susceptibility to the curse of dimensionality.

What is the real life use of KNN?

K Nearest Neighbor (KNN) is a very simple, easy-to-understand, and versatile machine learning algorithm. It's used in many different areas, such as handwriting detection, image recognition, and video recognition.

Why is KNN bad for large datasets?

The KNN algorithm does not work well with large datasets. The cost of calculating the distance between the new point and each existing point is huge, which degrades performance. Feature scaling (standardization and normalization) is required before applying the KNN algorithm to any dataset.
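A brief sketch of the feature scaling mentioned here, using min-max scaling (one common choice; the FAQ does not prescribe a specific method):

    import numpy as np

    def min_max_scale(X):
        # Rescale each column to [0, 1] so no single feature dominates the distance.
        X = np.asarray(X, dtype=float)
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / np.where(maxs - mins == 0, 1, maxs - mins)

    # Height in cm vs. salary in dollars: wildly different ranges before scaling.
    print(min_max_scale([[150, 50000], [160, 52000], [180, 90000]]))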

Which distance is best for KNN?

The most intuitive and widely used distance metric for KNN is the Euclidean distance, which is the straight-line distance between two points in a vector space. It is calculated by taking the square root of the sum of the squared differences between the corresponding coordinates of the two points.

How to improve KNN accuracy?

What are the most effective ways to improve k-nearest neighbor search accuracy?
  1. Choose the right k value.
  2. Use a suitable distance metric.
  3. Scale and normalize the data.
  4. Reduce the dimensionality.
  5. Use an efficient data structure.
  6. Use an ensemble method.

When should we not use KNN?

The algorithm depends entirely on past observations, and it is costly to calculate distances on large datasets and on high-dimensional data.

Does KNN memorize the entire training set?

During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance.

What is the best way to choose k in KNN?

A commonly used starting value for K is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multiple classes, but you must be aware of outliers.
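The square-root rule of thumb is easy to encode; the odd-k adjustment below is a common extra convention (my addition, not stated in the FAQ) to avoid ties in binary classification:

    import math

    def suggested_k(n_samples):
        # Start near sqrt(N); bump to an odd value so binary votes cannot tie.
        k = max(1, round(math.sqrt(n_samples)))
        return k if k % 2 == 1 else k + 1

    print(suggested_k(400))  # -> 21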

What is the distance formula for k nearest neighbors?

How does KNN use distance measures? For two points \(X_1 = (x_1, y_1)\) and \(X_2 = (x_2, y_2)\), the common choices are:
  • Manhattan distance: \(d = |x_2 - x_1| + |y_2 - y_1|\)
  • Euclidean distance: \(d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\)
  • Minkowski distance of order \(p\): \(d = \left(|x_2 - x_1|^p + |y_2 - y_1|^p\right)^{1/p}\)
  • Hamming distance (for binary vectors or strings): the number of positions at which the two differ.

How to find k nearest neighbors?

How does the K-Nearest Neighbors algorithm work?
  1. Step #1 - Assign a value to K.
  2. Step #2 - Calculate the distance between the new data entry and all other existing data entries.
  3. Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
  4. Step #4 - Assign the new entry to the majority class among those neighbors (or, for regression, the average of their values).

What is maths nearest neighbour?

Finding the nearest neighbor is the process of plotting all the vectors in all their dimensions and then comparing a query vector against them. Using a simple coordinate system, you can mathematically measure how far one point is from another (known as their distance).

What is the formula for average nearest neighbor?

The average nearest neighbor ratio is calculated as the observed average distance divided by the expected average distance (with expected average distance being based on a hypothetical random distribution with the same number of features covering the same total area).
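Written out, one common formulation of this ratio (the expected-distance term below follows the standard form used in spatial statistics software and is my addition, not quoted from the FAQ) is:

\[
ANN = \frac{\bar{D}_O}{\bar{D}_E}, \qquad
\bar{D}_O = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
\bar{D}_E = \frac{0.5}{\sqrt{n / A}},
\]

where \(d_i\) is the distance from feature \(i\) to its nearest neighbor, \(n\) is the number of features, and \(A\) is the total study area.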
