K-nearest Neighbors | Brilliant Math & Science Wiki

Akshay Padmanabha and Christopher Williams contributed

k-nearest neighbors (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbors.

For example, suppose a k-NN algorithm is given data points recording the weight and height of specific men and women, as plotted below. To determine the gender of an unknown input (the green point), k-NN can look at its k nearest neighbors (suppose \(k=3\)) and, in this case, would determine that the input's gender is male. This is a very simple and logical way of labeling unknown inputs, with a high rate of success.

[Figure: weight and height data points for men and women, with the unknown input shown as a green point]

k-NN is used in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters, and in gene expression analysis, the algorithm is used to determine which genes contribute to a certain characteristic. Overall, k-nearest neighbors provides a combination of simplicity and effectiveness that makes it an attractive algorithm for many machine learning tasks.

Contents

  • Properties
  • Classification and Regression
  • Parameter Selection
  • Pros and Cons
  • References

Properties

k-nearest neighbors, as described above, has various properties that differentiate it from other machine learning algorithms. First, k-NN is non-parametric, which means that it does not make any assumptions about the probability distribution of the input. This is useful for applications whose input properties are unknown, and it makes k-NN more robust than parametric algorithms in such settings. The trade-off is that parametric machine learning algorithms tend to produce fewer errors than non-parametric ones when their assumptions hold, since taking input probabilities into account can sharpen decision making.

Furthermore, k-NN is a type of lazy learning, which is a learning method that generalizes data in the testing phase, rather than during the training phase. This is contrasted with eager learning, which generalizes data in the training phase rather than the testing phase. A benefit of lazy learning is that it can quickly adapt to changes, since it is not expecting a certain generalized dataset. However, a major downside is that a huge amount of computation occurs during testing (actual use) rather than pre-computation during training.
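To make this cost trade-off concrete, here is a minimal sketch (my own illustration, not from the article) of a lazy learner: the training step only memorizes the data, and every distance computation is deferred to query time.

    import numpy as np

    class LazyKNN:
        """A bare-bones lazy learner: training is memorization, prediction does the work."""

        def fit(self, X, y):
            # Training phase: nothing is generalized; the dataset is simply stored.
            self.X = np.asarray(X, dtype=float)
            self.y = np.asarray(y)
            return self

        def kneighbors(self, query, k=3):
            # Testing phase: a distance to every stored point is computed (O(n) per query).
            dists = np.linalg.norm(self.X - np.asarray(query, dtype=float), axis=1)
            return self.y[np.argsort(dists)[:k]]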

Classification and Regression

k-nearest neighbors can be used in classification or regression machine learning tasks. Classification involves placing input points into appropriate categories, whereas regression involves predicting a continuous value for an input point from nearby data. In either case, determining a neighbor can be performed using many different notions of distance, with the most common being Euclidean and Hamming distance. Euclidean distance is the most popular: the length of a straight line between two points. Hamming distance is the analogous notion for strings, where the distance is the number of positions at which two strings differ. Furthermore, for certain multivariable tasks, distances must be normalized (or weighted) to accurately represent the correlation between variables and their strength of correlation.
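As a quick illustration (not part of the original article), the two distance notions above can be computed in a few lines of Python:

    import math

    def euclidean_distance(a, b):
        # Straight-line distance between two points with numeric coordinates.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def hamming_distance(s, t):
        # Number of positions at which two equal-length strings differ.
        if len(s) != len(t):
            raise ValueError("Hamming distance requires equal-length strings")
        return sum(c1 != c2 for c1, c2 in zip(s, t))

    print(euclidean_distance((1, 2), (4, 6)))      # 5.0
    print(hamming_distance("karolin", "kathrin"))  # 3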

For k-NN classification, an input is classified by a majority vote of its neighbors. That is, the algorithm obtains the class membership of its k neighbors and outputs the class that represents a majority of the k neighbors.

[Figure: an example of k-NN classification [1]]

Suppose we are trying to classify the green circle. Let us begin with \(k=3\) (the solid line). In this case, the algorithm would return a red triangle, since it constitutes a majority of the 3 neighbors. Likewise, with \(k=5\) (the dotted line), the algorithm would return a blue square.

If no majority is reached with the k neighbors, many courses of action can be taken. For example, one could use a plurality system or even use a different algorithm to determine the membership of that data point.
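A minimal majority-vote classifier along these lines might look like the following sketch (the toy points are made up for illustration and are not the figure's actual data):

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        # train: list of (point, label) pairs; query: the point to classify.
        neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Toy data loosely mirroring the figure: red triangles and blue squares.
    train = [((1.0, 1.2), "red triangle"), ((1.1, 0.9), "red triangle"),
             ((1.5, 1.4), "red triangle"), ((2.0, 2.1), "blue square"),
             ((2.2, 1.9), "blue square"), ((2.4, 2.3), "blue square")]
    print(knn_classify(train, (1.3, 1.3), k=3))  # "red triangle"
    print(knn_classify(train, (1.8, 1.8), k=5))  # "blue square" -- the vote can flip as k grows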

k-NN regression works in a similar manner. The value returned is the average value of the input's k neighbors.

[Figure: data points sampled from a sine wave, with some variance]

Suppose we have data points from a sine wave, as above (with some variance, of course), and our task is to produce a y value for a given x value. When given an input x value, k-NN returns the average y value of the input's k nearest neighbors. For example, if k-NN were asked to return the corresponding y value for \(x=0\), the algorithm would find the k nearest points to \(x=0\) and return the average y value corresponding to these k points. This algorithm is simple, but very successful for most x values.
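For instance, here is a short sketch of this averaging idea on synthetic sine data (the dataset below is generated for illustration, not taken from the figure):

    import math
    import random

    random.seed(0)
    # Synthetic training data: y = sin(x) plus a little noise.
    train = [(x, math.sin(x) + random.gauss(0, 0.1))
             for x in [i * 0.1 for i in range(-60, 61)]]

    def knn_regress(train, query_x, k=5):
        # Average the y values of the k training points whose x is closest to query_x.
        neighbors = sorted(train, key=lambda pt: abs(pt[0] - query_x))[:k]
        return sum(y for _, y in neighbors) / k

    print(knn_regress(train, 0.0))          # close to sin(0) = 0
    print(knn_regress(train, math.pi / 2))  # close to sin(pi/2) = 1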

Parameter Selection

Parameter selection is performed for most machine learning algorithms, including k-NN. To determine the number of neighbors to consider when running the algorithm (k), a common method involves choosing the k that minimizes the error rate on a validation set and then using that k for the test set. In general, a higher k reduces the effect of noise and localized anomalies but allows for more error near decision boundaries, and vice versa.
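As an illustration of this validation-set approach, here is a sketch using scikit-learn and its bundled Iris dataset (neither of which is mentioned in the article):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    # Pick the k with the highest validation accuracy (lowest error rate).
    best_k, best_acc = None, 0.0
    for k in range(1, 16):
        acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    print(best_k, best_acc)  # this k would then be used on the held-out test set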

Pros and Cons

k-NN is one of many algorithms used in machine learning tasks, in fields such as computer vision and gene expression analysis. So why use k-NN over other algorithms? The following is a list of k-NN's pros and cons relative to alternatives.

Pros:

  • Very easy to understand and implement. A k-NN implementation does not require much code and can be a quick and simple way to start working with a machine learning dataset.
  • Does not assume any probability distribution on the input data. This comes in handy for inputs whose probability distribution is unknown, so the algorithm remains robust in those cases.
  • Can quickly respond to changes in input. k-NN employs lazy learning, which generalizes during testing--this allows it to change during real-time use.

Cons:

  • Sensitive to localized data. Since k-NN gets all of its information from the input's neighbors, localized anomalies affect outcomes significantly, unlike algorithms that use a generalized view of the data.
  • Computation time. Lazy learning requires that most of k-NN's computation be done during testing, rather than during training. This can be an issue for large datasets.
  • Normalization. If one category occurs much more frequently than another, classifying an input will be biased towards the more common category (since it is more likely to be a neighbor of the input). This can be mitigated by applying a lower weight to more common categories and a higher weight to less common categories; however, this can still cause errors near decision boundaries.
  • Dimensions. In the case of many dimensions, inputs can commonly be "close" to many data points. This reduces the effectiveness of k-NN, since the algorithm relies on a correlation between closeness and similarity. One workaround for this issue is dimension reduction, which reduces the number of working variable dimensions (but can lose variable trends in the process).

References

  1. Ajanki, A. Example of k-nearest neighbour classification. Retrieved May 28, 2016, from https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm#/media/File:KnnClassification.svg

Cite as: K-nearest Neighbors. Brilliant.org. Retrieved from https://brilliant.org/wiki/k-nearest-neighbors/


FAQs

What is the mathematical formula for K nearest neighbor?

The k-NN algorithm

Define the set of the k nearest neighbors of x as \(S_x\). Formally, \(S_x\) is defined as \(S_x \subseteq D\) such that \(|S_x| = k\) and \(\forall (x', y') \in D \setminus S_x,\ \operatorname{dist}(x, x') \ge \max_{(x'', y'') \in S_x} \operatorname{dist}(x, x'')\); that is, every point in D but not in \(S_x\) is at least as far away from x as the furthest point in \(S_x\).
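Read literally, this definition just selects the k stored points closest to x; a direct sketch in Python (the helper name and data layout are my own assumptions, not from the FAQ):

    import math

    def k_nearest_set(D, x, k):
        # D: list of ((coordinates), label) pairs; returns S_x, the k pairs closest to x.
        return sorted(D, key=lambda pair: math.dist(pair[0], x))[:k]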

Why is KNN not used?

The main disadvantage lies in its resource-intensive nature: KNN does not generate a trained model but stores all training examples, leading to high operational costs and time consumption, especially on large datasets.

Is KNN tricky to implement?

Easy to implement and understand: to implement the KNN algorithm, we need only two parameters, i.e. the value of K and the distance metric (e.g. Euclidean or Manhattan). Since both parameters are easily interpretable, they are easy to understand.

Why is KNN a lazy learner?

K-NN is a non-parametric algorithm, which means that it does not make any assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs its computation on the stored data.

What is k-nearest neighbor in simple terms?

What is KNN? K Nearest Neighbour is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified.

Which algorithm is better than KNN?

KNN and SVM are both used for classification and regression, but they differ in approach: SVM finds the best boundary to separate the data, while KNN looks at the closest points to make predictions. KNN remembers all the training data and decides on new data based on how close it is to existing data points.

What are the disadvantages of K nearest neighbor?

KNN has some drawbacks and challenges, such as computational expense, slow speed, memory and storage issues for large datasets, sensitivity to the choice of k and the distance metric, and susceptibility to the curse of dimensionality.

What is the real life use of KNN?

K Nearest Neighbor (KNN) is a very simple, easy-to-understand, and versatile machine learning algorithm. It's used in many different areas, such as handwriting detection, image recognition, and video recognition.

Why is KNN bad for large datasets?

The KNN algorithm does not work well with large datasets. The cost of calculating the distance between the new point and each existing point is huge, which degrades performance. Feature scaling (standardization and normalization) is required before applying the KNN algorithm to any dataset.
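A brief sketch of the feature scaling mentioned here, using min-max scaling (one common choice; the FAQ does not prescribe a specific method):

    import numpy as np

    def min_max_scale(X):
        # Rescale each column to [0, 1] so no single feature dominates the distance.
        X = np.asarray(X, dtype=float)
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / np.where(maxs - mins == 0, 1, maxs - mins)

    # Height in cm vs. salary in dollars: wildly different ranges before scaling.
    print(min_max_scale([[150, 50000], [160, 52000], [180, 90000]]))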

Which distance is best for KNN?

The most intuitive and widely used distance metric for KNN is the Euclidean distance, which is the straight-line distance between two points in a vector space. It is calculated by taking the square root of the sum of the squared differences between the corresponding coordinates of the two points.

How to improve KNN accuracy?

What are the most effective ways to improve k-nearest neighbor search accuracy?
  1. Choose the right k value.
  2. Use a suitable distance metric.
  3. Scale and normalize the data.
  4. Reduce the dimensionality.
  5. Use an efficient data structure.
  6. Use an ensemble method.

When should we not use KNN?

The algorithm depends entirely on past observations, and it is costly to calculate distances on large datasets and on high-dimensional data.

Does KNN memorize the entire training set?

During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance.

What is the best way to choose k in KNN?

A commonly used starting value for K is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multiple classes, but you must be aware of outliers.
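The square-root rule of thumb is easy to encode; the odd-k adjustment below is a common extra convention (my addition, not stated in the FAQ) to avoid ties in binary classification:

    import math

    def suggested_k(n_samples):
        # Start near sqrt(N); bump to an odd value so binary votes cannot tie.
        k = max(1, round(math.sqrt(n_samples)))
        return k if k % 2 == 1 else k + 1

    print(suggested_k(400))  # -> 21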

What is the distance formula for k nearest neighbors?

How does KNN use distance measures? For two points \(X_1 = (x_1, y_1)\) and \(X_2 = (x_2, y_2)\), the common choices are:
  • Manhattan distance: \(d = |x_2 - x_1| + |y_2 - y_1|\)
  • Euclidean distance: \(d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\)
  • Minkowski distance of order \(p\): \(d = \left(|x_2 - x_1|^p + |y_2 - y_1|^p\right)^{1/p}\)
  • Hamming distance (for binary vectors or strings): the number of positions at which the two differ.

How to find k nearest neighbors?

How does the K-Nearest Neighbors algorithm work?
  1. Step #1 - Assign a value to K.
  2. Step #2 - Calculate the distance between the new data entry and all other existing data entries.
  3. Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
  4. Step #4 - Assign the new entry to the majority class among those neighbors (or, for regression, the average of their values).

What is maths nearest neighbour?

Finding the nearest neighbor is the process of plotting all the vectors in all their dimensions and then comparing a query vector against them. Using a simple coordinate system, you can mathematically measure how far one point is from another (known as their distance).

What is the formula for average nearest neighbor?

The average nearest neighbor ratio is calculated as the observed average distance divided by the expected average distance (with expected average distance being based on a hypothetical random distribution with the same number of features covering the same total area).
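Written out, one common formulation of this ratio (the expected-distance term below follows the standard form used in spatial statistics software and is my addition, not quoted from the FAQ) is:

\[
ANN = \frac{\bar{D}_O}{\bar{D}_E}, \qquad
\bar{D}_O = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
\bar{D}_E = \frac{0.5}{\sqrt{n / A}},
\]

where \(d_i\) is the distance from feature \(i\) to its nearest neighbor, \(n\) is the number of features, and \(A\) is the total study area.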
