A detailed explanation of the K-Nearest Neighbors algorithm
The internet is full of articles about the K-Nearest Neighbors algorithm, but I have noticed that not all of them fully explain the logic behind it, and those that do are often very academic. I will break this algorithm down in detail, so it is understandable for many instead of a few.
1. The intuition behind k-NN
k-NN comes from the family of supervised learning algorithms, whose goal is to make accurate predictions about unknown data after being trained on known data.
Specifically, the intuition behind k-NN is to look at the nearby data points in order to classify an unknown data point. For example, if a data point is surrounded by many blue circles and only a few red ones, then you are more likely to classify it as a blue circle too. (Note: as you may have already guessed, in this case there are 2 classes in our data set, red circles and blue circles, but there can be more.)
2. What does “k” stand for in k-NN?
Now here comes another question: how many nearest neighbors do we look at when identifying (classifying) our unknown data item? That is why the algorithm is called “k” nearest neighbors, where “k” is the number of nearest neighbors we consider for the classification. We can set k to 1, 3, 5 or more.
Note: it is recommended to select an odd number for k to avoid a tie, where you have an equal number of neighbors from each class, for example 3 red circles and 3 blue ones when k = 6. In such a case a simple majority vote cannot decide the class of the unidentified item.
3. What / Where is the nearest neighbor?
Well, so far so good. But how do we determine the “nearest”, in other words the distance between the unknown data item and its neighbors, so that we can identify the closest items? For that we have several options to calculate the distance, and by options I mean mathematical formulas (sketched in code after this list):
- Euclidean Distance
- Manhattan Distance
- Hamming Distance
- Cosine Similarity
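Here is a minimal Python sketch of these options, assuming each data point is a plain tuple of numbers (a sequence of symbols in the case of Hamming distance). The function names are my own choices for illustration, not from any particular library:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences ("city block" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # Number of positions where the two sequences differ.
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (1 means same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Note that cosine similarity is a similarity measure rather than a distance: higher values mean the two vectors point in more similar directions.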
Let’s pick Euclidean Distance. Here is a short intro to this formula:
Consider a 2D Cartesian coordinate system, where we want to measure the distance between two points A(x1, y1) and B(x2, y2), which is the length of the straight path connecting them. By the Pythagorean theorem, the length of this path is:
d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
Thus, using the Euclidean Distance formula, we can measure how near each neighbor is to the target item. If, say, k = 1, then we measure the distance from the unknown item to each of its neighbors and pick the single closest neighbor for the classification.
4. Classification: identifying the class of an unknown data item
As soon as we find the k closest neighbors using our distance formula (here, Euclidean Distance), we can classify the unknown data item. If k = 3, we pick the 3 closest neighbors, and if at least 2 of them are classified as, for example, blue circles, then we classify the unknown data item as a blue circle too.
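To make this concrete, here is a minimal sketch of such a majority-vote classifier. The name knn_classify and the (point, label) data layout are my own choices for illustration; math.dist is the standard-library Euclidean distance (Python 3.8+):

```python
import math
from collections import Counter

def knn_classify(unknown, training_points, k):
    # training_points: list of (point, label) pairs with known classes.
    # 1. Compute the Euclidean distance from the unknown point to every known point.
    distances = [(math.dist(unknown, point), label)
                 for point, label in training_points]
    # 2. Keep the k nearest neighbors (smallest distances).
    nearest = sorted(distances)[:k]
    # 3. Majority vote among the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```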
Let’s see this in the next example. Consider a table with circle coordinates and their colors:
circle   color
{6,3}    blue
{5,7}    blue
{3,4}    red
{1,4}    red
{7,4}    unidentified
We want to identify the color of the circle located at {7,4} in our Cartesian space. As we decide to identify its color based on its 3 nearest neighbors (i.e. k = 3), we need to calculate the distance between our target data point and each of its neighbors using the Euclidean distance formula and highlight the 3 shortest distances among them.
circle   color          distance to {7,4}
{6,3}    blue           sqrt((7-6)^2 + (4-3)^2) ≈ 1.4
{5,7}    blue           sqrt((7-5)^2 + (4-7)^2) ≈ 3.6
{3,4}    red            sqrt((7-3)^2 + (4-4)^2) = 4
{1,4}    red            sqrt((7-1)^2 + (4-4)^2) = 6
{7,4}    unidentified   (target)
As we can see, the 3 closest points to the target are 2 blue circles located at {6,3} and {5,7} and 1 red circle located at {3,4}. As the majority of these neighbors are “blue”, the target circle will be identified as a “blue circle” too.
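For completeness, here is how the knn_classify sketch above would reproduce this result, with the data and k taken straight from the table:

```python
# Known circles from the table: (coordinates, color).
training_points = [
    ((6, 3), "blue"),
    ((5, 7), "blue"),
    ((3, 4), "red"),
    ((1, 4), "red"),
]

# Classify the unidentified circle at {7,4} using its 3 nearest neighbors.
print(knn_classify((7, 4), training_points, k=3))  # -> "blue"
```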
I hope this article will be useful for you. I may update it and add more information about k-NN in the near future.
Next topic: Identification trees / Decision trees
Suggested literature and useful sources:
- “Classification and K-Nearest Neighbours”, Hiroshi Shimodaira
- “Introduction to k Nearest Neighbour Classification and Condensed Nearest Neighbour Data Reduction”, Oliver Sutton
- “Introduction to Learning, Nearest Neighbors”, MIT OpenCourseWare
- “How Geometry, Data and Neighbors Predict Your Favorite Movies”, Patrick Honner, Quanta Magazine