KNN with an example

 A simple example to understand the intuition behind KNN:


The table below lists the height, age, and weight (target) values for 10 people. As you can see, the weight value for ID 11 is missing. We need to predict this person's weight based on their height and age.

On the graph, the points nearest to ID 11 give a rough range for its weight.

How does the KNN algorithm work?

Although KNN is best known for classification, predicting a numeric weight makes this a regression problem. In both cases, the algorithm uses 'feature similarity' to predict the value of any new data point.

1. First, the distance between the new point and each training point is calculated.

2. The closest K data points are selected (based on the distance). In this example, IDs 1, 5, and 6 are selected if the value of K is 3.

3. The average of these data points' target values is the final prediction for the new point, i.e. ID 11:

      ID 11 = (77+72+60)/3 = 69.66 kg.
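The three steps above can be written in a few lines of Python. This is a minimal sketch: the function name and the query point (5.2, 41) are my own, and the training rows are the three neighbours from the example.

```python
import math

def knn_predict(query, train, k):
    """Predict a target value for `query` by averaging its k nearest neighbours.

    `train` is a list of (features, target) pairs; `query` is a feature tuple.
    """
    # Step 1: distance from the query to every training point (Euclidean).
    dists = [(math.dist(query, x), y) for x, y in train]
    # Step 2: keep the K closest points.
    nearest = sorted(dists)[:k]
    # Step 3: average their target values.
    return sum(y for _, y in nearest) / k

# (height, age) -> weight for IDs 1, 5, 6 from the example table.
train = [((5.0, 45), 77), ((4.8, 40), 72), ((5.8, 36), 60)]
print(knn_predict((5.2, 41), train, k=3))  # average of all three weights
```

With K equal to the number of training points, the prediction is simply the mean of all weights, matching the hand calculation above.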

Methods of calculating the distance between points:

There are various methods for calculating this distance, of which the most commonly used methods are:

  • Euclidean Distance (for continuous variables)
  • Manhattan Distance (for continuous variables)
  • Hamming Distance (for categorical variables)

Euclidean Distance: 

 Euclidean Distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

Manhattan Distance:

This is the distance between real-valued vectors, computed as the sum of the absolute differences of their components.

Hamming Distance:

It is used for categorical variables. If the value x and the value y are the same, the distance D will be equal to 0; otherwise D = 1.
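Each of these distances is a one-liner in Python. A quick sketch (the function names are my own):

```python
def euclidean(x, y):
    # Square root of the sum of squared differences.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    # Sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # Count of positions where the categorical values differ.
    return sum(a != b for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))            # 5.0
print(manhattan((0, 0), (3, 4)))            # 7
print(hamming(("red", "S"), ("red", "M")))  # 1
```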
Once the distance of a new observation from the points in our training set has been measured, the next step is to pick the closest points. The number of points to consider is defined by the value of K.

In our example, K = 3 

ID    Height    Age    Weight
1     5         45     77
5     4.8       40     72
6     5.8       36     60

ID 11 = (77 + 72 + 60) / 3 = 69.66 kg

For the value K = 5, the closest points will be ID 1, ID 4, ID 5, ID 6, and ID 10.

ID    Height    Age    Weight
1     5         45     77
4     5.9       34     59
5     4.8       40     72
6     5.8       36     60
10    5.6       32     58

ID 11 = (77 + 59 + 72 + 60 + 58) / 5 = 65.2 kg
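Plugging the tabulated weights into the averaging step reproduces both predictions:

```python
# Weights of the nearest neighbours, read off the tables above.
k3_weights = [77, 72, 60]           # K = 3: IDs 1, 5, 6
k5_weights = [77, 59, 72, 60, 58]   # K = 5: IDs 1, 4, 5, 6, 10

print(sum(k3_weights) / len(k3_weights))  # ~69.67 kg
print(sum(k5_weights) / len(k5_weights))  # 65.2 kg
```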

We notice that the final result tends to change based on the value of K. So how can we figure out the optimum value of K? We can decide it based on the error calculated for our training and validation sets.
Have a look at the graphs below of training error and validation error for different values of K.



The value of K where the validation error is lowest is the optimum value for the model (it will vary for different datasets). This curve is known as an 'elbow curve' (because of its elbow-like shape) and is usually used to determine the K value.

For a very low value of K (say K = 1), the model overfits the training data, which leads to a high error rate on the validation set. On the other hand, for a high value of K, the model performs poorly on both the training and validation sets. If you observe closely, the validation error curve reaches its minimum at a value of K = 9.
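The K sweep behind an elbow curve can be sketched without any libraries. The synthetic 1-D height/weight data below is made up for illustration, so the best K it finds will not necessarily be 9 as in the article's plot:

```python
import math
import random

random.seed(42)

def knn_predict(q, train, k):
    # Average the weights of the k training points closest in height.
    nearest = sorted(train, key=lambda p: abs(p[0] - q))[:k]
    return sum(w for _, w in nearest) / k

# Synthetic dataset: weight roughly linear in height, plus noise.
data = [(h, 10 * h + random.gauss(0, 3))
        for h in (random.uniform(4, 6) for _ in range(60))]
train, val = data[:40], data[40:]

# Validation RMSE for each candidate K -- the curve to look for an elbow in.
errors = {k: math.sqrt(sum((knn_predict(h, train, k) - w) ** 2
                           for h, w in val) / len(val))
          for k in range(1, 16)}

best_k = min(errors, key=errors.get)
print(best_k, round(errors[best_k], 2))
```

In practice the same sweep is usually done with a library such as scikit-learn's `KNeighborsRegressor`, fitting one model per K and plotting the validation error.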
 

