- K-Nearest Neighbors (KNN) is a non-parametric supervised learning algorithm.
- There is no separate model-building or training step; it is an instance-based (lazy) learning method that simply stores the training data.
- We can use KNN for both classification and regression problems.
- One appealing thing about KNN is that we need to tune only one main hyperparameter, K (the number of nearest neighbors).
- We can use different distance measures like Euclidean, Hamming, Manhattan, etc., based on the type of data we have.
- It can also be used to fill in missing values, as sketched below.
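As a quick illustration of that last point, here is a minimal sketch of KNN-based imputation using scikit-learn's KNNImputer; the toy matrix and the n_neighbors value are my own example, not from the original post.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing entry (np.nan) in the second column
X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # value to be filled in
              [3.0, 4.0],
              [8.0, 8.0]])

# Each missing entry is replaced using the 2 nearest rows,
# with nearness measured on the features that are present
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```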
Let’s also discuss some disadvantages of KNN.
- Prediction (testing) is slow, because every new point must be compared against all stored observations.
- We need to choose the distance measure carefully, because KNN works best with homogeneous features. Euclidean distance is a good measure when the input variables are similar in type (e.g., all measured widths and heights), while Manhattan distance is a good measure when the input variables are not similar in type (such as age, gender, height, etc.). In practice, we should test KNN with multiple distance metrics and different K values together to choose the best combination; a sketch of such a search follows this list.
- It is sensitive to outliers and imbalanced classes, and it cannot handle missing values directly.
- It is not well suited for high-dimensional data.
- The data should be standardized before computing distances; otherwise features on larger scales dominate the result.
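Putting the last few points together, a common remedy is to standardize the features and then search over K and the distance metric at the same time. Here is a minimal sketch with scikit-learn; the pipeline, parameter grid, and generated toy data are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy binary-classification data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Standardize first so no single feature dominates the distance
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Try several K values and distance metrics together
grid = GridSearchCV(
    pipe,
    param_grid={
        "knn__n_neighbors": [1, 3, 5, 7],
        "knn__metric": ["euclidean", "manhattan"],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```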
We will take the dataset below and implement KNN for a classification problem. Both X and X1 were generated using the normal distribution function in an Excel sheet, and we have an equal number of observations in each binary class (five 1s and five 0s).
In our example we will use the Euclidean distance, which for two points (X_a, X1_a) and (X_b, X1_b) is √((X_a − X_b)² + (X1_a − X1_b)²). As discussed in the points above, we can also use other distance calculation methods based on the data.
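For concreteness, this is a small, self-contained version of that formula in Python (my own sketch; the second point is an arbitrary example):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Distance between the new point and a made-up example point
print(euclidean_distance((2.342324, 1.235894), (0.0, 0.0)))
```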
Now we will try to find the value of Y when X = 2.342324 and X1 = 1.235894.
The very first step is to calculate the Euclidean distance between the new input and every existing observation.
For a better understanding of how to calculate the distance, we will take only the two observations below and find their distance from our new values, X = 2.342324 and X1 = 1.235894.
Step 1) First we find the squared differences between the new X and X1 values and those of the observation above.
Step 2) Now add the two squared differences; the result is 21.61834649.
Step 3) Now take the square root of the value we got in Step 2.
So the Euclidean distance for our scenario is 3.5520948.
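The same three steps expressed in code; since the walked-through observation's coordinates come from a table not reproduced here, x_obs and x1_obs below are placeholders.

```python
import math

# New input point from the example
x_new, x1_new = 2.342324, 1.235894

# Placeholder coordinates for the existing observation
# (the real values come from the dataset table above)
x_obs, x1_obs = 0.0, 0.0

# Step 1: squared difference for each feature
sq_diff_x = (x_new - x_obs) ** 2
sq_diff_x1 = (x1_new - x1_obs) ** 2

# Step 2: sum of the squared differences
sum_sq = sq_diff_x + sq_diff_x1

# Step 3: square root of the sum is the Euclidean distance
print(math.sqrt(sum_sq))
```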
Now we calculate the distance between every observation and the point X = 2.342324, X1 = 1.235894; the result is shown below.
Next, sort the data from lowest to highest Distance; the sorted result is shown below.
If we take the 3 nearest neighbors for X = 2.342324 and X1 = 1.235894, their distances are 0.99897796, 1.5868156, and 2.24801801, as shown below. We now have two 1s and one 0, and we take the most frequently occurring value, which in our case is 1.
So we conclude that the value of Y is 1 when X = 2.342324 and X1 = 1.235894; a compact implementation of the whole procedure is sketched below.
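The full procedure (distance to every observation, sort ascending, keep the K closest, majority vote) is only a few lines of Python. This is a sketch: the training tuples below are placeholders, since the post's actual X/X1 table is not reproduced here.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, x1), y) pairs; `query` is an (x, x1) tuple.
    """
    # Distance from the query to every stored observation
    distances = [(math.dist(features, query), label)
                 for features, label in train]
    # Sort ascending and keep the k closest neighbors
    nearest = sorted(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Placeholder training data (the post's real values are in its table)
train = [((0.5, 0.8), 1), ((2.1, 1.0), 1), ((3.3, 4.2), 0),
         ((1.9, 2.2), 0), ((2.7, 0.9), 1), ((4.0, 3.5), 0)]

print(knn_predict(train, (2.342324, 1.235894), k=3))  # -> 1
```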
Note:
We should be very careful when selecting the K value. For example, in our case above we have two classes; if we took K as 4, there would be a chance of a tie (two 1s and two 0s). That is the reason I took K as 3 in our case.
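A two-line check makes the tie visible; Counter here is just standard-library vote counting, not something from the original post.

```python
from collections import Counter

votes = Counter([1, 1, 0, 0])   # K = 4 neighbors in a two-class problem
print(votes.most_common())      # [(1, 2), (0, 2)] -> no unique majority
```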