Suppose there are 10 students in a class and we want to find out which of them are similar.
How do we decide this? Perhaps based on height, color, marks scored in a subject, overall score, and so on.
Based on such shared attributes, we can say that students A and B are similar in terms of overall score, height, or color, depending on the parameters we choose.
Not all students have exactly the same score, the same height, etc., so how do we measure similarity? With the help of distance.
Let’s say 3 students scored as below and we want to find the similar students.
Viswa scored 10, Teja scored 9, and Satish scored 6.
Find the difference between each student's score and the others' scores.
Viswa score - Teja score = 10 - 9 = 1
Viswa score - Satish score = 10 - 6 = 4
Teja score - Satish score = 9 - 6 = 3
The difference between Viswa and Teja is much smaller than the difference between either of them and Satish, so in our case Viswa and Teja are similar in terms of their scores. This difference is what we call the distance.
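The subtraction above can be sketched in a few lines of Python. This is only an illustration: the names `scores`, `pairwise_distances`, and `closest_pair` are made up for this example, and the absolute value is taken so that the order of subtraction does not matter.

```python
from itertools import combinations

# Scores from the example above
scores = {"Viswa": 10, "Teja": 9, "Satish": 6}


def pairwise_distances(scores):
    """Return the absolute score difference for every pair of students."""
    return {
        (a, b): abs(scores[a] - scores[b])
        for a, b in combinations(scores, 2)
    }


distances = pairwise_distances(scores)

# The pair with the smallest distance is the most similar pair
closest_pair = min(distances, key=distances.get)

for pair, distance in distances.items():
    print(pair, distance)
print("Most similar:", closest_pair)
```

The smallest distance belongs to Viswa and Teja, which matches the conclusion reached by hand.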
There are two major use cases for identifying similarity.
The first is in statistics, where we want to identify the similarity between groups or variables. Below are the widely used methods for this, which we have already covered.
- Pearson correlation
- Kendall correlation
- Spearman’s correlation
- Chi-Squared
- ANOVA
- Student’s t-test
The second is in machine learning. In supervised learning we face similar problems, mainly in classification. For example, if we want to predict whether a person will default this month, we look at the people who are closest (most similar) to that person and decide whether they are likely to default based on what those similar people did.
K-nearest neighbours (KNN) is used to solve this kind of problem. It uses a distance calculation to identify the closest, most similar records (the neighbours); here K is the number of nearest neighbours considered.
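To make the idea concrete, here is a minimal sketch of K-nearest neighbours in plain Python. Everything in it is an assumption made for illustration: the function name `knn_predict`, the two features (missed payments and a debt ratio), and the toy data are all hypothetical, not real records. The sketch measures straight-line (Euclidean) distance with the standard library's `math.dist` (Python 3.8+) and takes a majority vote among the K closest people.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest points."""
    # Distance from the query to every known person, paired with their label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Smallest distance first, then keep only the k nearest neighbours
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the neighbours is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (missed payments last year, debt-to-income ratio)
train_points = [(0, 0.1), (1, 0.2), (0, 0.3), (5, 0.8), (6, 0.9), (4, 0.7)]
train_labels = [
    "no default", "no default", "no default",
    "default", "default", "default",
]

prediction = knn_predict(train_points, train_labels, query=(5, 0.75), k=3)
print(prediction)
```

In this toy data the three nearest neighbours of the new person are all defaulters, so the vote comes out as "default". In practice, features measured on very different scales are usually rescaled first so that one feature does not dominate the distance.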
Below are the widely used distance calculation techniques.
- Euclidean distance
- Manhattan distance
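As a rough sketch of how these two distances are calculated: Euclidean distance is the square root of the sum of squared differences, and Manhattan distance is the sum of absolute differences. The function names below are chosen for this example only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: sqrt of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Grid distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0, 0), (3, 4)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7

# With a single feature (such as one score), both reduce to the
# absolute difference used in the student example.
print(euclidean_distance((10,), (9,)))  # 1.0
print(manhattan_distance((10,), (9,)))  # 1
```

With only one feature the two techniques agree; they start to differ once each record has two or more features.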