An outlier is an observation that is extremely high or extremely low compared with the other observations, that is, an observation that lies very far from the rest of the data.
Parametric machine learning algorithms are very sensitive to outliers. But why? The first statistic affected by outliers is the mean. Many parametric methods rely on the mean, so a distorted mean carries through to our predictions.
Let’s say we are calculating the average salary in Bangalore, so we collect some samples from different companies. We collect two different samples: one includes the salary of a CEO and the other doesn’t. The image below shows the difference in mean between the two samples.
Hopefully the effect of outliers is now clear.
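To make the salary example concrete, here is a minimal sketch in Python. The salary figures are invented for illustration (they are not the values from the image), and the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical salaries (in thousands); illustrative values only.
salaries_without_ceo = [30, 35, 40, 45, 50]

# Same sample plus one extreme value standing in for the CEO's salary.
salaries_with_ceo = salaries_without_ceo + [500]

mean_without = mean(salaries_without_ceo)
mean_with = mean(salaries_with_ceo)

print("Mean without CEO:", mean_without)
print("Mean with CEO:", round(mean_with, 2))
```

A single extreme salary pulls the mean far above what most people in the sample actually earn.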
Reasons for the Presence of Outliers:
There are multiple reasons, listed below.
- Human error during data entry
- Malfunction of the measuring equipment
- Data extraction or data transformation errors
- Natural/real outliers
- Sampling errors
Whatever the reason for the outliers, we need to deal with them very carefully.
Types of Outliers:
- Univariate
- Bivariate / Multivariate
Univariate: we have already seen this in the salary example.
In the bivariate/multivariate case, we need to plot the data in multiple dimensions to find the outliers.
Let’s take the data set below, which has experience and salary. If you look at the variables individually, you don’t see any outliers. But when you look at them in relation to each other, such as salary based on experience, check the 4th observation: the person with 2 years of experience has an 80k salary, which is an outlier compared with others who have similar years of experience.
In the scatter plot below, the highlighted point is an outlier.
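The same idea can be checked numerically. The sketch below is one possible approach, not the method used for the plot: it assumes salary grows roughly in a straight line with experience, fits that line by least squares, and flags the observation that sits farthest from it. Only the 4th observation (2 years, 80k) comes from the example above; every other value is hypothetical.

```python
# Hypothetical data: years of experience and salary (in thousands).
# The 4th observation (2 years, 80k) mirrors the example in the text;
# all other values are invented for illustration.
experience = [1, 2, 3, 2, 5, 8, 10, 12]
salary = [20, 25, 32, 80, 50, 75, 90, 110]

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n

# Least-squares line: salary ~ intercept + slope * experience.
sxx = sum((x - mean_x) ** 2 for x in experience)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual salary minus the salary the line predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(experience, salary)]

# The observation with the largest absolute residual is the candidate outlier.
outlier_index = max(range(n), key=lambda i: abs(residuals[i]))
print("Candidate outlier:", experience[outlier_index], "years,",
      salary[outlier_index], "k")
```

Note that 80k is not unusual as a salary on its own, and 2 years is not unusual as experience on its own. It is only the combination that stands out.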
How to detect outliers:
For a small number of observations it is easy to identify the outliers, but how do we do it when we have a very large amount of data?
Visualization:
We can use the visualization techniques listed below to identify outliers visually.
- Box plot
- Histogram
- Scatter plot
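The three plots listed above can be sketched with matplotlib as follows. This assumes matplotlib is installed. The data values, variable names, and output file name are hypothetical and chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical salaries (in thousands) with one extreme value.
salaries = [30, 35, 40, 45, 50, 500]

# Hypothetical experience/salary pairs; (2 years, 80k) is the odd one out.
experience = [1, 2, 3, 2, 5, 8, 10, 12]
salary_by_exp = [20, 25, 32, 80, 50, 75, 90, 110]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Box plot: points beyond the whiskers are drawn individually as "fliers".
box = axes[0].boxplot(salaries)
axes[0].set_title("Box plot")

# Histogram: an outlier shows up as an isolated bar far from the bulk.
axes[1].hist(salaries, bins=10)
axes[1].set_title("Histogram")

# Scatter plot: a bivariate outlier sits away from the general trend.
axes[2].scatter(experience, salary_by_exp)
axes[2].set_xlabel("Experience (years)")
axes[2].set_ylabel("Salary (k)")
axes[2].set_title("Scatter plot")

fig.tight_layout()
fig.savefig("outlier_plots.png")
```

In an interactive session you could call `plt.show()` instead of saving the figure to a file.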
Statistics: