Imbalanced data is one of the main issues in classification problems. Why do we get imbalanced data? Let’s say I have 100 customers holding a credit card; at most 2 or 3% may be defaulters, and the remaining 97 to 98% are perfect payers (the defaulters form the minority class). Now if we run any machine learning algorithm to predict customer behavior, we might not predict correctly, because our model is trained mostly on the majority class. It may predict the majority class very well but not the minority class, which leads to biased predictions and misleading accuracy.
How can we overcome this problem? We have the two options below.
- Reduce the number of observations in the majority class (undersampling)
- Increase the number of observations in the minority class (oversampling)
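Before resampling, it helps to see the imbalance. A quick sketch of the credit-card example (the 95/5 split is taken from the example below; variable names are mine):

```python
from collections import Counter

# 95 good payers (label 0) and 5 defaulters (label 1)
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
print(counts)                   # Counter({0: 95, 1: 5})
print(counts[1] / len(labels))  # minority share: 0.05
```

A classifier that always predicts "good payer" would already score 95% accuracy here, which is exactly the misleading-accuracy problem described above.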
Let’s say we have 100 observations, of which 95 are good customers and the remaining 5 are bad customers.
Undersampling:
We randomly remove observations from the majority class. But the discarded observations may carry important information, so this can introduce bias.
From our example above, we randomly reduce the 95 good customers to 75.
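Random undersampling is just sampling without replacement from the majority class. A minimal sketch of the 95-to-75 reduction, using made-up index labels:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

good = list(range(95))       # indices of the 95 majority-class observations
bad = list(range(95, 100))   # indices of the 5 minority-class observations

# Undersampling: randomly keep only 75 of the 95 good customers
good_kept = random.sample(good, 75)

balanced = good_kept + bad
print(len(balanced))  # 80 observations: 75 good + 5 bad
```

The 20 discarded good customers are simply gone, which is where the potential information loss comes from.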
Oversampling:
We randomly add observations to the minority class by duplicating (copying) existing observations. But the newly added observations carry redundant information, which can lead to overfitting.
From our example above, we increase our 5 bad customers to 20.
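Random oversampling by duplication is sampling with replacement from the minority class. A sketch of growing the 5 bad customers to 20 (labels are made up for illustration):

```python
import random

random.seed(0)

bad = ["bad_1", "bad_2", "bad_3", "bad_4", "bad_5"]  # the 5 minority observations

# Oversampling: draw 15 duplicates with replacement to reach 20 total
extra = random.choices(bad, k=20 - len(bad))
oversampled = bad + extra
print(len(oversampled))  # 20
```

Every one of the 20 rows is an exact copy of one of the original 5, which is the redundancy that can push a model toward overfitting.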
Both methods above are prone to under/overfitting problems, so we need a mechanism to overcome this. We use SMOTE for oversampling.
Oversampling, which adds new observations, is usually a better option than undersampling, which deletes existing observations that carry meaningful information.
SMOTE:
The Synthetic Minority Oversampling Technique (SMOTE) is an algorithm that internally uses K-nearest neighbors (KNN). Click here to see how KNN works.
Let’s understand how SMOTE works internally. We have four minority-class data points, plotted below.
Steps for SMOTE :
- Identify a minority-class feature vector and its nearest neighbors
- Take the difference between the feature vector and one of its neighbors, and multiply that difference by a random number between 0 and 1
- Add this scaled difference to the feature vector to get a new synthetic point on the line between the two
- Repeat as many times as needed. In my case I need to add five synthetic minority data points, so I repeated the steps five times; the final result looks like the plot below
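The steps above can be sketched in plain NumPy. The four point coordinates are made up for illustration; a library implementation such as imbalanced-learn's `SMOTE` performs the same interpolation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Four minority-class points (coordinates assumed for illustration)
X_min = np.array([[1.0, 2.0],
                  [2.0, 1.5],
                  [1.5, 3.0],
                  [2.5, 2.5]])

def smote_sample(X, k=2):
    """Generate one synthetic point: pick a point, pick one of its k
    nearest neighbors, and interpolate at a random fraction between them."""
    i = rng.integers(len(X))
    d = np.linalg.norm(X - X[i], axis=1)  # distances from point i to all points
    neighbors = np.argsort(d)[1:k + 1]    # k nearest, skipping the point itself
    j = rng.choice(neighbors)
    gap = rng.random()                    # random number between 0 and 1
    return X[i] + gap * (X[j] - X[i])     # new point on the line segment

# Repeat five times to add five synthetic minority points
synthetic = np.array([smote_sample(X_min) for _ in range(5)])
print(synthetic.shape)  # (5, 2)
```

Because each synthetic point is an interpolation between two real minority points, it always lies inside the region the minority class already occupies, rather than being an exact duplicate.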
Important parameters for SMOTE:
The number of nearest neighbors and the SMOTE percentage are the two parameters for which it is important to pass valid values.
SMOTE percentage:
This controls how many minority-class observations to add. In our example above we had 5 bad customers; a SMOTE percentage of 100 adds the same number of minority observations as were in the original dataset, i.e. 5 new observations, while 200% adds 10.
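The arithmetic behind the percentage is simple enough to state as a one-line helper (the function name is mine, not a library API):

```python
def synthetic_to_add(minority_count, smote_percent):
    """Number of synthetic minority points created for a given SMOTE percentage."""
    return minority_count * smote_percent // 100

# With the 5 bad customers from the example:
print(synthetic_to_add(5, 100))  # 5 new observations (minority doubles to 10)
print(synthetic_to_add(5, 200))  # 10 new observations (minority triples to 15)
```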
Number of nearest neighbors:
- By increasing the number of nearest neighbors, you get features from more cases.
- By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.