Imbalanced data is one of the major issues in classification problems. Why do we end up with imbalanced data? Say I have 100 customers holding a credit card; maybe 2 or 3% of them are defaulters, and the remaining 97 to 98% are perfect payers (the defaulters here are the minority class).
When we split data for training and testing, the basic thumb rule is 70:30 (it can be any ratio), meaning 70% for training and the remaining 30% for testing the model we build.
Let’s say we have 100 observations, out of which 90 are good customers and the remaining 10 are bad customers. If I split the data randomly for training and testing, there is a chance that the training data will not contain any bad customers, so we will end up building the model using only good customers. When we then test with the test data, our model definitely won’t predict the correct values, as it was never trained with bad customer samples.
So we have to be very careful when splitting data for training and testing, to make sure both the training and testing data contain all classes (good, bad, etc.).
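Here is a minimal sketch of the problem, assuming scikit-learn's train_test_split and made-up labels (1 = good customer, 0 = bad customer) for the 90/10 example above; a purely random split can leave very few, or even no, bad customers on one side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90 good customers (1) and 10 bad customers (0); the feature is just a dummy row id
y = np.array([1] * 90 + [0] * 10)
X = np.arange(100).reshape(-1, 1)

# Plain random 70:30 split, no stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7
)

print("bad customers in train:", (y_train == 0).sum())
print("bad customers in test :", (y_test == 0).sum())
# Depending on the random_state, the bad-customer counts can be badly skewed.
```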
Stratified splitting is the method of splitting data in such a way that both the training and testing data get all the classes.
Let’s see how stratified splitting works with the data below. We have 20 good customers and 5 bad customers.
I want to split the data 80% and 20% as train and test, which means 20 observations for training and 5 observations for testing.
If we are not using stratified splitting, the output might look like the scenarios below. I have shown only three scenarios here, but there are many possible ways the split can turn out; in the end we just need 80% of the data for training.
Scenario 2: If you look at scenario 2 in the screenshot above, all 5 bad-customer observations went into the 80% (training data), so there are no bad customers left for the test data.
Scenario 3: All the good customers fall into the 80% training data, so we don’t have any bad customer data available for training; during testing we have only bad customer data, and our model will fail because it was never trained on bad customers.
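To make the "many possible scenarios" point concrete, here is a small simulation (an illustration, not part of the original example) that repeats a plain 80/20 split of the 20 good / 5 bad customers many times and counts how often the bad customers end up entirely on one side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 good customers (1) and 5 bad customers (0)
y = np.array([1] * 20 + [0] * 5)
X = np.arange(25).reshape(-1, 1)

all_bad_in_train = all_bad_in_test = 0
for seed in range(1000):
    _, _, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
    if (y_test == 0).sum() == 0:
        all_bad_in_train += 1   # like scenario 2: no bad customers left to test on
    if (y_train == 0).sum() == 0:
        all_bad_in_test += 1    # like scenario 3: no bad customers to train on

print("splits with no bad customers in test :", all_bad_in_train)
print("splits with no bad customers in train:", all_bad_in_test)
```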
Now, if we use stratified splitting for our case with 80% training data, it will take 80% of the data from the good customers and 80% of the data from the bad customers, as shown below.
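With scikit-learn this comes down to passing the labels to the stratify parameter of train_test_split; a minimal sketch for our 20 good / 5 bad example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 good customers (1) and 5 bad customers (0)
y = np.array([1] * 20 + [0] * 5)
X = np.arange(25).reshape(-1, 1)

# stratify=y keeps the good/bad ratio the same in both parts of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=7
)

print("train -> good:", (y_train == 1).sum(), "bad:", (y_train == 0).sum())  # 16 good, 4 bad
print("test  -> good:", (y_test == 1).sum(), "bad:", (y_test == 0).sum())    # 4 good, 1 bad
```

So the training data gets 16 good and 4 bad customers, and the test data gets 4 good and 1 bad customer, which is exactly the 80/20 split applied to each class separately.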