Before we discuss Bagging and Random Forest, we first have to understand what a bootstrap sample is.
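As a quick illustration, here is a minimal sketch of drawing a bootstrap sample: we pick rows from the original data with replacement until the sample is the same size as the original. The six-row list below is made up purely for illustration.

```python
import random

# A made-up dataset of 6 rows, just for illustration
data = ["row1", "row2", "row3", "row4", "row5", "row6"]

# A bootstrap sample: draw len(data) rows WITH replacement,
# so some rows repeat and some never appear
bootstrap_sample = [random.choice(data) for _ in range(len(data))]

# Rows that were never drawn are the "Out-of-Bag" rows for this sample
out_of_bag = [row for row in data if row not in bootstrap_sample]

print(bootstrap_sample)
print(out_of_bag)
```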
Bagging:
Bagging is also called the bootstrap aggregator; it gives better accuracy than a single decision tree and reduces variance. Bagging is very easy once you know how decision trees and bootstrap samples work. It uses the same greedy split criteria such as Entropy, Gini, and information gain; the only difference is that it builds multiple trees, each on its own bootstrap sample.
Bagging has only one parameter, which is the number of trees.
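If you prefer code, below is a minimal sketch of the whole Bagging idea, assuming scikit-learn's DecisionTreeClassifier; the iris dataset and the choice of ten trees are my own assumptions for illustration, not part of the example above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

n_trees = 10          # the only parameter of Bagging (assumed value)
rng = np.random.default_rng(42)
trees = []

for _ in range(n_trees):
    # Bootstrap sample: same size as the data, drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier()   # uses the Gini criterion by default
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Each tree votes; the majority vote is the Bagging prediction
votes = np.array([tree.predict(X[:5]) for tree in trees])
majority = [np.bincount(col.astype(int)).argmax() for col in votes.T]
print(majority)
```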
Random Forest:
Random Forest is a small enhancement over Bagging: instead of considering all variables to identify the best split, it selects a random subset of variables and, among them, finds the best split using Entropy or Gini.
Let's say I have six variables: name, age, gender, income, city, and DOB. To find the best split with Bagging, it applies Gini or Entropy to all six variables and picks the best split. With Random Forest, it first selects a couple of variables at random, let's say three out of the six, and then searches for the best split among those three using Gini or Entropy. That is why it is called Random Forest 🙂 .
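A tiny sketch of that idea: at every split, Random Forest first draws a random subset of the features and only then searches for the best split among them. The feature names mirror the example above, and the subset size of three is just the assumption from that example.

```python
import random

features = ["name", "age", "gender", "income", "city", "DOB"]

# Bagging: the best split is searched over ALL six features
candidates_bagging = features

# Random Forest: at each split, first pick a random subset (say 3 of 6),
# then search for the best split (by Gini or Entropy) only among these
candidates_rf = random.sample(features, 3)

print(candidates_rf)   # e.g. ['income', 'age', 'DOB']
```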
Due to this randomness in split selection, Random Forest reduces the correlation between the decision trees in the same forest; when two decision trees carry the same information, it hurts the final prediction.
Random Forest has two parameters: one is the number of trees, and the second is the number of features to select randomly at each split.
The number of features that can be searched at each split point (S) must be specified as a parameter to the algorithm. You can try different values, and you can also find the best value using cross validation.
Below are the rules of thumb for a good value:
- For classification S = sqrt(p)
- For regression S = p/3
Where S is the number of randomly selected features that can be searched at a split point and p is the number of input variables. For example, if a dataset had 25 input variables for a classification problem, then:
- S = sqrt(25)
- S = 5
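A minimal sketch of the two rules of thumb (the value p = 25 simply mirrors the example above; in scikit-learn this corresponds to the max_features parameter):

```python
import math

p = 25                                  # number of input variables

s_classification = int(math.sqrt(p))    # sqrt(p)  -> 5
s_regression = p // 3                   # p / 3    -> 8

print(s_classification, s_regression)
```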
Please go through my blogs on Decision Trees for a clear understanding of how Random Forest works. If you understand Decision Trees, Bagging and Random Forest work the same way; instead of one tree we have multiple trees. The only difference is that we build more decision trees, each on sample data created by the bootstrap sampling method.
So let's assume we built a Random Forest using the dataset below, and I want to build three random decision trees.
Predictions using Random Forest:
Now I want to predict whether I can play the game or not when the outlook is Sunny, the temperature is Cool, the humidity is Normal, and there is no wind.
We have three decision trees; let's say two of the trees gave the result YES and one tree gave NO. The final answer is YES, since it takes the majority vote.
- For classification, it uses majority voting
- For regression, it takes the average of the outcomes, since for regression the outcome is continuous
Now, as two decision trees say YES, we conclude we can play the game when the outlook is Sunny, the temperature is Cool, the humidity is Normal, and there is no wind.
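A minimal sketch of how the final answer is combined; the three tree outputs below are assumed to match the example (two YES and one NO), and the regression numbers are made up purely for illustration.

```python
from collections import Counter

# Assumed outputs of the three decision trees for the query:
# outlook=Sunny, temperature=Cool, humidity=Normal, wind=No
tree_predictions = ["YES", "YES", "NO"]

# Classification: majority voting
final_class = Counter(tree_predictions).most_common(1)[0][0]
print(final_class)                 # YES

# Regression: average of the tree outputs (continuous outcome)
tree_outputs = [3.1, 2.8, 3.4]     # made-up numbers for illustration
final_value = sum(tree_outputs) / len(tree_outputs)
print(final_value)                 # 3.1
```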
Accuracy:
As per the bootstrap sampling method, typically about 1/3 (33%) of the original data does not end up in the bootstrap sample; because rows are drawn with replacement, each bootstrap sample contains only about 2/3 (67%) of the distinct original rows, while the remaining 33% are never picked.
This means each decision tree is trained using only about 67%, or 2/3, of the data, so about 1/3, or 33%, of the data is unknown to each decision tree in the Random Forest.
We call each bootstrap sample a bag, and as we know, each bag has only about 67% of the data; the remaining 33% of the data is called the Out-of-Bag data for that sample.
Now we test this 33% of the data, also called the Out-of-Bag data, against the decision tree and see whether it classifies it correctly or not (each decision tree has one Out-of-Bag sample, so if we have three decision trees we have three Out-of-Bag samples).
- We can measure how accurate our Random Forest is by the proportion of Out-of-Bag samples that are correctly classified.
- The proportion of Out-of-Bag samples that were wrongly predicted is called the Out-of-Bag error.
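A minimal sketch of the Out-of-Bag idea, building on the bagging sketch above; the iris dataset and three trees are again my own assumptions, and it scores each tree on its own Out-of-Bag rows, which is the simplified view described above (the full algorithm aggregates the votes of all trees for which a row is Out-of-Bag).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

correct, total = 0, 0
for _ in range(3):                      # three trees, as in the example
    idx = rng.integers(0, len(X), size=len(X))
    oob_mask = np.ones(len(X), dtype=bool)
    oob_mask[idx] = False               # rows never drawn = Out-of-Bag rows

    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    preds = tree.predict(X[oob_mask])

    correct += (preds == y[oob_mask]).sum()
    total += oob_mask.sum()

print("OOB accuracy:", correct / total)
print("OOB error   :", 1 - correct / total)
```

scikit-learn also exposes this directly: RandomForestClassifier(oob_score=True) stores the result in its oob_score_ attribute.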