How and when does the Decision tree stop splitting?

By default, splitting stops only when the tree reaches 100% purity, i.e. when every child/subset node is homogeneous and contains a single class (all the labels in the node are either Yes or No). Growing the tree this far leads to the overfitting problem.

In simple terms, when the algorithm has memorised everything in the training data, it will not be able to predict well on a new set of data; this is what we call overfitting. So the splitting criterion is very important in the case of a decision tree.
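
A minimal sketch of this gap, using scikit-learn (whose DecisionTreeClassifier grows to pure leaves by default) on made-up synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels (flip_y flips 20% of them).
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings: keep splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", full_tree.score(X_train, y_train))  # 1.0 (pure leaves)
print("test accuracy:", full_tree.score(X_test, y_test))     # noticeably lower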

For example: you want to watch a movie, but the audio of the recording has picked up a lot of noise, so you can hear people clapping, whistling, and laughing. All of this noise is so distracting that it makes you stop the video and search for another one. This unnecessary, unwanted noise is a good analogy for overfitting: when we record a movie with our camera and mic, the mic captures each and every sound produced in the theatre. It "learns" everything in the theatre, including the noise, and that is exactly what drives you to stop watching the movie.

Below are a few of the options for overcoming overfitting:

  1. Pruning
  2. Gain Ratio
  3. Statistical significance test (Chi-square)

Pruning:

        We know pruning means cutting/trimming the overgrown or dead parts of a tree. We do the same thing to overcome the overfitting problem: we cut down the overgrown decision tree.

        We have two ways of pruning the tree:

  1. Post-pruning (prune the tree once the decision tree is completely built/grown)
  2. Pre-pruning (stop splitting early, while the tree is still growing)

        We will discuss this part in more detail later; a rough sketch of both styles follows below.
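
A rough sketch in scikit-learn, reusing X_train/X_test from the snippet above; the ccp_alpha mechanism is scikit-learn's cost-complexity pruning, and picking the middle alpha from the path is just an illustrative choice:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: constrain the tree while it is growing.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                    random_state=0).fit(X_train, y_train)

# Post-pruning: grow the tree fully, then cut it back. scikit-learn offers
# cost-complexity pruning; larger ccp_alpha values trim more of the tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary middle candidate
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha,
                                     random_state=0).fit(X_train, y_train)

print("pre-pruned test accuracy:", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))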

Gain Ratio:

        We know the default splitting criterion of a decision tree is information gain, and splitting continues until each node is pure, which leads to overfitting. By correcting the bias of information gain we can reduce overfitting.

        Information gain: it works fine in most cases, unless you have a few variables with a large number of distinct values (or classes).

Information gain is biased towards choosing attributes with a large number of values as root nodes.

         Gain ratio: this is a modification of information gain that reduces its bias and is usually the better option. Gain ratio overcomes the problem by taking into account the number of branches that would result before making the split; it corrects information gain by dividing it by the intrinsic (split) information of the split, i.e. gain ratio = information gain / split information.
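
A small worked example in plain Python/numpy on hypothetical toy data: a unique-valued "ID" column gets the same (perfect) information gain as a genuinely useful attribute, but gain ratio penalises it through its split information:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attr, labels):
    # Parent entropy minus the weighted entropy of each child subset.
    gain = entropy(labels)
    for v in np.unique(attr):
        mask = attr == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def split_info(attr):
    # Entropy of the split itself: many small branches = high split info.
    _, counts = np.unique(attr, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attr, labels):
    si = split_info(attr)
    return info_gain(attr, labels) / si if si > 0 else 0.0

y = np.array(["yes", "yes", "no", "no", "yes", "no", "yes", "no"])
outlook = np.array(["sun", "sun", "rain", "rain", "sun", "rain", "sun", "rain"])
row_id = np.arange(8)  # a useless, unique-valued "ID" attribute

print(info_gain(outlook, y), info_gain(row_id, y))    # 1.0 vs 1.0: a tie
print(gain_ratio(outlook, y), gain_ratio(row_id, y))  # 1.0 vs ~0.33: outlook wins

Here both attributes have information gain 1.0, but the ID column's split information is log2(8) = 3, so its gain ratio drops to about 0.33 while outlook keeps 1.0.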

Statistical significance test (Chi-square):

        This method is used while the decision tree is being constructed, to determine whether the information gain from a split is significant. It is a test that runs every time a split occurs and essentially asks: "Could this split have happened by chance?"

        If it appears that the split is likely due to chance, the information gained from it is not significant. However, if it appears that the split is not likely due to chance, the information gained from it is significant.
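
A hedged sketch of that check using scipy's chi-square test of independence; the contingency table (class counts in each child branch) is invented for illustration, and 0.05 is just the conventional significance threshold:

from scipy.stats import chi2_contingency

# Counts of (yes, no) labels landing in each child node after a candidate split.
table = [[30, 10],   # left branch
         [12, 28]]   # right branch

chi2, p_value, dof, expected = chi2_contingency(table)
if p_value < 0.05:
    print(f"p={p_value:.4f}: split is unlikely to be chance -> keep it")
else:
    print(f"p={p_value:.4f}: split could easily be chance -> stop here")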
