Measure of impurity

When a dataset has a class label for the predicted (dependent) variable, such as Yes, No or Neutral, we can measure how homogeneous or heterogeneous the table is based on those classes.

We say a dataset is pure, or homogeneous, if it contains only a single class (all YES or all NO).
If a dataset contains several classes, we say the table is impure, or heterogeneous (a mix of YES and NO).
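
As a quick illustration of this definition, here is a minimal Python sketch. The helper name `is_pure` is made up for this example, and treating an empty list as pure is an assumption, not something the definition above settles.

```python
def is_pure(labels):
    """Return True when all labels belong to a single class.

    Assumption: an empty collection is treated as pure.
    """
    return len(set(labels)) <= 1


print(is_pure(["YES", "YES", "YES"]))  # only one class present
print(is_pure(["YES", "NO", "YES"]))   # two classes present
```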

There are several ways to measure the degree of impurity. The most well-known measures are listed below.

Entropy
Gini index
Classification error

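Their respective formulas are given below, where $p_i$ is the proportion of observations that belong to class $i$.

Entropy = $-\sum_i p_i \log_2 p_i$
Gini index = $1 - \sum_i p_i^2$
Classification error = $1 - \max_i p_i$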

Our example dataset has two classes, YES and NO: 9 YES and 5 NO out of 14 observations.

Let’s calculate the probability of each class: P(YES) = 9/14 ≈ 0.643 and P(NO) = 5/14 ≈ 0.357.
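
The class probabilities are simply the class counts divided by the total number of observations. A small Python sketch of that step follows; the `labels` list is rebuilt from the 9 YES / 5 NO counts stated above, since the individual rows of the dataset are not reproduced here.

```python
from collections import Counter

# Rebuild the label column from the stated counts: 9 YES and 5 NO.
labels = ["YES"] * 9 + ["NO"] * 5

counts = Counter(labels)
total = sum(counts.values())

# Probability of each class = class count / total observations.
probabilities = {cls: n / total for cls, n in counts.items()}

for cls, p in probabilities.items():
    print(f"P({cls}) = {p:.3f}")
```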

For this dataset, the values of entropy, Gini index and classification error are as follows:

Entropy = 0.94
Gini = 0.46
Classification error = 0.36

Wondering how we got these values? Let’s work through them by hand!

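Entropy:

$-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.4098 + 0.5305 \approx 0.940$

Gini:

$1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 1 - \frac{106}{196} \approx 0.459$

Classification error:

$1 - \max\left(\frac{9}{14}, \frac{5}{14}\right) = 1 - \frac{9}{14} \approx 0.357$

In the classification error formula we take the larger of the two class probabilities, here P(YES) = 9/14, as the equation requires.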
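
The three values reported above can be reproduced with a short Python sketch. The function names are chosen for this example only, and the code assumes the standard definitions of the three measures (base-2 logarithm for entropy) applied to the 9/14 and 5/14 class probabilities.

```python
import math


def entropy(probs):
    """Entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def gini(probs):
    """Gini index: one minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probs)


def classification_error(probs):
    """Classification error: one minus the largest class probability."""
    return 1 - max(probs)


# Class probabilities for 9 YES and 5 NO out of 14 observations.
probs = [9 / 14, 5 / 14]

print(round(entropy(probs), 2))               # 0.94
print(round(gini(probs), 2))                  # 0.46
print(round(classification_error(probs), 2))  # 0.36
```

For a pure dataset, for example `probs = [1.0]`, all three functions return zero, which matches the idea that a single-class table has no impurity.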

Published by viswateja3
