Feature selection

In real-world projects we usually have a lot of variables (features). Some of them carry the same information (like age and date of birth), and some, like firstName and LastName, add no value during model building. We need to remove such variables, and this process is called feature selection.

Let’s take an example of credit card approval data. There will be many variables, like the ones below:

  1. Personal information like name, age, gender, marital status, etc.
  2. Financial information like monthly income, additional income, credit history, existing loan information, etc.
  3. Demographic information like address, race, address type, etc.
  4. Application information like date of application, campaign information, etc.
  5. Education and employment information like education level, type of education, employment, employer category, designation, years of experience, etc.
  6. Co-applicant details, which may repeat the same kind of information we saw in points 1 to 5

Do you think we need to pass all these variables into our model? The answer is absolutely NO.

Because if we give good data we will get a good result; if we send garbage data we will get a garbage result. Removing unhelpful variables also reduces the time taken to train a model and can improve its accuracy. That’s where feature selection and dimension reduction come into the picture; they play an important role in any artificial intelligence project.

We have three important methods for feature selection which are widely used.

  1. Filter based methods
  2. Wrapper methods
  3. Embedded methods

Filter based:

        This is one of the most widely used approaches to feature selection. We measure the correlation (or association) between the variables and then select the ones that score best. Below are the widely used measures.

  1. Pearson correlation
  2. Spearman Rank correlation
  3. Kendall Rank correlation
  4. Chi square

We already discussed the four methods mentioned above; please read that post first to understand the maths behind them.

Now the big question: when should we use each of these correlation methods?

Pearson correlation: We go for this method when both variables are continuous, roughly normally distributed, and have a linear relationship. It is a parametric measure.

Spearman rank correlation: When the data is not normally distributed, or the relationship is not linear, we go for Spearman rank correlation. It is a non-parametric measure. It only assumes a monotonic relationship between the two variables (one that either never decreases or never increases as the other variable grows).
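
To make the difference between Pearson and Spearman concrete, here is a minimal sketch in plain Python. The data (years_experience, monthly_income) is made up for illustration, and the helper functions pearson, ranks and spearman are my own names, not from any library. In practice a library routine such as scipy.stats.pearsonr or scipy.stats.spearmanr would normally be used instead.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: income rises monotonically, but not linearly, with experience.
years_experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
monthly_income = [year ** 3 for year in years_experience]

pearson_r = pearson(years_experience, monthly_income)
spearman_rho = spearman(years_experience, monthly_income)
print(f"Pearson r    = {pearson_r:.3f}")     # below 1: not a straight line
print(f"Spearman rho = {spearman_rho:.3f}")  # 1.000: the ranks agree perfectly
```

Because Spearman only looks at the ordering of the values, it reports a perfect relationship here, while Pearson is pulled below 1 by the curve.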

Kendall rank correlation: It is also used when the data is not normally distributed or the relationship is not linear, and it is also a non-parametric measure. Then what makes it different from Spearman rank? Kendall is usually preferred when the sample is small or has many tied values, and it tends to be less affected by significant outliers in the dataset.
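
As a rough illustration of how Kendall works, the sketch below counts concordant and discordant pairs in plain Python. It computes the simple tau-a variant and assumes there are no tied values; the function name kendall_tau and the data are made up for this example. Library versions such as scipy.stats.kendalltau use a tie-adjusted variant, so their numbers can differ when ties are present.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite ways
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical data: a perfectly ordered series, then the same series
# with the last value replaced by an extreme outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, -100]

tau_clean = kendall_tau(x, y_clean)
tau_outlier = kendall_tau(x, y_outlier)
print(f"tau without outlier = {tau_clean:.2f}")    # 1.00
print(f"tau with outlier    = {tau_outlier:.2f}")  # 0.60
```

The outlier only flips the 9 pairs it takes part in (out of 45), and it would flip the same 9 pairs whether its value were -100 or -1, because only the ordering matters.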

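Chi square: This is used when both variables are categorical. Strictly speaking it tests whether the two variables are independent, rather than measuring a correlation.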
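
Here is a small sketch of the chi square calculation on a contingency table, again in plain Python. The counts and the category names (employment type against approval decision) are invented for illustration, and chi_square is my own helper name. Note that scipy.stats.chi2_contingency applies a continuity correction to 2x2 tables by default, so it would report a slightly different statistic for this table.

```python
def chi_square(table):
    """Pearson chi square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count we would expect in this cell if the variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed_count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts.
# Rows: employment type (salaried, self-employed); columns: (approved, rejected).
observed = [
    [30, 10],
    [20, 40],
]

statistic, dof = chi_square(observed)
print(f"chi square = {statistic:.2f}, degrees of freedom = {dof}")
```

For 1 degree of freedom the 5% critical value is about 3.84, so a statistic this large suggests that employment type and approval are not independent in this made-up table, which would make the variable worth keeping.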

Wrapper methods:

        When I was trying to understand this feature selection method, parameter fine-tuning came to my mind: we try different combinations and keep the one that scores best.

        Let’s understand it with an example. Say I have the variables u, v, w, x, y, z, and I build my model with different subsets of these variables.

        First I build a model with u, x, y, w, then with z, u, v, x, and so on. Then I check the evaluation metric (such as MSE, RMSE, R square, or ROC/AUC) for all the models I built with the different subsets, and whichever subset gives the best-scoring model is the one I select.
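
The idea above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the toy data with three variables u, v, w, the choice of a 1-nearest-neighbour classifier scored with leave-one-out accuracy, and the helper names loo_accuracy and best_subset. Any model and any metric could take their place. The sketch tries every subset, which is only practical for a handful of variables, since n variables give 2^n - 1 subsets.

```python
from itertools import combinations

# Hypothetical toy data: each row is (u, v, w). Only u separates the classes;
# v is noise on a larger scale and w is constant.
FEATURES = ["u", "v", "w"]
ROWS = [
    (1, 50, 3), (2, 10, 3), (3, 60, 3), (4, 20, 3),    # label 0
    (7, 52, 3), (8, 12, 3), (9, 62, 3), (10, 22, 3),   # label 1
]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

def loo_accuracy(columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given columns."""
    correct = 0
    for i, row in enumerate(ROWS):
        # Find the closest other row, measuring distance on the chosen columns only.
        nearest = min(
            (j for j in range(len(ROWS)) if j != i),
            key=lambda j: sum((row[c] - ROWS[j][c]) ** 2 for c in columns),
        )
        correct += LABELS[nearest] == LABELS[i]
    return correct / len(ROWS)

def best_subset():
    """Score a model on every non-empty subset of features and keep the best one."""
    best_columns, best_score = None, -1.0
    for size in range(1, len(FEATURES) + 1):
        for columns in combinations(range(len(FEATURES)), size):
            score = loo_accuracy(columns)
            if score > best_score:  # strict, so smaller subsets win ties
                best_columns, best_score = columns, score
    return [FEATURES[c] for c in best_columns], best_score

subset, score = best_subset()
print(subset, score)  # ['u'] 1.0
```

With many variables, trying every subset is too slow, so people sample subsets or use a guided search instead.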

Embedded methods:

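        Forward and backward selection are among the oldest and most widely used methods. Note that many references group these two stepwise procedures with the wrapper methods, because they also retrain the model on each candidate set of variables; the term "embedded" is more often used for methods where the selection happens during training itself, such as Lasso (L1) regularization or tree-based feature importance. Stepwise selection is divided into two parts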

  1. Forward selection
  2. Backward selection.

Forward Selection:

        In this process we start building the model with one variable and add a new variable in each iteration, checking the accuracy of the model each time. When adding another variable no longer improves the accuracy, we stop adding new variables.

Backward Selection:

        This is the reverse of forward selection: we start building our model with all the variables and then remove them one by one in each iteration, stopping when removing another variable makes the model worse.
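
Both procedures can be sketched as greedy loops around a scoring function. As before, this is only an illustration under my own assumptions: the toy data with variables u, v, w, a 1-nearest-neighbour classifier scored with leave-one-out accuracy, and the helper names loo_accuracy, forward_selection and backward_elimination are all made up for this example.

```python
# Hypothetical toy data: each row is (u, v, w). Only u separates the classes;
# v is noise on a larger scale and w is constant.
FEATURES = ["u", "v", "w"]
ROWS = [
    (1, 50, 3), (2, 10, 3), (3, 60, 3), (4, 20, 3),    # label 0
    (7, 52, 3), (8, 12, 3), (9, 62, 3), (10, 22, 3),   # label 1
]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

def loo_accuracy(columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given columns."""
    correct = 0
    for i, row in enumerate(ROWS):
        nearest = min(
            (j for j in range(len(ROWS)) if j != i),
            key=lambda j: sum((row[c] - ROWS[j][c]) ** 2 for c in columns),
        )
        correct += LABELS[nearest] == LABELS[i]
    return correct / len(ROWS)

def forward_selection():
    """Start with no features; keep adding the most helpful one while the score improves."""
    selected, best_score = [], -1.0
    remaining = list(range(len(FEATURES)))
    while remaining:
        scores = [(loo_accuracy(selected + [c]), c) for c in remaining]
        score, column = max(scores, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the model, so stop adding
        selected.append(column)
        remaining.remove(column)
        best_score = score
    return [FEATURES[c] for c in selected], best_score

def backward_elimination():
    """Start with all features; keep dropping one while the score does not get worse."""
    selected = list(range(len(FEATURES)))
    best_score = loo_accuracy(selected)
    while len(selected) > 1:
        scores = [(loo_accuracy([k for k in selected if k != c]), c) for c in selected]
        score, column = max(scores, key=lambda pair: pair[0])
        if score < best_score:
            break  # every removal hurts the model, so stop removing
        selected.remove(column)
        best_score = score
    return [FEATURES[c] for c in selected], best_score

print(forward_selection())     # (['u'], 1.0)
print(backward_elimination())  # (['u'], 1.0)
```

On this toy data both directions end up with the same single variable, but on real data they can disagree, because each one only ever looks one step ahead.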

Published by viswateja3
