Missing Data Analysis with MICE

Outliers and missing values are two of the most important problems any data science engineer needs to deal with. We have already discussed outliers.

Before talking about how to deal with missing values, let’s talk about types of missing values.

  1. Missing at random (MAR)
  2. Missing completely at random (MCAR)
  3. Missing not at random (MNAR)

Let’s take an example. We are running a salary survey in Hyderabad and collecting information from individuals such as salary, years of experience, designation, etc.

  1. Sometimes we simply forget to ask one or two questions, so the gaps have nothing to do with the respondent or their answers. This is called MCAR.
  2. People at manager/director level or above are less likely to share their salary, although a few at that level still do. The missingness depends on something we did observe (the designation), not on the salary itself. This is called MAR.
  3. Some people have a very low income and do not want to share their income details. The missingness depends on the missing value itself. This is called MNAR.
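
To make the three types concrete, here is a minimal sketch that knocks out salaries from a made-up survey in three different ways. The survey rows, the salary range, the probabilities and the function names (mcar, mar, mnar) are all assumptions for illustration, not real data.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical survey rows: designation plus salary in arbitrary units.
designations = ["engineer", "senior engineer", "manager", "director"]
survey = [
    {"designation": random.choice(designations),
     "salary": round(random.uniform(3, 60), 1)}
    for _ in range(200)
]

def mcar(rows, p=0.1):
    """MCAR: every salary has the same chance p of being missing."""
    return [dict(r, salary=None) if random.random() < p else dict(r)
            for r in rows]

def mar(rows, p=0.7):
    """MAR: missingness depends only on an observed column (designation)."""
    senior = {"manager", "director"}
    return [
        dict(r, salary=None)
        if r["designation"] in senior and random.random() < p
        else dict(r)
        for r in rows
    ]

def mnar(rows, threshold=10, p=0.7):
    """MNAR: missingness depends on the hidden value itself (low salary)."""
    return [
        dict(r, salary=None)
        if r["salary"] < threshold and random.random() < p
        else dict(r)
        for r in rows
    ]

def missing_count(rows):
    """Number of rows whose salary is missing."""
    return sum(r["salary"] is None for r in rows)

mcar_rows, mar_rows, mnar_rows = mcar(survey), mar(survey), mnar(survey)
print(missing_count(mcar_rows), missing_count(mar_rows), missing_count(mnar_rows))
```

In the MAR version we can still explain the gaps from the designation column, while in the MNAR version the only thing that explains them is the salary we never got to see.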

As far as I know, there are two broad ways to deal with missing values

  1. Delete the missing values
  2. Fill missing values

Deleting the missing values:

  1. Listwise delete (deleting rows)
  2. Pairwise delete

Listwise delete:

        We delete every row that has one or more missing values. This is one of the easiest ways to deal with missing values, and it is a reasonable option when only a few values are missing.

        We should be very careful when deleting records listwise. In the MNAR case from our example above, some people do not share their salary because it is very low. If we delete those observations, we lose important information about that particular low-income group.
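
Here is a small sketch of listwise delete on a hand-made table. The rows and the helper name listwise_delete are assumptions for illustration; None marks a missing answer.

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"age": 25, "experience": 2, "salary": None},   # did not share salary
    {"age": 28, "experience": 4, "salary": None},   # did not share salary
    {"age": 31, "experience": 7, "salary": 12.0},
    {"age": None, "experience": 10, "salary": 18.0},
    {"age": 40, "experience": 15, "salary": 30.0},
    {"age": 45, "experience": 20, "salary": 42.0},
]

def listwise_delete(records):
    """Keep only the rows that have no missing value in any column."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = listwise_delete(rows)
print(len(rows), "rows before,", len(complete), "rows after")
print([r["salary"] for r in complete])
```

Half of the rows are gone, including the row where only the age was missing. If the two unanswered salaries were in fact low, the average of the salaries that remain would overstate the true average.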

Pairwise delete:

        This is better than listwise delete. First let us understand what pairwise means: occurring in pairs, or two at a time. Say I have two sets a={1,2} and b={4,5}. The pairs of a and b are {(1,4),(1,5),(2,4),(2,5)}, that is, every combination of one element from each set.

        In pairwise delete, each entry of the correlation matrix is estimated from all the data available for that pair of study variables, so a row is dropped only for the pairs in which one of its two values is missing. When the missingness is random and the sample is large, this may produce good estimates of the population correlation matrix.
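
The idea can be sketched in plain Python as follows. The columns and the helper names (pearson, pairwise_corr) are assumptions for illustration; each pair of columns is correlated using only the rows where both values are present.

```python
from itertools import combinations
from math import sqrt

# Hypothetical columns; None marks a missing value.
data = {
    "age":        [25, 30, None, 40, 45, 50],
    "experience": [2, 6, 9, None, 20, 25],
    "salary":     [5.0, None, 14.0, 22.0, 35.0, None],
}

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pairwise_corr(columns):
    """Correlate each pair of columns on the rows where both are observed."""
    result = {}
    for a, b in combinations(columns, 2):
        pairs = [(x, y) for x, y in zip(columns[a], columns[b])
                 if x is not None and y is not None]
        xs, ys = zip(*pairs)
        result[(a, b)] = (pearson(xs, ys), len(pairs))
    return result

for (a, b), (r, n) in pairwise_corr(data).items():
    print(f"{a} vs {b}: r={r:.3f} from n={n} rows")
```

With this made-up table, listwise delete would leave only two complete rows, while the pairwise approach uses three or four rows for each correlation. The price is that every entry of the matrix is based on a different subset of rows.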

Filling missing values:

        Filling missing values using statistical methods is called imputation. There are two types of imputation, as shown below

  1. Single imputation
  2. Multiple imputation

Single imputation:

        When we replace the missing values with one of the statistical measures below, we call it single imputation.

  1. Mean imputation
  2. Median imputation
  3. Mode imputation

Which of these three types of imputation to use depends entirely on your dataset.

  1. If your data is normally distributed, mean imputation is a good option.
  2. If your data is skewed, median imputation is a good option.
  3. If your data is categorical, mode imputation is the best option.
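
The three rules above can be sketched with Python's statistics module. The sample columns and the helper name impute are assumptions for illustration.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

age = [25, 30, None, 35, 40]            # roughly symmetric
salary = [4, 5, 6, None, 7, 60]         # skewed by one large value
designation = ["engineer", "manager", None, "engineer", "director"]

print(impute(age, "mean"))              # fills with 32.5
print(impute(salary, "median"))         # fills with 6
print(impute(designation, "mode"))      # fills with "engineer"
```

For the skewed salary column the mean of the observed values is 16.4, which is larger than every salary except the single extreme one, so the median is the safer fill there.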

Multiple imputation:

        When we replace missing values with predictions that an algorithm makes from the existing values, and repeat this to produce several completed datasets instead of one, we call it multiple imputation.

  1. If the variable with missing values is continuous, it uses regression algorithms.
  2. If the variable with missing values is categorical, it uses classification algorithms.

Multivariate Imputation by Chained Equations (MICE) is one of the most widely used multiple imputation methods.

We will take the example data set below, which has 9 observations, with one missing value under Age and two missing values under Salary.

We could replace these missing values with the mean, median, or mode, based on the conditions we discussed in the single imputation section.

Here we will try to impute the missing values using MICE.

Step by step process for MICE:

  1. Calculate the mean from the available data for every variable that has missing values. In our case, Age and Income.
  2. Replace the missing values with the mean for all of these variables.
  3. After step 2 there are no missing values in either Age or Income. Now pick one variable, either Age or Income, as the dependent (predicted) variable and restore it to its original state (with missing values).
  4. Train a classification or regression algorithm, depending on whether the dependent variable is categorical or continuous, on the rows where it is observed.
  5. Predict the missing values and replace them with the predicted values.
  6. Repeat steps 3 to 5 for every other variable that has missing values.
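
The steps above can be sketched in plain Python. This is a simplified illustration under several assumptions: the Age and Income values are made up (9 observations, one missing Age, two missing Income), a simple least-squares line stands in for the prediction algorithm in step 4, the helper names (fit_line, mean_fill, mice) are hypothetical, and the code handles exactly two variables. Full MICE implementations also add randomness to the predictions and produce several completed datasets, which this sketch leaves out.

```python
# Hypothetical data: 9 observations, one missing Age and two missing Income.
age    = [25, 27, 29, 31, 33, None, 41, 47, 50]
income = [50, None, 110, 140, 170, 225, None, 350, 400]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def mean_fill(values):
    """Steps 1 and 2: replace missing entries with the observed mean."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

def mice(columns, n_rounds=10):
    """Chained imputation for exactly two numeric columns (a sketch)."""
    missing = {name: [i for i, v in enumerate(col) if v is None]
               for name, col in columns.items()}
    filled = {name: mean_fill(col) for name, col in columns.items()}
    for _ in range(n_rounds):
        for target in columns:              # step 6: each variable in turn
            if not missing[target]:
                continue
            (predictor,) = [n for n in columns if n != target]
            # Step 3: train only on rows where the target was really observed.
            train = [i for i in range(len(filled[target]))
                     if i not in missing[target]]
            intercept, slope = fit_line(
                [filled[predictor][i] for i in train],
                [filled[target][i] for i in train],
            )
            # Steps 4 and 5: predict, then overwrite the earlier fill values.
            for i in missing[target]:
                filled[target][i] = intercept + slope * filled[predictor][i]
    return filled

result = mice({"age": age, "income": income})
print([round(v, 1) for v in result["age"]])
print([round(v, 1) for v in result["income"]])
```

The outer loop repeats the whole cycle a few times (n_rounds is an arbitrary choice here), because each variable's predictions change slightly once the other variable's mean fills have been replaced by predictions.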

For step 4, we have already seen how to calculate linear/logistic regression and decision trees; we can apply the same knowledge here.

We can also use KNN to fill in the missing values; you can find how to calculate KNN here.
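
As a rough idea of how KNN can fill a missing value, here is a minimal sketch. The rows and the helper name knn_impute are assumptions for illustration, and the sketch assumes the feature columns have no missing values and are on comparable scales (in practice you would usually scale them first).

```python
def knn_impute(rows, target, features, k=3):
    """Fill missing `target` values with the mean of the k nearest complete rows."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        # Squared Euclidean distance on the feature columns.
        nearest = sorted(
            donors,
            key=lambda d: sum((d[f] - r[f]) ** 2 for f in features),
        )[:k]
        estimate = sum(d[target] for d in nearest) / len(nearest)
        filled.append(dict(r, **{target: estimate}))
    return filled

# Hypothetical rows; the last person did not share a salary.
rows = [
    {"age": 25, "experience": 2, "salary": 5.0},
    {"age": 27, "experience": 4, "salary": 7.0},
    {"age": 30, "experience": 6, "salary": 9.0},
    {"age": 45, "experience": 20, "salary": 35.0},
    {"age": 28, "experience": 5, "salary": None},
]
print(knn_impute(rows, "salary", ["age", "experience"], k=3)[-1])
```

The three closest people by age and experience earn 7.0, 9.0 and 5.0, so the missing salary is filled with their average, 7.0. The much older person with a salary of 35.0 has no influence on the result.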

Published by viswateja3
