Hyperparameter tuning

Before reading this post, make sure you are familiar with at least one of the algorithms such as Decision Trees or Random Forests. Below are a few parameters that we need to tune for most algorithms: the maximum number of leaves per tree, the depth of trees, the number of trees in a Random Forest, the learning rate, and the L1 and L2 regularization weights. … Continue reading "Hyperparameter tuning"
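As a quick taste before the full post: exhaustive grid search is the simplest tuning strategy — try every combination of parameter values and keep the best-scoring one. The sketch below is a toy illustration; `cross_val_score` here is a hypothetical stand-in function, not scikit-learn's, and in practice you would train and cross-validate a real model for each combination.

```python
from itertools import product

# Hypothetical parameter grid; the names mirror common tree-ensemble knobs
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100],
    "learning_rate": [0.01, 0.1],
}

def cross_val_score(params):
    # Stand-in for a real cross-validation score; in practice you would
    # train and evaluate a model here. This toy score peaks at depth 5, lr 0.1.
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)

best_params, best_score = None, float("-inf")
for combo in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), combo))
    score = cross_val_score(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # the combination with max_depth=5 and learning_rate=0.1 wins
```

Grid search grows exponentially with the number of parameters, which is why random search or Bayesian optimization is often preferred for larger grids.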

Data Transformations

In real-world data most variables are not normally distributed, yet most parametric statistical tests (ANOVA, t-test, regression, etc.) are based on the assumption that the data is normally distributed. If the data is not normal, it does not meet the assumptions of these tests, and in that case the results will… Continue reading "Data Transformations"
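To see why a transformation helps, here is a minimal sketch (with made-up data) that measures moment-based skewness before and after a log transform on a right-skewed sample — the log pulls the long right tail in, so the skewness drops:

```python
import math

def skewness(xs):
    # population (moment) skewness: third central moment / stdev^3
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 3, 3, 4, 4, 5, 120]    # heavy right tail
logged = [math.log(x) for x in raw]       # log transform compresses large values

print(skewness(raw), skewness(logged))    # skewness shrinks after the transform
```

Note the log transform only applies to strictly positive values; for data containing zeros or negatives, alternatives such as Box-Cox or Yeo-Johnson are commonly used.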

Stratified sampling

Imbalanced data is one of the major issues in classification problems. Why do we have imbalanced data? Let's say I have 100 customers holding a credit card; at most I may have 2 or 3% defaulters, while the remaining 97 to 98% are perfect payers (this is called the presence of a minority class)… Continue reading "Stratified sampling"
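The idea of stratified sampling is to sample from each class separately so the rare class keeps its share. Below is a small stdlib-only sketch (the 3%-defaulter data is made up to match the example above):

```python
import random
from collections import defaultdict

def stratified_sample(data, labels, frac, seed=0):
    # group records by class, then sample the same fraction from each group
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(data, labels):
        by_class[y].append(x)
    sample = []
    for y, xs in by_class.items():
        k = round(len(xs) * frac)
        sample.extend((x, y) for x in rng.sample(xs, k))
    return sample

# 3 defaulters (label 1) and 97 perfect payers (label 0), as in the example
data = list(range(100))
labels = [1] * 3 + [0] * 97

sample = stratified_sample(data, labels, frac=0.3)
minority = sum(1 for _, y in sample if y == 1)
print(len(sample), minority)  # a 30% sample still contains the minority class
```

A plain random 30% sample could easily contain zero defaulters; the stratified version guarantees the minority class is represented in the same proportion.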

Feature selection

In real projects we will have a lot of variables/features. Some of them carry the same information (like age and date of birth), and some, like FirstName, LastName, etc., add no value during model building, so we need to remove such variables — this process is called feature selection. Let's take… Continue reading "Feature selection"
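One simple filter-style approach, sketched below on a hypothetical feature table: skip identifier-like columns outright, and drop any feature that is almost perfectly correlated with one already kept (the 0.95 cutoff is an assumed threshold, not a universal rule). Here `year_of_birth` duplicates the information in `age`, just as the post describes.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric columns
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical feature table: year_of_birth carries the same information as age
features = {
    "age":           [25, 32, 47, 51, 62],
    "year_of_birth": [1999, 1992, 1977, 1973, 1962],
    "income":        [30, 45, 80, 60, 55],
}
id_like = {"first_name", "last_name", "customer_id"}  # no predictive value

selected = []
for name, values in features.items():
    if name in id_like:
        continue
    # drop any feature almost perfectly correlated with one already kept
    if any(abs(pearson(values, features[kept])) > 0.95 for kept in selected):
        continue
    selected.append(name)

print(selected)  # year_of_birth is dropped: |r| with age is 1
```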

Missing Data Analysis with MICE

Outliers and missing values are among the most important issues any data science engineer needs to deal with; we have already discussed outliers. Before talking about how to handle missing values, let's look at the types of missing values: Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR). Let's take an example… Continue reading "Missing Data Analysis with MICE"
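Full MICE chains regression models across many incomplete columns; as a deliberately minimal single-column sketch of the same idea (initialise the missing value, then repeatedly re-impute it from a model fitted on the current data), consider this made-up example where y = 2x with one value missing:

```python
def fit_line(xs, ys):
    # ordinary least squares for y = a + b*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

x = [1, 2, 3, 4, 10]
y = [2, 4, None, 8, 20]            # y = 2x, with y[2] missing

missing = [i for i, v in enumerate(y) if v is None]
observed = [v for v in y if v is not None]
for i in missing:                   # step 1: initialise with the column mean
    y[i] = sum(observed) / len(observed)

for _ in range(20):                 # step 2: repeatedly re-impute from a model
    a, b = fit_line(x, y)
    for i in missing:
        y[i] = a + b * x[i]

print(round(y[2], 3))               # converges to 6.0, consistent with y = 2x
```

The mean-initialised guess (8.5) is pulled toward the regression line on each pass until it settles at the value the linear relationship implies.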

Kendall Rank Correlation

Rank correlation measures whether, when two variables are ranked, a change in one rank shows a matching positive or negative change in the other rank. Don't worry if you still don't understand — we will compute the Kendall rank correlation using the dataset below. We are trying to see if there is any correlation between the size of… Continue reading "Kendall Rank Correlation"
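Kendall's tau boils down to counting pairs: a pair of observations is concordant if both variables move in the same direction between them, discordant if they move oppositely. A small sketch on toy ranks:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    # tau = (concordant - discordant) / total number of pairs (no ties assumed)
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

tau = kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(tau)  # 8 concordant, 2 discordant out of 10 pairs -> (8-2)/10 = 0.6
```

This simple form assumes no tied ranks; with ties, the tau-b variant adjusts the denominator.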

Chi Square

We know correlation is used to check the relation between two continuous variables. We should also have some kind of mechanism to check the relation between two categorical variables, and that is Chi-Square. Steps to check the relation between two categorical variables: define the hypothesis, define alpha, find out the degrees of freedom, define the decision rule, and calculate the… Continue reading "Chi Square"
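The steps above can be sketched on a hypothetical 2×2 contingency table: expected counts come from row total × column total / grand total, and the statistic sums (observed − expected)²/expected over all cells.

```python
# observed counts for two categorical variables (hypothetical data)
observed = [[30, 10],
            [20, 40]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected under independence
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows-1) * (cols-1)
critical = 3.841   # chi-square critical value at alpha = 0.05, df = 1

print(round(chi2, 3), chi2 > critical)  # statistic exceeds the critical value,
                                        # so we reject independence
```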

Spearman Rank Correlation

Rank correlation measures whether, when two variables are ranked, a change in one rank shows a matching positive or negative change in the other rank. Don't worry if you still don't understand — we will compute the Spearman rank correlation using the dataset below. We are trying to see if there is any correlation between the size of… Continue reading "Spearman Rank Correlation"
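Spearman's rho replaces the raw values with their ranks; when there are no ties it reduces to the well-known shortcut 1 − 6·Σd²/(n(n²−1)), where d is the rank difference per observation. A toy sketch:

```python
def ranks(xs):
    # rank of each value, 1 = smallest (assumes no ties)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_rho([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
print(rho)  # rank differences d = [0,-1,1,-1,1], sum d^2 = 4 -> 1 - 24/120 = 0.8
```

With tied values the shortcut formula no longer applies exactly; the general definition is Pearson correlation computed on the (mid-)ranks.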

Synthetic Minority Over-sampling Technique (SMOTE)

Imbalanced data is one of the main issues in classification problems. Why do we have imbalanced data? Let's say I have 100 customers holding a credit card; at most I may have 2 or 3% defaulters, while the remaining 97 to 98% are perfect payers (this is called the presence of a minority class)… Continue reading "Synthetic Minority Over-sampling Technique (SMOTE)"
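SMOTE tackles the imbalance from the other direction: instead of sampling carefully, it creates new synthetic minority points by interpolating between a minority example and one of its nearest minority neighbours. The sketch below is a simplified stdlib-only version of that idea (real implementations such as imbalanced-learn's use configurable k-nearest neighbours and more careful bookkeeping); the 2-D minority points are made up.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    # generate n_new synthetic points along segments between minority neighbours
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest minority neighbours of p (excluding p itself)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment p -> q
        synthetic.append(tuple(pi + gap * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

minority = [(1, 1), (2, 2), (3, 3)]      # toy minority class in 2-D
new_points = smote(minority, n_new=5)
print(new_points)  # each synthetic point lies between two real minority points
```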

Outliers treatment

An outlier is an extremely high or low value compared to the other observations — in other words, an observation that lies very far from the rest. Parametric machine learning algorithms are very sensitive to outliers. But why? The first statistic affected by outliers is the mean, and all parametric machine learning methods use the mean, so outliers will impact our… Continue reading "Outliers treatment"
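A quick sketch of both points (on made-up payment data): the mean is dragged far from the bulk of the data by a single extreme value while the median barely moves, and the common 1.5×IQR fence flags that value as an outlier.

```python
import statistics

payments = [10, 11, 12, 12, 13, 100]   # one extreme value

print(statistics.mean(payments), statistics.median(payments))
# the mean (~26.3) is dragged up by the outlier; the median stays at 12

# 1.5 * IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
q1, _, q3 = statistics.quantiles(payments, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in payments if x < low or x > high]
print(outliers)  # only the extreme payment is flagged
```

Once flagged, such values can be removed, capped at the fences (winsorized), or investigated as data-entry errors, depending on the context.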