Analysis of variance (ANOVA)

Analysis of variance (ANOVA) tests whether the means of three or more groups differ. Example 1: Say there are a couple of colleges in your area and you want to know which college performs best (in this case, students from the different colleges all took the same exam). Example 2: Say I have three … Continue reading “Analysis of variance (ANOVA)”

One way ANOVA calculation

Let's calculate a one-way ANOVA with the dataset below. Hypotheses: Null hypothesis H0: µ1 = µ2 = µ3. Alternative hypothesis Ha: at least one group mean differs from the others (a weaker claim than µ1 ≠ µ2 ≠ µ3). Calculate the mean: Grand mean: the mean of all observations from all samples, which equals the mean of the sample means when the samples are the same size. Between-group variability: In the image below, the two different samples are … Continue reading “One way ANOVA calculation”
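
The calculation named in this excerpt (grand mean, then between-group variability) can be sketched in a few lines of Python. This is a minimal illustration, not the article's own worked example: the three groups of scores are made up, and the function name `one_way_anova` is hypothetical.

```python
from statistics import mean

# Hypothetical scores for three groups (NOT the article's dataset).
groups = [
    [1, 2, 3],
    [3, 4, 5],
    [5, 6, 7],
]

def one_way_anova(samples):
    """Return (grand mean, SS between, SS within, F statistic)."""
    all_values = [x for s in samples for x in s]
    grand_mean = mean(all_values)      # mean of every observation
    k = len(samples)                   # number of groups
    n = len(all_values)                # total number of observations

    # Between-group variability: how far each group mean is from the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-group variability: how far each value is from its own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)

    ms_between = ss_between / (k - 1)  # degrees of freedom: k - 1
    ms_within = ss_within / (n - k)    # degrees of freedom: n - k
    return grand_mean, ss_between, ss_within, ms_between / ms_within

grand_mean, ss_between, ss_within, f_stat = one_way_anova(groups)
print(grand_mean, ss_between, ss_within, f_stat)
```

A large F statistic means the group means are spread out relative to the noise inside each group, which is evidence against H0.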

Data Transformations

In practice, most variables are not normally distributed, yet most parametric statistical tests (ANOVA, t-test, regression, etc.) assume that the data are normally distributed. Data that are not normally distributed therefore do not meet the assumptions of these tests; in this case the results will … Continue reading “Data Transformations”
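
As a rough illustration of why a transformation helps, the sketch below applies a log transform to right-skewed data and compares skewness before and after. The excerpt does not say which transformations the article covers, so the log transform is an assumption here, and the data and the helper name `skewness` are made up.

```python
from math import log
from statistics import mean

def skewness(values):
    """Population skewness: third central moment over (second moment)^1.5."""
    m = mean(values)
    m2 = mean([(x - m) ** 2 for x in values])
    m3 = mean([(x - m) ** 3 for x in values])
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data (a few large values stretch the right tail).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
transformed = [log(x) for x in raw]  # log requires strictly positive values

raw_skew = skewness(raw)
log_skew = skewness(transformed)
print(raw_skew, log_skew)
```

The transformed values are still not perfectly symmetric, but their skewness is much closer to zero than that of the raw values.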

NLP Basics

Prerequisites: Install NLTK using pip install nltk. In this article we will cover the following basic natural language processing topics: tokenization, stop words, stemming, and lemmatization. Tokenization: Tokenization is the process of breaking a sequence of text into smaller pieces called tokens. There are two kinds of tokenization: word tokenization and sentence tokenization. Word tokenization: … Continue reading “NLP Basics”
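
To make the two kinds of tokenization concrete, here is a small sketch that uses only Python's standard `re` module as a stand-in. It is not NLTK's algorithm: NLTK provides `word_tokenize` and `sent_tokenize` for this, which handle abbreviations and other edge cases that these simple patterns do not. The function names and the sample text are illustrative assumptions.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "NLTK is a toolkit. It splits text!"  # hypothetical sample text
words = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)
print(words)
print(sentences)
```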

Simple Moving Average

A moving average, also called a simple moving average (SMA), is a widely used technique for finding the direction of a trend from past data. It is often used for forecasting long-term trends. We will calculate a three-year moving average with the dataset below. A three-year moving average for the above dataset means we … Continue reading “Simple Moving Average”
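
A three-year moving average can be sketched as the mean of each run of three consecutive yearly values. The yearly figures below are hypothetical, not the article's dataset, and `simple_moving_average` is an illustrative name.

```python
def simple_moving_average(values, window=3):
    """Return the mean of every run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical yearly values for five consecutive years.
yearly_sales = [10, 20, 30, 40, 50]
sma = simple_moving_average(yearly_sales, window=3)
print(sma)  # [20.0, 30.0, 40.0]
```

Five years of data give only three averages, because the first full three-year window ends at the third year.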

Stratified sampling

Imbalanced data is one of the major issues in classification problems. Why do we get imbalanced data? Say I have 100 customers who hold credit cards: at most 2 or 3% will be defaulters and the remaining 97 to 98% are perfect payers (this is known as the presence of a minority class), … Continue reading “Stratified sampling”
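
The idea behind stratified sampling is to sample each class separately so that the minority class is not lost by chance. Below is a minimal standard-library sketch under stated assumptions: the 100 customers (3 defaulters, 97 payers) are made up to match the excerpt's example, and `stratified_sample` is a hypothetical helper, not a library function.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from every class."""
    rng = random.Random(seed)          # fixed seed for repeatable output
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    sample = []
    for members in by_class.values():
        # Keep at least one record per class so no class disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 3 defaulters and 97 perfect payers.
customers = [
    (f"customer_{i}", "defaulter" if i < 3 else "payer") for i in range(100)
]
sample = stratified_sample(customers, label_of=lambda r: r[1], fraction=0.2)
defaulters = sum(1 for _, label in sample if label == "defaulter")
payers = sum(1 for _, label in sample if label == "payer")
print(len(sample), defaulters, payers)
```

With so few defaulters, rounding makes the class proportions in the sample only approximately equal to those in the full data.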

Feature selection

In practice we have a lot of variables/features. Some variables carry the same information (like age and date of birth), and some (like firstName, LastName, etc.) add no value during model building, so we need to remove them. This process is called feature selection. Let's take … Continue reading “Feature selection”

Measure of impurity

In a dataset that contains classes for the predicted/dependent variable (like Yes, No, Neutral, etc.), we can measure the homogeneity or heterogeneity of the table based on those classes. We say a dataset is pure or homogeneous if it contains only a single class (either YES or NO). If a dataset contains several classes, then we say that the … Continue reading “Measure of impurity”
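
The excerpt does not say which impurity measures the article uses. As an assumption, the sketch below shows two common ones, entropy and Gini impurity; both are zero for a pure dataset and grow as the classes become more mixed. The label lists are made up.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Fraction of the dataset that belongs to each class."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes."""
    return sum(p * log2(1 / p) for p in class_proportions(labels))

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p * p for p in class_proportions(labels))

pure = ["YES", "YES", "YES", "YES"]   # a single class: homogeneous
mixed = ["YES", "YES", "NO", "NO"]    # evenly split between two classes

print(entropy(pure), gini(pure))      # 0.0 0.0
print(entropy(mixed), gini(mixed))    # 1.0 0.5
```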

Bias-Variance and trade off

Bias and variance are common terms in machine learning. Bias: We see this issue more in parametric machine learning algorithms. Most parametric algorithms are linear/polynomial, so the fitted model will not pass through all the data points; this gives more error, which leads to underfitting. If you see … Continue reading “Bias-Variance and trade off”

Missing Data Analysis with MICE

Outliers and missing values are among the most important issues any data science engineer needs to deal with; we have already discussed outliers. Before talking about how to deal with missing values, let's talk about the types of missing values: Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR). Let's take one example, … Continue reading “Missing Data Analysis with MICE”