NLP Basics

Prerequisites: install NLTK using pip install nltk. In this article we will cover the following basic Natural Language Processing topics: tokenization, stop words, stemming, and lemmatization. Tokenization is the process of breaking a sequence of text into individual pieces, or tokens. Tokenization itself has two parts: word tokenization and sentence tokenization. Word tokenization… Continue reading “NLP Basics”
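As a minimal sketch of the two kinds of tokenization with NLTK (the sample sentence is my own illustration, not from the post):

```python
import nltk
nltk.download("punkt")  # tokenizer models; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is fun. Tokenization splits text into words and sentences."
print(word_tokenize(text))  # word tokenization: ['NLP', 'is', 'fun', '.', ...]
print(sent_tokenize(text))  # sentence tokenization: two sentences
```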

Simple Moving Average

A moving average, also called a Simple Moving Average (SMA), is a widely used technique for finding the direction of a trend from past data, and it is commonly used for forecasting long-term trends. We will calculate a three-year moving average with the data set below. A three-year moving average for this data set means we… Continue reading “Simple Moving Average”
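As a rough illustration, a three-period moving average can be computed with pandas; the yearly sales figures below are made up for the example and are not the post's data set:

```python
import pandas as pd

# Hypothetical yearly sales figures (not the article's data set)
sales = pd.Series([100, 120, 110, 130, 150, 140],
                  index=[2015, 2016, 2017, 2018, 2019, 2020])

# Three-year simple moving average: mean of each year and the two before it
sma3 = sales.rolling(window=3).mean()
print(sma3)
```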

Stratified sampling

Imbalanced data is one of the major issues in classification problems. Why do we end up with imbalanced data? Say I have 100 credit card customers; at most 2 or 3% of them may be defaulters, while the remaining 97 to 98% pay perfectly (the defaulters are the minority class)… Continue reading “Stratified sampling”
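One common way to keep that minority-class proportion intact when splitting such data is scikit-learn's stratify option; the toy labels below are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 97 good payers (0) and 3 defaulters (1)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 97 + [1] * 3)

# stratify=y preserves the ~3% defaulter ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())  # both close to 0.03
```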

Feature selection

In real-world data we will have a lot of variables/features. Some of them carry the same information (like age and date of birth), and some, like firstName and lastName, add no value during model building, so we need to remove those variables; this process is called feature selection. Let's take… Continue reading “Feature selection”
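A minimal sketch of dropping redundant and identifier-like columns with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical customer table
df = pd.DataFrame({
    "firstName": ["Ann", "Bob"],
    "lastName": ["Lee", "Roy"],
    "age": [30, 45],
    "dateOfBirth": ["1993-01-01", "1978-05-20"],
    "income": [50000, 72000],
})

# Drop name columns that add no predictive value, and dateOfBirth,
# which duplicates the information already carried by age
features = df.drop(columns=["firstName", "lastName", "dateOfBirth"])
print(features.columns.tolist())  # ['age', 'income']
```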

Measure of impurity

In a dataset that contains classes for the predicted/dependent variable (like Yes, No, Neutral, etc.), we can measure the homogeneity or heterogeneity of the data based on those classes. We say a dataset is pure, or homogeneous, if it contains only a single class (either YES or NO). If a dataset contains several classes, then we say that the… Continue reading “Measure of impurity”
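A small sketch of two standard impurity measures, the Gini index and entropy, computed for a list of class labels (the labels are illustrative):

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 for a pure (single-class) set."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)); 0 for a pure set."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(gini(["YES"] * 10))                 # 0.0 -> homogeneous
print(gini(["YES"] * 5 + ["NO"] * 5))     # 0.5 -> maximally mixed for two classes
print(entropy(["YES"] * 5 + ["NO"] * 5))  # 1.0
```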

Bias-Variance and trade off

Bias and variance are terms we hear often in machine learning. Bias: we see this issue more in parametric machine learning algorithms, because most parametric algorithms fit a linear/polynomial form that does not pass through all the data points, so we get more error, which leads to underfitting. If you see… Continue reading “Bias-Variance and trade off”
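To make the underfitting point concrete, here is a rough sketch (my own illustration, not from the post) of a straight-line model fit to clearly non-linear data; its large training error reflects high bias:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Non-linear (quadratic) data, illustrative only
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)

# A straight line cannot follow the curve -> high bias, underfitting
linear = LinearRegression().fit(X, y)
print("linear model MSE:", mean_squared_error(y, linear.predict(X)))

# Adding a squared feature lets the model follow the data much more closely
X_poly = np.hstack([X, X ** 2])
poly = LinearRegression().fit(X_poly, y)
print("quadratic model MSE:", mean_squared_error(y, poly.predict(X_poly)))
```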

Missing Data Analysis with MICE

Outliers and missing values are among the most important issues any data science engineer needs to deal with; we already discussed outliers. Before talking about how to handle missing values, let's talk about the types of missing values: Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR). Let's take one example… Continue reading “Missing Data Analysis with MICE”
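The post's title refers to MICE (Multiple Imputation by Chained Equations). As one possible Python sketch, scikit-learn's IterativeImputer follows a similar chained-equations idea; the small table with missing entries is hypothetical:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with missing entries (np.nan): columns are age and income
X = np.array([
    [25.0, 50000.0],
    [30.0, np.nan],
    [np.nan, 62000.0],
    [45.0, 80000.0],
])

# Each feature with missing values is modelled from the other features,
# iterating until the imputations stabilise (the idea behind MICE)
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))
```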

Kendall Rank Correlation

Rank correlation measures whether, when two variables are ranked, a change in one variable's rank is matched by a positive or negative change in the other's rank as we compare pairs of observations. Don't worry if this is still unclear; we will compute the Kendall rank correlation using the dataset below. We are trying to see if there is any correlation when the size of… Continue reading “Kendall Rank Correlation”
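A quick sketch of computing Kendall's tau with SciPy; the two ranked lists are made up for illustration and are not the post's dataset:

```python
from scipy.stats import kendalltau

# Hypothetical rankings of the same six items by two judges
rank_a = [1, 2, 3, 4, 5, 6]
rank_b = [2, 1, 3, 5, 4, 6]

tau, p_value = kendalltau(rank_a, rank_b)
print(tau, p_value)  # tau close to +1 means the rankings mostly agree
```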

Chi Square

We know correlation is used to check the relationship between two continuous variables. We also need a mechanism to check the relationship between two categorical variables, and that is Chi-Square. Steps to check the relationship between two categorical variables: define the hypothesis, define alpha, find the degrees of freedom, define the decision rule, calculate the… Continue reading “Chi Square”
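Those steps can be carried out numerically with SciPy's chi2_contingency; the contingency table below is hypothetical:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: gender (rows) vs. product preference (columns)
table = [[30, 10],
         [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print("chi-square:", chi2)
print("degrees of freedom:", dof)
print("p-value:", p_value)  # reject independence if p-value < alpha
```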

Linear discriminant analysis

Before we talk about linear discriminant analysis, let's take a quick look at the disadvantages of logistic regression. Two-class problems: logistic regression is intended for two-class or binary classification problems; it can be extended to multiclass classification but is rarely used for that purpose. Unstable with well-separated classes: logistic regression can become unstable when the classes are well… Continue reading “Linear discriminant analysis”
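A minimal scikit-learn sketch of fitting LDA on a multiclass dataset; the iris data is used only as a convenient example, not as the post's data:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three fairly well-separated classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# LDA handles multiclass problems directly and copes well with separated classes
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("test accuracy:", lda.score(X_test, y_test))
```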