Cross Validation


In machine learning, we cannot simply fit a model on the training data and assume it will work accurately on real data. We must make sure the model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross validation technique.

Cross validation is a technique in which we train our model using a subset of the data set and then evaluate it using the complementary subset of the data set.

  1. Reserve some portion of the sample data set.
  2. Train the model using the rest of the data set.
  3. Test the model using the reserved portion of the data set.
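The three steps above can be sketched with scikit-learn (assumed installed; the iris data set and logistic regression are stand-ins chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 1. Reserve a portion (here 20%) of the sample data set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Train the model using the rest of the data set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Test the model using the reserved portion.
print(round(model.score(X_test, y_test), 2))
```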

Types of Cross validation :
Holdout method:
The basic remedy is to remove a part of the training data and use it to evaluate a model trained on the rest of the data. The error estimate then tells how our model is doing on unseen data (the validation set). This is the simplest kind of cross validation, known as the holdout method.
Limitation: It still suffers from high variance, because the estimate depends heavily on which points happen to end up in the holdout set.
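The high-variance limitation can be seen directly by repeating the holdout split with different random seeds (a sketch using scikit-learn, assumed installed; the breast-cancer data set and decision tree are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The holdout estimate changes depending on the particular split drawn.
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print([round(s, 3) for s in scores])  # the estimates vary from split to split
```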

Leave one out cross validation(LOOCV):
This approach leaves one data point out of the training data, i.e. if there are 'n' data points in the original sample data set, then n-1 samples are used to train the model and 1 point is used as the validation set. This is repeated for every way the original sample can be separated in this manner, and the error is averaged over all trials to give the overall effectiveness.
For example, with 1000 instances in a data set, 999 will form our training set and 1 the test set. The process is iterated so that each instance serves once as the test data.
Limitation: This approach takes a long time on large data sets, since the model must be trained n times, and it is rarely used in practice on large n.
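A minimal LOOCV sketch with scikit-learn (assumed installed; iris and logistic regression are stand-ins for illustration). Each of the 150 rounds trains on 149 points and tests on the single point left out:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 instances -> 150 train/test rounds

# Each round trains on n-1 = 149 points and tests on the 1 point left out,
# so each individual score is either 0 or 1.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

print(len(scores), round(scores.mean(), 3))  # the mean is the overall estimate
```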

K- Fold Cross validation:
In K-fold cross validation, the data is divided into k subsets. The holdout method is then repeated k times, such that each time one of the k subsets is used as the test set and the other k-1 subsets are put together to form the training set. The overall effectiveness is the mean of all k accuracies. The value of k can be any number; there is no fixed rule (in general k = 5 or 10).
Example:
If there are 1000 instances in the data set and we take k = 5, then each test set is 1000/5 = 200 instances and the remaining 800 form the training set. You get an accuracy for each fold, and then you take the mean accuracy as the overall effectiveness.
Limitation: K-fold is not effective for imbalanced data sets.
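A K-fold sketch with scikit-learn (assumed installed; iris and logistic regression are stand-ins for illustration). With 150 instances and k = 5, each fold of 30 instances serves once as the test set:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# k = 5 folds: each fold of 150/5 = 30 instances is the test set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# One accuracy per fold; the mean is the overall effectiveness.
print([round(s, 2) for s in scores], "mean:", round(scores.mean(), 2))
```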

Stratified K-Fold cross validation :
In some cases, there may be a large imbalance in the response variable. For instance, in a data set concerning the price of houses, there might be a large number of houses having a high price; or in the case of classification, there might be several times more negative samples than positive samples. For such problems (imbalanced data sets), a slight variation of the K-fold technique is used, such that each fold contains approximately the same percentage of samples of each target class as the complete set, or, in the case of prediction problems, the mean response value is approximately equal in all the folds. This is known as Stratified K-fold.
Considering the previous example of 1000 records, starting with a test set of the first 200 records is not always a good idea, since the records may be ordered by class. To deal with this problem we can use Stratified K-fold cross validation: we split the data such that the proportions between classes are the same in each fold as they are in the whole data set. The image below compares K-fold CV and Stratified CV (consider k = 3, i.e. three folds).
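The difference between the two splitters can be verified directly on a hypothetical imbalanced label vector (a sketch with scikit-learn and NumPy, both assumed installed; the 90/10 class split is an invented example):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0 followed by 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values do not affect the split itself

for name, cv in [("KFold", KFold(n_splits=5)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # Fraction of class-1 samples in each test fold.
    ratios = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, ratios)
```

Plain K-fold on these ordered labels produces folds with wildly different class mixes, while the stratified version keeps the 10% minority proportion in every fold.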

Time Series Cross Validation:
Used when the data is ordered in time (year, month, day, or a particular time of day), for example in stock market scenarios. Because shuffling would leak future information into training, the folds are built so that the model is always trained on past observations and tested on the observations that follow them.
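A sketch of this forward-chaining split with scikit-learn's `TimeSeriesSplit` (assumed installed; the ten ordered observations are an invented stand-in for, say, daily closing prices):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations in time order, e.g. daily closing prices.
X = np.arange(10).reshape(-1, 1)

# Each split trains only on the past and tests on the data that follows it.
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Note that every training set ends strictly before its test set begins, and later splits reuse all earlier data for training.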
