4) Cross-validation to reduce Overfitting
Cross-validation (CV) is part 4 of our article on how to reduce overfitting. It is a technique for testing the effectiveness of a machine learning model, and it is also a resampling procedure for evaluating a model when data is limited. To perform CV, we set aside a portion of the data that is not used to train the model and later use that portion for testing or validation.
There are a lot of different techniques that may be used to cross-validate a model. Still, all of them have a similar algorithm:
- Divide the dataset into two parts: one for training, the other for testing
- Train the model on the training set
- Validate the model on the test set
- Repeat the three steps above several times. The number of repetitions depends on the CV method you are using
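The generic loop above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`holdout_split`, `fit_mean_model`, `repeated_holdout`), the toy "model" that always predicts the training mean, and the sample data are all hypothetical and not part of any library.

```python
import random

def holdout_split(n_samples, test_fraction, rng):
    """Step 1: shuffle the indices and divide them into training and test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]

def fit_mean_model(y_train):
    """Step 2: 'train' a toy model that always predicts the training mean."""
    return sum(y_train) / len(y_train)

def mean_absolute_error(prediction, y_test):
    """Step 3: validate the model on the held-out test part."""
    return sum(abs(y - prediction) for y in y_test) / len(y_test)

def repeated_holdout(y, n_repeats=3, test_fraction=0.25, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):  # Step 4: repeat the three steps
        train_idx, test_idx = holdout_split(len(y), test_fraction, rng)
        model = fit_mean_model([y[i] for i in train_idx])
        scores.append(mean_absolute_error(model, [y[i] for i in test_idx]))
    return scores

# Hypothetical target values, used only for the demonstration.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
scores = repeated_holdout(y)
print(scores)  # one validation error per repetition
```

Each CV method mainly differs in how step 1 chooses the test part and in how many repetitions it performs.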
The k-fold cross-validation approach divides the input dataset into K groups of samples of equal size. These groups are called folds. In each iteration, the model is trained on K-1 folds, and the one remaining fold is used as the test set.
In K-fold cross-validation, K refers to the number of portions the dataset is divided into. K is selected based on the size of the dataset. The dataset is split into K portions: one portion is used for testing and the rest for training.
Then another portion is chosen for testing, and the remaining portions are used for training. This continues K times, until every portion has been used as the test set exactly once.
Let’s take the example of 5-fold cross-validation. The dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold once.
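To make the rotation of the test fold concrete, here is a small plain-Python sketch that prints which sample indices fall in the test and training sets on each of the 5 iterations. The function name `k_fold_indices` and the choice of 10 samples are assumptions made for this illustration; it does not shuffle the data, which real implementations often do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 10 samples split into 5 folds of 2 samples each.
for iteration, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Iteration {iteration}: test={test_idx} train={train_idx}")
```

On the first iteration the test fold holds samples 0 and 1, on the second it holds samples 2 and 3, and so on until every sample has been tested exactly once.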
The steps for k-fold cross-validation are:
- Split the input dataset into K groups
- For each group:
  - Take that group as the reserved (test) dataset.
  - Use the remaining groups as the training dataset.
  - Fit the model on the training set and evaluate its performance on the test set.
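The steps above can be sketched with scikit-learn's `KFold` splitter. The synthetic dataset, the logistic regression model, and the choice of 5 shuffled splits are assumptions made so the example is self-contained. Averaging the per-fold scores at the end is a common way to summarize the result, not something the steps above require.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, used here only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # split into K groups
fold_scores = []
for train_idx, test_idx in kfold.split(X):       # for each group
    X_train, X_test = X[train_idx], X[test_idx]  # one group is the test set
    y_train, y_test = y[train_idx], y[test_idx]  # the rest is training data
    model = LogisticRegression(max_iter=1000)    # fresh model for every fold
    model.fit(X_train, y_train)
    fold_scores.append(model.score(X_test, y_test))  # accuracy on the test fold

mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Creating a new model inside the loop keeps each fold's evaluation independent of the others.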
How to Determine the Value of K – The first step in the process is determining the value of k. Choosing this value well should help you build models with low bias. Typically, k is set to 5 or 10. For example, in recent versions of scikit-learn the default value of k is 5, which gives you 5 groups.
Another option is to set k to n, where n is the number of samples in the dataset. This is called Leave-One-Out, described in the next section.
Ultimately, the correct value of k depends on your dataset and the problem you’re trying to solve. A rule of thumb is to pick a value of k that ensures your training dataset is similar in distribution to the original dataset.
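One practical way to compare candidate values of k is to run the same model with each and look at the mean and spread of the fold scores. The sketch below assumes a synthetic dataset and a logistic regression model purely for illustration. Note that when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data and model chosen only for the demonstration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

results = {}
for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)  # one accuracy score per fold
    results[k] = scores
    print(f"k={k}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

A larger k trains on more data per fold but needs more model fits, so the comparison is also a trade-off against compute time.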
Variations of K-Fold – There are many variations of K-Fold, three of which are:
- Group K-Fold: This ensures that the same group is not represented in both testing and training sets.
- Stratified K-Fold: This preserves the percentage of samples of each class in every fold.
- Leave One Out: Here, we set the value of k to n, so each validation set contains only one data point.
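The three variations are available in scikit-learn as `GroupKFold`, `StratifiedKFold`, and `LeaveOneOut`. The sketch below runs each on a tiny toy dataset. The 12 samples, the imbalanced labels, and the group ids are all hypothetical values picked to make the behavior easy to see.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneOut, StratifiedKFold

X = np.arange(12).reshape(-1, 1)                         # 12 toy samples
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])       # imbalanced: 8 vs 4
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])  # hypothetical group ids

# Group K-Fold: a group never appears on both sides of a split.
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

# Stratified K-Fold: each test fold keeps the 2:1 class ratio of the full data.
for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    print("stratified test labels:", y[test_idx])

# Leave One Out: n splits, each with exactly one test sample.
loo = LeaveOneOut()
print("LOO splits:", loo.get_n_splits(X))
```

Leave One Out fits the model n times, so it can become slow on larger datasets.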
Beyond Learning, Start Practicing: scikit-learn provides cross-validation tools in its sklearn.model_selection module, and using them is a simple process. Here are a few links you can refer to: