This method works on the majority class. When one class is in abundance, undersampling reduces the size of that abundant class to balance the dataset: it keeps all of the rare events but draws only a subset of the abundant events, producing two equally sized classes.
This method is best used when the dataset is huge and reducing the number of training samples cuts run time and storage requirements.
Here you take the majority class and select a subset of its observations so that their number matches the size of the minority class.
Undersampling methods are of two types: random and informative.
a. Random Undersampling:
- Randomly delete examples in the majority class.
- Because undersampling shrinks the data size, less time is needed for learning.
The disadvantage is that discarding majority examples may lose useful information about the majority class.
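Random undersampling can be sketched in a few lines of NumPy. The function name `random_undersample` and its parameters are illustrative, not from any library:

```python
import numpy as np

def random_undersample(X, y, majority_label, minority_label, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y == minority_label)
    # Keep a random subset of the majority class, same size as the minority class
    keep = rng.choice(maj_idx, size=min_idx.size, replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

# Toy data: 90 majority (label 0) vs 10 minority (label 1) examples
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = random_undersample(X, y, majority_label=0, minority_label=1)
print(np.bincount(y_bal))  # [10 10]
```

In practice you would use a maintained implementation such as `RandomUnderSampler` from the imbalanced-learn package, which follows the scikit-learn `fit_resample` convention.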
b. Informative undersampling:
It follows a pre-specified selection criterion to remove the observations from the majority class.
Within informative undersampling, the EasyEnsemble and BalanceCascade algorithms are known to produce good results, and both are easy to understand and implement.
EasyEnsemble first extracts several independent subsets of samples (with replacement) from the majority class. It then trains a separate classifier on the combination of each subset with the full minority class.
In this sense its sampling step is unsupervised: the subsets are drawn at random, with no feedback from any model.
BalanceCascade, by contrast, takes a supervised approach: it builds the ensemble of classifiers sequentially and systematically selects which majority-class examples to keep, discarding those that the classifiers trained so far already handle correctly.
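The EasyEnsemble idea can be sketched in plain NumPy. This is a minimal illustration, not the published algorithm: the real method typically uses AdaBoost members, whereas here each member is a simple nearest-centroid classifier, and all helper names (`easy_ensemble_fit`, `easy_ensemble_predict`) are made up for this example:

```python
import numpy as np

def easy_ensemble_fit(X, y, majority_label, n_subsets=5, seed=0):
    """Draw several majority-class subsets (with replacement), pair each with
    the full minority class, and fit one nearest-centroid model per pair."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    models = []
    for _ in range(n_subsets):
        sub = rng.choice(maj_idx, size=min_idx.size, replace=True)
        idx = np.concatenate([sub, min_idx])
        Xs, ys = X[idx], y[idx]
        # "Classifier" here is just one centroid per class
        models.append({c: Xs[ys == c].mean(axis=0) for c in np.unique(ys)})
    return models

def easy_ensemble_predict(models, X):
    """Each member votes for its nearest centroid; the majority vote wins."""
    votes = []
    for centroids in models:
        labels = np.array(list(centroids))
        dists = np.stack([np.linalg.norm(X - centroids[c], axis=1)
                          for c in labels])
        votes.append(labels[np.argmin(dists, axis=0)])
    votes = np.stack(votes)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy demo: two well-separated clusters, 90 majority vs 10 minority examples
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(90, 2)),
               rng.normal(5.0, 0.5, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)
pred = easy_ensemble_predict(easy_ensemble_fit(X, y, majority_label=0), X)
print((pred == y).mean())
```

Because every member sees all the minority samples but only a slice of the majority class, the ensemble as a whole covers far more of the majority class than any single undersampled training set would.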
Do you see any problem with undersampling methods? Removing observations may cause the training data to lose important information about the majority class.