Class mismatch is also termed class imbalance or data imbalance.
Let’s say I offer you a free drink that I claim is both sweet and tangy, but almost every sip tastes sweet and only one or two taste tangy. When a friend asks how your drink tastes, you will naturally say it is sweet. This is the problem with imbalanced classes, and it becomes more important in scenarios like claim fraud or credit fraud detection, disease screening, and spam filtering.
Suppose we train a machine learning model to tell spam emails apart from non-spam emails. The dataset contains 44 emails: 40 non-spam and 4 spam. The model is a standard algorithm that doesn’t take the class distribution into account. The result achieved is the following:
Here, the “Non-Spam” class is called the majority class, and the much smaller “Spam” class is called the minority class.
The model reaches 90.9% accuracy. Great result! But is it a good model? Obviously not! The model acts like a Zero Rule model: it labels every email as non-spam, so the 40 correct predictions out of 44 give 90.9% accuracy, while the rare class, which is the more interesting one, is ignored.
Accuracy counts every correct prediction the same regardless of its class, so on an imbalanced dataset the score is dominated by the majority class. That’s why it can’t be used as a measure of goodness for models working on an imbalanced dataset.
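To make the arithmetic concrete, here is a minimal sketch that rebuilds the spam example with plain Python. The label encoding (0 for non-spam, 1 for spam) and the helper names `accuracy` and `recall` are assumptions made for illustration, and the predictions simply mimic a model that always outputs the majority class.

```python
# Labels from the example: 40 non-spam (0) and 4 spam (1) emails.
y_true = [0] * 40 + [1] * 4
# A Zero Rule style model predicts the majority class for every email.
y_pred = [0] * 44


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Share of the true `label` examples that the model actually found."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not preds_for_label:
        return 0.0
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


acc = accuracy(y_true, y_pred)
spam_recall = recall(y_true, y_pred, 1)
non_spam_recall = recall(y_true, y_pred, 0)
# Balanced accuracy: the average of the per-class recalls.
balanced_acc = (spam_recall + non_spam_recall) / 2

print(f"accuracy:          {acc:.3f}")           # 0.909
print(f"spam recall:       {spam_recall:.3f}")   # 0.000
print(f"balanced accuracy: {balanced_acc:.3f}")  # 0.500
```

Accuracy looks excellent, yet spam recall is zero: not a single spam email is caught. A per-class metric such as recall, or an average of per-class recalls, exposes what plain accuracy hides.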
What is Class imbalance?
For a given classification problem, if the classes/targets within the dataset are not represented equally, i.e. when one class has many more observations than the others, then there is a class imbalance and the dataset is said to be imbalanced.
The classes with a higher representation are called majority classes, while the ones with lower representation are called minority classes.
[Figure: Balanced Data vs. Imbalanced Data]
Consider orange as the positive class and blue as the negative class.
Balanced Data: If the number of positive values in our dataset is approximately the same as the number of negative values, then we can say our dataset is balanced.
Imbalanced Data: If there is a very large difference between the number of positive values and the number of negative values, then we can say our dataset is imbalanced.
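A quick way to see which case a dataset falls into is to count the labels. The sketch below is only an illustration: the helper names `class_distribution` and `imbalance_ratio` and both toy label lists are hypothetical, and no particular ratio is claimed as the cut-off between balanced and imbalanced.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share of the dataset)} for each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, count / total) for label, count in counts.items()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy datasets, not taken from the article's figure.
balanced = ["positive"] * 50 + ["negative"] * 50
imbalanced = ["positive"] * 5 + ["negative"] * 95

print(class_distribution(balanced))    # both classes hold half of the rows
print(imbalance_ratio(balanced))       # 1.0
print(class_distribution(imbalanced))  # positive: 5 rows, negative: 95 rows
print(imbalance_ratio(imbalanced))     # 19.0
```

A ratio close to 1 means the classes are roughly equally represented; the larger the ratio, the more the majority class dominates.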
When one class has many more records than the other, our classifier may become biased towards predicting the majority class.
In problems with rare events, like fraud detection or disease prediction, it is vital to identify the minority class correctly. So the model should not be biased towards detecting only the majority class but should give equal weight or importance to the minority class too.
Here I discuss a few techniques that can deal with this problem. There is no right or wrong method; different techniques work well with different problems.
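One way to read "equal weight" is to weight each example inversely to its class frequency, so that both classes contribute the same total weight during training. The sketch below assumes that interpretation; it uses the heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The function name `balanced_class_weights` is hypothetical, and the labels reuse the 40/4 email split from the spam example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Same split as the spam example: 40 non-spam and 4 spam emails.
labels = ["non-spam"] * 40 + ["spam"] * 4
weights = balanced_class_weights(labels)

# Total weight per class: count * weight. With this heuristic both totals match.
total_weight = {label: labels.count(label) * w for label, w in weights.items()}

print(weights)
print(total_weight)
```

Each spam email ends up weighing ten times as much as a non-spam email, so misclassifying the rare class costs the model as much in total as misclassifying the common one.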
Handling the class imbalance problem:
These methods are widely known as ‘Sampling Methods’. Generally, they aim to turn an imbalanced dataset into a balanced one. They do this by changing the size of the original dataset so that the classes are represented in the same proportion.
These methods have gained importance because many studies have shown that balanced data results in better overall classification performance than imbalanced data. Hence, it’s important to learn them.
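The general idea of changing the dataset's size to balance the classes can be sketched with plain random resampling. This is a minimal illustration, not a specific method from any library: the function names, the `seed` parameter, and the toy data (row ids 0 to 43 standing in for the 44 emails) are all assumptions.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect the rows of each class into a dict {label: [rows]}."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all match the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of larger classes so all match the smallest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 44 row ids, 40 labelled non-spam and 4 labelled spam.
rows = list(range(44))
labels = ["non-spam"] * 40 + ["spam"] * 4

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)

print(Counter(over_labels))   # 40 rows per class
print(Counter(under_labels))  # 4 rows per class
```

Oversampling grows the dataset by repeating minority rows, while undersampling shrinks it by discarding majority rows. In either case, resampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.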
Below are the methods used to treat imbalanced datasets: