Anomaly detection, also called outlier detection, is the identification of data points, events, or observations that deviate significantly from the majority of the data and do not conform to a pre-defined notion of normal behavior. It is typically carried out to prevent fraud and to keep a system or model secure. But before we talk about anomaly detection, let us first understand what an anomaly is.
What is an anomaly?
An anomaly is a data point that differs from the standard data points. It can be an exception or a piece of false information that is significantly different from normal behavior.
There are three broad categories of anomalies:
When a data point has a value that lies far outside the range of all the other data points, it is called a global outlier. For example, if you normally withdraw 1000 rupees every day from the bank and suddenly withdraw 1 lakh rupees (100,000), the bank would see this as an anomaly.
A contextual outlier is a data point whose value does not match the general trend for comparable data points. For example, suppose an analysis group determined that men who weigh 60 kg and are between 20 and 30 years old consume 1500 calories every day. A 60 kg man in that age range who consumes 4500 calories would be a contextual outlier.
Collective outliers are a subset or sub-group of data points that deviate, as a group, from the rest of the data, even if each point looks unremarkable on its own. The example below explains collective outliers in a very easy way.
"Every one of your neighbors moving out of the neighborhood on the same day is a collective outlier because although it’s definitely not rare that people move from one residence to the next, it is very unusual that an entire neighborhood relocates at the same time." (Towards Data Science)
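To make the bank withdrawal example of a global outlier concrete, here is a minimal sketch in plain Python. It is not a method taken from the article: it assumes a robust "modified z-score" (distance from the median in units of the median absolute deviation), and the withdrawal amounts and the cut-off of 3.5 are illustrative assumptions.

```python
import statistics

def modified_z_scores(values):
    """Score each value by its distance from the median, in units of the MAD."""
    median = statistics.median(values)
    # MAD = median absolute deviation from the median
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    return [0.6745 * (v - median) / mad for v in values]

# Hypothetical daily withdrawals in rupees; the last one is 1 lakh
withdrawals = [1000, 1000, 1200, 900, 1000, 1100, 100000]

THRESHOLD = 3.5  # assumed cut-off; tune it for your own data
scores = modified_z_scores(withdrawals)
global_outliers = [v for v, s in zip(withdrawals, scores) if abs(s) > THRESHOLD]
print(global_outliers)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would otherwise inflate the very statistics that are supposed to expose it.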
What are the types of anomaly detection methods?
Machine learning offers three main kinds of anomaly detection methods: supervised, unsupervised, and semi-supervised.
In supervised anomaly detection, you have a dataset in which every entry is labeled as normal or abnormal, and a machine learning model is trained on it. This approach is rarely used in practice: real datasets are huge, the normal and abnormal classes are highly imbalanced, and labeling every point is tedious.
Unsupervised anomaly detection is the most common type, and neural networks are among the most commonly used unsupervised algorithms. Artificial neural networks do not need labeled data to detect anomalies, and they can also be applied to unstructured data. This matters because it is impossible to foresee every anomaly in every situation. For example, if you want to detect credit card fraud, you cannot train a model on all possible situations, because there are many scenarios and new ones can always arise. The main problem with this method is that unsupervised algorithms, as the name suggests, are not supervised: we do not know in advance which kinds of data points will be labeled as anomalies, so the model may flag valid data points, and such mistakes are difficult to correct. Unsupervised anomaly detection techniques may therefore be less accurate than supervised ones.
Semi-supervised anomaly detection combines supervised and unsupervised methods. You use unsupervised learning to classify the data points and then rely on human supervision to monitor what is being learned and correct it. This method can therefore be very accurate.
Machine learning algorithms for anomaly detection
Multiple machine learning algorithms can be used for anomaly detection depending on the dataset size and the type of the problem.
Local outlier factor
The local outlier factor (LOF) is a fairly common technique for anomaly detection. The algorithm estimates the local density of each data point from the distances to its n nearest neighbors and compares it with the local densities of those neighbors. If a data point has a substantially lower density than its neighbors, it is considered an outlier.
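The density comparison behind the local outlier factor can be sketched in plain Python. This is a simplified illustration, not a reference implementation: it takes exactly k neighbors even when distances are tied, assumes no duplicate points, and the toy grid plus the far-away point are made-up data.

```python
import math

def k_nearest(points, i, k):
    """Return the k nearest neighbors of point i as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def local_outlier_factors(points, k):
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor
    k_dist = [nb[-1][0] for nb in neighbors]
    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors divided by the point's own density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i])
        for i in range(n)
    ]

# Toy data: a dense 3x3 grid plus one isolated point (index 9)
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
lof = local_outlier_factors(points, k=3)
most_anomalous = max(range(len(points)), key=lambda i: lof[i])
print(points[most_anomalous], round(lof[most_anomalous], 2))
```

Scores close to 1 mean a point is about as dense as its neighbors; scores well above 1 mean it sits in a much sparser region. Where the cut-off lies is a choice you make for your data.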
K-nearest neighbors (kNN) is best known as a supervised ML algorithm for classification, but in anomaly detection it is used as an unsupervised algorithm. In this setting kNN does not perform any actual learning: each point is scored by its distance to its nearest neighbors, and you define the threshold values that determine whether a data point is an outlier. A benefit of kNN is that it works well on both small and large datasets.
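A minimal sketch of kNN used this way, in plain Python: each point is scored by the distance to its k-th nearest neighbor, and anything above a hand-picked threshold is flagged. The data, `k = 2`, and the threshold of 1.0 are illustrative assumptions, not values from the article.

```python
import math

def kth_neighbor_distance(points, i, k):
    """Distance from point i to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(
        math.dist(points[i], points[j])
        for j in range(len(points)) if j != i
    )
    return dists[k - 1]

# Toy data: five points close together and one far away (index 5)
samples = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (1.0, 0.8), (5.0, 5.0)]

K = 2                      # number of neighbors to look at (assumed)
DISTANCE_THRESHOLD = 1.0   # assumed cut-off; there is no training step to set it

knn_scores = [kth_neighbor_distance(samples, i, K) for i in range(len(samples))]
flagged = [i for i, score in enumerate(knn_scores) if score > DISTANCE_THRESHOLD]
print(flagged)
```

Because nothing is learned, the quality of the result depends entirely on the choice of k and of the threshold. This brute-force version compares every pair of points, so larger datasets would need a spatial index or a library implementation.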
Support vector machines
A support vector machine (SVM) is also a supervised machine learning algorithm, often used for classification. It is usually applied to problems with two or more classes, but in anomaly detection SVMs are used for one-class problems. We train the model to learn what normal behavior looks like so that, given new data, it can classify each point as normal or anomalous. A hyperparameter sets the threshold for outliers.
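As a hedged sketch of the one-class setup, the snippet below uses scikit-learn's `OneClassSVM`. The article does not name a library, so the choice of scikit-learn, the synthetic training data, and the settings `nu=0.05` and `gamma="scale"` are all assumptions. Here `nu` plays the role of the outlier threshold: it is an upper bound on the fraction of training points treated as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic "normal" behavior: 200 points scattered around the origin
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Train on normal data only; nu bounds the share of training outliers
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# One point near the training data, one far away from it
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
predictions = model.predict(X_new)  # +1 means normal, -1 means anomaly
print(predictions)
```

Raising `nu` makes the model stricter about what counts as normal, which flags more points at the cost of more false alarms.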
DBSCAN, short for Density-Based Spatial Clustering of Applications with Noise, is a density-based unsupervised ML algorithm. It takes data points with multiple dimensions as input and creates clusters according to parameters such as the minimum number of samples and a distance threshold. Points that do not belong to any cluster get their own class, so they are easy to identify as anomalies. This algorithm handles outliers even when the data is represented by non-discrete data points. For a really cool example, see the paper on flight anomaly detection using DBSCAN (Sheridan et al., 2020) listed at the end of this article.
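The clustering step can be sketched in plain Python. This is a compact illustration of the DBSCAN idea under assumed parameters (`eps=1.5` as the distance threshold, `min_samples=3`) and made-up points. It scans all pairs of points, so it only suits small datasets.

```python
import math

NOISE = -1  # label for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster += 1
    return labels

# Toy data: two tight groups and one isolated point (the last one)
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (8, 8), (8, 9), (9, 8), (9, 9),
        (4, 15)]
labels = dbscan(data, eps=1.5, min_samples=3)
anomalies = [p for p, label in zip(data, labels) if label == NOISE]
print(labels, anomalies)
```

Points left with the noise label are the candidates for anomalies. Both parameters interact: a larger `eps` or a smaller `min_samples` absorbs more points into clusters and leaves fewer flagged as noise.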
Autoencoders use artificial neural networks (ANNs) to encode data into a lower-dimensional representation and then try to reconstruct the original data from it. When the input is an outlier, the autoencoder is not able to reconstruct it accurately, so we can identify such a case as an outlier. For image data, for example, the distortion between the input image and the output image is calculated. This distortion tells us whether the image is an anomaly, since the output image cannot be reconstructed properly if the input is an anomaly.
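The reconstruction-error idea can be shown with a deliberately tiny stand-in for a real autoencoder: a linear model with a single hidden unit and tied encoder/decoder weights, trained by gradient descent in plain Python. Real autoencoders are nonlinear, multi-layer networks built with a deep learning framework. The 2-D data, learning rate, epoch count, and the "3x the worst training error" threshold are all assumptions made for this sketch.

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def reconstruction_error(w, x):
    """Squared distance between x and its reconstruction decode(encode(x))."""
    z = dot(w, x)                                 # encode: 2-D -> 1-D
    residual = (x[0] - z * w[0], x[1] - z * w[1])  # x minus decode(z)
    return dot(residual, residual)

def train(points, epochs=500, lr=0.1):
    """Fit the tied weights w by batch gradient descent on the mean error."""
    w = [1.0, 0.0]  # fixed starting point keeps the sketch deterministic
    n = len(points)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in points:
            z = dot(w, x)
            r = (x[0] - z * w[0], x[1] - z * w[1])
            rw = dot(r, w)
            for d in range(2):
                # derivative of |x - (w.x) w|^2 with respect to w[d]
                grad[d] += -2.0 * (rw * x[d] + z * r[d]) / n
        w = [w[d] - lr * grad[d] for d in range(2)]
    return w

# "Normal" data: points close to the line y = x, roughly centered at the origin
normal = [(k / 10, k / 10 + (0.05 if k % 2 == 0 else -0.05)) for k in range(-10, 11)]
weights = train(normal)

# Assumed rule: flag anything reconstructed 3x worse than the worst training point
threshold = 3 * max(reconstruction_error(weights, x) for x in normal)

def is_anomaly(x):
    return reconstruction_error(weights, x) > threshold

print(is_anomaly((0.5, 0.52)), is_anomaly((1.0, -1.0)))
```

The model can only pass one number through its bottleneck, so it learns the direction along which normal points vary. A point far from that direction cannot be rebuilt from the single number and ends up with a large reconstruction error.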
Bayesian networks enable us to discover anomalies even in high-dimensional data. This method is used when the anomalies we are looking for are subtle and hard to discover, and when visualizing them on a plot might not produce the desired results. It works for both discrete and continuous variables.
What is anomaly detection used for?
- Intrusion detection systems for cybersecurity
- Fraud detection by banks
- Health monitoring by machines
- Automatic defect detection of products
Some links in case you want to learn more about the topic
- Sheridan, Kevin & Puranik, Tejas & Mangortey, Eugene & Pinon, Olivia & Kirby, Michelle & Mavris, Dimitri. (2020). An Application of DBSCAN Clustering for Flight Anomaly Detection During the Approach Phase. 10.2514/6.2020-1851.
- Park, Seonho & Adosoglou, George & Pardalos, P. (2020). Interpreting Rate-Distortion of Variational Autoencoder and Using Model Uncertainty for Anomaly Detection.
- Abu Sulayman, Iman & Ouda, Abdelkader. (2018). Data Analytics Methods for Anomaly Detection: Evolution and Recommendations. 1-4. 10.1109/CSPIS.2018.8642713.