Imbalanced Data Classification
Most real-world data sets are imbalanced by nature. An imbalanced class distribution is a scenario in which the number of observations belonging to one class is significantly lower than the number belonging to the other classes. This becomes a problem because machine learning algorithms are usually designed to improve accuracy by reducing the overall error. Thus, they do not take into account the class distribution (the proportion or balance of classes).
Accuracy Paradox
The accuracy paradox is the case where your accuracy measure suggests an excellent model (such as 90% accuracy), but the figure only reflects the underlying class distribution.
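A small sketch of the paradox, using a hypothetical data set of 90 majority and 10 minority labels and a trivial classifier that always predicts the majority class (both are illustrative assumptions, not taken from real data):

```python
# Hypothetical labels: 90 majority (0) and 10 minority (1) examples.
y_true = [0] * 90 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of minority examples that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.90
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The 90% accuracy here comes entirely from the class proportions; the model never finds a single minority example.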
Overview
There are several options and approaches for dealing with an imbalanced data set:
- Collect More Data
  - Not applicable in many cases.
- Change Performance Metrics For Choosing the Model
  - Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  - ROC Curves: As with precision and recall, accuracy is split into sensitivity and specificity, and a model can be chosen based on how it balances the two across thresholds.
- Resample Dataset
  - Use under-sampling, over-sampling, or both to even up the classes.
  - Undersampling techniques include T-Link, NearMiss, …
  - Oversampling techniques include cluster-based methods, generating synthetic samples (SMOTE), …
- Penalized Models (Cost-Sensitive Learning)
  - Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training.
  - These penalties can bias the model to pay more attention to the minority class.
  - For example, Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification. (Note that setting up the penalty matrix can be complex.)
- Algorithmic Ensemble Techniques
  - Modify existing classification algorithms to make them appropriate for imbalanced data sets.
  - Bootstrap Aggregating: Generate multiple bootstrap training samples (drawn with replacement), train the algorithm on each bootstrapped sample separately, and aggregate the predictions at the end.
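To make the kappa metric listed above concrete, here is a minimal sketch that computes Cohen's kappa as (observed agreement - expected agreement) / (1 - expected agreement). The function name `cohen_kappa` and the toy labels are illustrative assumptions, and returning 0.0 in the degenerate single-class case is a convention chosen here.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if predictions were independent of the labels
    # but kept the same marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: one class everywhere (convention assumed here)
    return (observed - expected) / (1.0 - expected)

y_true = [0] * 90 + [1] * 10
always_majority = [0] * 100
print(cohen_kappa(y_true, always_majority))  # 0.0
```

A model that always predicts the majority class reaches 90% accuracy on these labels but a kappa of 0, which is why kappa is a more informative metric under imbalance.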
Here we focus on resampling techniques, since they are widely used for tackling the class imbalance problem.
UnderSampling Techniques
Undersampling aims to balance the class distribution by reducing the number of majority class examples.
Random Under-Sampling
Random Undersampling (RUS) aims to balance class distribution by randomly eliminating majority class examples.
The advantage of this method is:
- It can improve run time and reduce storage requirements by shrinking the training set when the training data is huge.
The disadvantages of this method are:
- It can discard potentially useful information which could be important for building rule classifiers.
- The sample chosen by random undersampling may be a biased sample.
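A minimal sketch of random undersampling, assuming the data is held in plain Python lists. The helper name `random_undersample` and the toy data are hypothetical, and every class is reduced to the size of the smallest one.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop examples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        out_samples.extend(kept)
        out_labels.extend([y] * target)
    return out_samples, out_labels

X = list(range(100))
y = [0] * 90 + [1] * 10
X_res, y_res = random_undersample(X, y)
print(y_res.count(0), y_res.count(1))  # 10 10
```

Eighty of the ninety majority examples are thrown away here, which illustrates the information-loss risk listed above.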
Clustering based under-sampling
A clustering technique is employed to resample the original training set into a smaller set of representative training examples, represented by weighted cluster centers or by selected samples from each cluster.
The advantage of this method is:
- Compared with random under-sampling, cluster-based under-sampling is better at avoiding the loss of important majority class information.
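One possible reading of the idea, sketched with a plain k-means (Lloyd's algorithm): the majority class is replaced by one cluster center per minority example. The function names, the choice of k, seeding the centers with the first k points, and ignoring cluster weights are all simplifying assumptions of this sketch rather than details from the papers listed here.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centers (a simplification)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

def cluster_undersample(majority, minority):
    """Replace the majority class with one cluster center per minority example."""
    return kmeans(majority, k=len(minority))

majority = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (10.0, 1.0)]
minority = [(5.0, 5.0), (5.0, 6.0)]
print(cluster_undersample(majority, minority))  # [(0.0, 0.5), (10.0, 0.5)]
```

Each retained point summarizes a region of the majority class instead of being an arbitrary survivor of random deletion.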
Paper Related:
- Cluster-based majority under-sampling approaches for class imbalance learning
- Learning pattern classification tasks with imbalanced data sets
- A supervised learning approach for imbalanced data sets
Tomek Link (T-Link)
Tomek links are used to remove unwanted overlap between classes: majority class examples that form links are removed until all minimally distanced nearest-neighbor pairs belong to the same class.
Let x be an instance of class A and y an instance of class B. Let d(x, y) be the distance between x and y.
(x, y) is a T-Link if there is no instance z such that d(x, z) < d(x, y) or d(y, z) < d(x, y). In other words, x and y are each other's nearest neighbors but belong to different classes.
If two examples form a T-Link, then either one of them is noise or both lie on the boundary between the classes. The T-Link method can therefore be used for guided undersampling, in which the majority class observations in each link are removed.
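A brute-force sketch of T-Link detection, assuming Euclidean distance and no exact distance ties: a pair of examples is reported when the two are each other's nearest neighbor and carry different labels. The function name `tomek_links` and the one-dimensional toy data are illustrative.

```python
import math

def tomek_links(samples, labels):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    def nearest(i):
        others = (j for j in range(len(samples)) if j != i)
        return min(others, key=lambda j: math.dist(samples[i], samples[j]))

    nn = [nearest(i) for i in range(len(samples))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

X = [(0.0,), (1.2,), (2.0,), (2.4,), (5.0,)]
y = [0, 0, 0, 1, 1]  # class 0 is the majority

links = tomek_links(X, y)
# Guided undersampling: drop only the majority member of each link.
drop = {i for pair in links for i in pair if y[i] == 0}
X_res = [x for i, x in enumerate(X) if i not in drop]

print(links)  # [(2, 3)]
print(X_res)  # [(0.0,), (1.2,), (2.4,), (5.0,)]
```

The pairwise search is quadratic in the number of samples, so a spatial index would be needed for large data sets.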
Nearmiss Method
“NearMiss-1” selects the majority class samples whose average distance to the three closest minority class samples is the smallest.
“NearMiss-2” selects the majority class samples whose average distance to the three farthest minority class samples is the smallest.
“NearMiss-3” selects a given number of the closest majority class samples for each minority class sample.
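A minimal sketch of the NearMiss-1 rule, assuming Euclidean distance and points stored as tuples. The function name `nearmiss1`, the `n_keep` parameter, and the toy points are hypothetical.

```python
import math

def nearmiss1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k closest minority points (NearMiss-1)."""
    def score(p):
        dists = sorted(math.dist(p, q) for q in minority)
        closest = dists[:k]
        return sum(closest) / len(closest)
    return sorted(majority, key=score)[:n_keep]

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (2.0, 2.0)]
print(nearmiss1(majority, minority, n_keep=2))  # [(1.0, 1.0), (2.0, 2.0)]
```

In this sketch, NearMiss-2 would differ only in averaging over the k farthest minority points, that is, `dists[-k:]` instead of `dists[:k]`.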
One-Sided Selection
One-Sided Selection removes noisy, borderline, and redundant examples from the majority class.
The majority samples are roughly divided into 4 groups:
- Noise examples: The examples that are extremely close to any minority example.
- Borderline examples: The examples that are close to the boundary between positive and negative regions. These examples are unreliable.
- Redundant examples: The examples whose role can be taken over by other examples, which means they provide no additional useful information.
- Safe examples: The examples that are worth keeping for future classification tasks.
OverSampling Techniques
Oversampling aims to balance class distribution by increasing the number of minority class examples.
Random Over-Sampling
Random Over-Sampling (ROS) increases the number of instances in the minority class by randomly replicating them in order to present a higher representation of the minority class in the sample.
Advantages of this method are:
- Unlike undersampling, this method leads to no information loss.
- It often outperforms undersampling.
Disadvantage of this method is:
- It increases the likelihood of overfitting since it replicates the minority class events.
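A minimal sketch of random oversampling, assuming plain Python lists. The helper name `random_oversample` and the toy data are hypothetical, and every class is grown to the size of the largest one.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        extra = rng.choices(xs, k=target - len(xs))  # sampling with replacement
        out_samples.extend(xs + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels

X = list(range(100))
y = [0] * 90 + [1] * 10
X_res, y_res = random_oversample(X, y)
print(y_res.count(0), y_res.count(1))  # 90 90
```

The 80 added minority rows are exact copies of the original 10, which is where the overfitting risk comes from.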
Cluster-Based Oversampling
In this case, the K-means clustering algorithm is independently applied to minority and majority class instances. This is to identify clusters in the dataset. Subsequently, each cluster is oversampled such that all clusters of the same class have an equal number of instances and all classes have the same size.
Cluster-based oversampling (CBOS) was proposed by Jo & Japkowicz in 2004. CBOS attempts to even out the between-class imbalance as well as the within-class imbalance. There may be subsets of the examples of one class that are isolated in the feature space from other examples of the same class, creating a within-class imbalance. Small subsets of isolated examples are called small disjuncts. Small disjuncts often degrade classifier performance, and CBOS aims to eliminate them without removing data.
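The sizing rule described above can be worked through with a small calculation. This sketch assumes the clusters have already been found, that majority clusters grow to the size of the largest majority cluster, and that the minority clusters then share the resulting majority total evenly; the function name `cbos_targets` and the cluster sizes are made up for illustration.

```python
def cbos_targets(majority_cluster_sizes, minority_cluster_sizes):
    """Target size per cluster so clusters within a class are equal and classes match."""
    maj_target = max(majority_cluster_sizes)
    maj_total = maj_target * len(majority_cluster_sizes)
    # Split the majority total evenly over the minority clusters (rounded up).
    min_target = -(-maj_total // len(minority_cluster_sizes))
    return maj_target, min_target

# Majority clusters of 10, 5 and 5 examples; minority clusters of 3 and 2.
print(cbos_targets([10, 5, 5], [3, 2]))  # (10, 15)
```

With these numbers each majority cluster is oversampled to 10 examples (30 in total) and each minority cluster to 15 (also 30 in total), so both the between-class and within-class imbalance are removed.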
Paper Related:
- Concept-learning in the presence of between-class and within-class imbalances
- Class imbalances versus small disjuncts
Synthetic Minority Oversampling Technique (SMOTE)
SMOTE produces synthetic minority class samples by taking a minority sample S, selecting some of its nearest minority neighbors, and generating new minority class samples along the line segments between S and each selected neighbor.
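A simplified sketch of the interpolation step, assuming Euclidean distance and at least two minority points. Unlike the original formulation, which loops over every minority sample, this version picks the base sample S at random for each synthetic point; the function name `smote` and its parameters are illustrative.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points on segments between minority points and their neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        s = minority[i]
        # k nearest minority neighbors of s, excluding s itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(s, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from s to nb, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(s, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating rows exactly.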
The adaptive synthetic sampling approach (ADASYN)
The ADASYN algorithm builds on the methodology of SMOTE. It uses a weighted distribution over the minority class examples according to how difficult they are to learn, so that more synthetic data is generated for minority class examples that are harder to learn.
The key idea of ADASYN is to use a density distribution r_i as a criterion to automatically decide the number of synthetic samples that need to be generated for each minority example.
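A sketch of how r_i can drive the allocation, assuming r_i is the share of majority examples among the k nearest neighbors of minority example i, and that the total number of synthetic samples is (majority count - minority count) * beta. The function name `adasyn_allocation`, the `beta` parameter name, and the one-dimensional toy data are assumptions of this sketch.

```python
import math

def adasyn_allocation(minority, majority, beta=1.0, k=5):
    """Number of synthetic samples to generate for each minority example."""
    data = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    total = (len(majority) - len(minority)) * beta  # synthetic samples needed overall
    ratios = []
    for i, p in enumerate(minority):
        others = [d for j, d in enumerate(data) if j != i]
        nearest = sorted(others, key=lambda d: math.dist(p, d[0]))[:k]
        # r_i: share of majority examples among the k nearest neighbors
        ratios.append(sum(label == 0 for _, label in nearest) / len(nearest))
    norm = sum(ratios)
    if norm == 0:
        return [0] * len(minority)  # no minority point has majority neighbors
    # Normalize r_i so the per-example counts add up to roughly `total`.
    return [round(r / norm * total) for r in ratios]

minority = [(0.0,), (0.1,), (0.2,), (5.0,)]
majority = [(4.8,), (5.2,), (5.4,), (9.0,), (9.5,), (10.0,), (11.0,), (12.0,)]
print(adasyn_allocation(minority, majority, k=2))  # [0, 0, 0, 4]
```

The three minority points surrounded by other minority points receive no synthetic samples, while the one sitting among majority points (the hard-to-learn example) receives all four. Rounding means the counts may not sum exactly to the total in general.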
References
- How to handle Imbalanced Classification Problems in machine learning?
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- Oversampling and undersampling in data analysis
- Learning pattern classification tasks with imbalanced data sets
- Cluster-based under-sampling approaches for imbalanced data distributions
- Data Hacker (數據嗨客), Issue 6: Handling Imbalanced Data
- Addressing the Curse of Imbalanced Training Sets: One-Sided Selection
- Experimental Perspectives on Learning from Imbalanced Data
- A Multiple Resampling Method for Learning from Imbalanced Data Sets