Borderline-SMOTE01
Relevant Paper: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
In this post I’m going to write about one of the variants of the SMOTE algorithm: borderline-SMOTE01. You can find my previous post related to SMOTE here.
Basically, borderline-SMOTE01 was developed to improve the performance of SMOTE, so its core algorithm is derived from SMOTE’s. The only difference lies in which samples the synthetic samples are created from. In SMOTE, synthetic samples are created from every original minority sample. In borderline-SMOTE01, synthetic samples are created only from the minority samples that lie on or near the borderline between the classes (we’ll call them borderline samples).
The primary rationale behind generating new samples from borderline samples is that they matter more for classification than the ones far from the borderline. In other words, borderline samples are more likely to be misclassified. Therefore, it’s important to strengthen them.
Alright, let’s delve into the algorithm!
Step 1. For each minority sample (let’s call it min_s), find its m nearest neighbors in the whole training set (so the neighbors may include majority samples). The number of majority samples among these m nearest neighbors is denoted by maj_num, where 0 <= maj_num <= m.
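The neighbor count in Step 1 can be sketched with NumPy as follows. This is only a minimal illustration, not a reference implementation: the function name count_majority_neighbors, the use of Euclidean distance, and the brute-force search are all assumptions made for the example.

```python
import numpy as np

def count_majority_neighbors(X, y, minority_label, m):
    """Return (indices of minority samples, maj_num for each of them).

    Assumptions for this sketch: Euclidean distance, brute-force search,
    and every label other than minority_label counts as majority.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    maj_num = np.empty(len(min_idx), dtype=int)
    for pos, i in enumerate(min_idx):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                 # a sample is not its own neighbor
        nearest = np.argsort(dist)[:m]   # m nearest neighbors from ALL samples
        maj_num[pos] = np.sum(y[nearest] != minority_label)
    return min_idx, maj_num
```

For larger data sets a tree-based neighbor search would usually replace the brute-force loop, but the counting logic stays the same.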
Step 2. In this step we’ll specify whether min_s is a borderline sample using the following conditional rules:
I. If maj_num = m, then all the nearest neighbors of min_s are majority samples. In this case, min_s is considered noise and will not be processed in the following steps.
II. If m / 2 <= maj_num < m, then the majority samples make up at least half of the nearest neighbors of min_s. In this case, min_s is prone to being misclassified and is therefore added to the list of borderline samples (borderline_samples).
III. If 0 <= maj_num < m / 2, then most of the nearest neighbors of min_s are minority samples, so min_s is not considered a borderline sample. In this case, min_s will not be processed further.
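The three rules of Step 2 can be written as a small function. This is a sketch under stated assumptions: the function name and the returned labels ("noise", "borderline", "safe") are chosen for illustration, and the upper bound of rule II is treated as exclusive so that maj_num = m falls only under rule I.

```python
def classify_minority_sample(maj_num, m):
    """Map maj_num (majority samples among the m nearest neighbors) to a category."""
    if not 0 <= maj_num <= m:
        raise ValueError("maj_num must satisfy 0 <= maj_num <= m")
    if maj_num == m:
        return "noise"        # rule I: every neighbor is a majority sample
    if 2 * maj_num >= m:      # same test as m / 2 <= maj_num, without float division
        return "borderline"   # rule II: added to borderline_samples
    return "safe"             # rule III: not processed further

# Example with m = 5: maj_num of 3 or 4 is borderline, 5 is noise, 0 to 2 is safe.
labels = [classify_minority_sample(n, 5) for n in range(6)]
```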
Step 3. For each minority sample (b_sample) in borderline_samples, do the following:
a) Find its k nearest neighbors among the minority samples only
b) Randomly select x of those k nearest neighbors (1 <= x <= k)
c) For each sample in the x nearest neighbors (x_nearest):
c.1) Calculate the difference between x_nearest and b_sample, that is, x_nearest minus b_sample (diff_b_sample_x). The output is a feature vector
c.2) Multiply diff_b_sample_x by a random number between 0 and 1. Suppose the output of this multiplication step is multi_diff_randnum
c.3) Add multi_diff_randnum to b_sample. The result is a new synthetic sample that lies on the line segment between b_sample and x_nearest
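Step 3 for a single borderline sample can be sketched as below. Again, this is a hedged illustration rather than a definitive implementation: the function name synthesize_from_borderline, the Euclidean distance, and passing the borderline sample by its row index (b_index) are assumptions for the example. The difference is taken as neighbor minus b_sample so that each synthetic sample falls between the two points.

```python
import numpy as np

def synthesize_from_borderline(minority, b_index, k, x, rng):
    """Create x synthetic samples from the borderline sample minority[b_index]."""
    minority = np.asarray(minority, dtype=float)
    if not 1 <= x <= k < len(minority):
        raise ValueError("need 1 <= x <= k < number of minority samples")
    b_sample = minority[b_index]
    # a) k nearest neighbors among minority samples only (excluding itself)
    dist = np.linalg.norm(minority - b_sample, axis=1)
    dist[b_index] = np.inf
    k_nearest = np.argsort(dist)[:k]
    # b) randomly pick x of the k neighbors, without repetition
    chosen = rng.choice(k_nearest, size=x, replace=False)
    synthetic = []
    for j in chosen:
        diff_b_sample_x = minority[j] - b_sample              # c.1
        multi_diff_randnum = rng.random() * diff_b_sample_x   # c.2, factor in [0, 1)
        synthetic.append(b_sample + multi_diff_randnum)       # c.3
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
new_samples = synthesize_from_borderline(minority, b_index=0, k=2, x=2, rng=rng)
```

In the usage lines, the two nearest minority neighbors of (0, 0) are (1, 0) and (0, 1), so each synthetic sample ends up somewhere on one of those two segments.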
I hope this post gives you a basic understanding of the borderline-SMOTE01 algorithm. Feel free to comment if you find any irrelevant or missing information.