7 Ways to Handle Imbalanced Data in Machine Learning

Handling imbalanced data is a constant concern when training machine learning models in Python. In this article, we will learn 7 different ways to handle imbalanced data in machine learning: 1. random undersampling, 2. random oversampling, 3. undersampling with the imblearn module, 4. oversampling with the imblearn module, 5. Tomek links, 6. synthetic minority oversampling (SMOTE), and 7. NearMiss. We will implement each method in Python, and we will also see why it is important to have balanced data when training a model.

Introduction to Imbalanced Data in Machine Learning

A dataset with skewed class proportions is called an imbalanced dataset. For example, if we have a dataset of dogs and cats with 10,000 images of dogs and only 100 images of cats, the data is considered imbalanced. Predictions on such a dataset will be heavily biased because the model interacts far more with one class during training.
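To put a number on the skew, here is a minimal sketch (using a made-up label array) that computes the imbalance ratio:

# a made-up label array: 10000 dogs (class 0) and 100 cats (class 1)
import numpy as np

labels = np.array([0] * 10000 + [1] * 100)
# count how many samples fall into each class
counts = np.bincount(labels)
print('class counts:', counts)                                   # [10000   100]
print('imbalance ratio: %d:1' % (counts.max() // counts.min()))  # 100:1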


Why Do We Need Balanced Data in Machine Learning?

Imbalanced classification poses a challenge for predictive modeling because most machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class. This results in models with poor predictive performance, specifically for the minority class. That is a problem because the minority class is typically the more important one, so the problem is more sensitive to classification errors on the minority class than on the majority class.

Another issue is that, with so few positives relative to negatives, the model spends most of its training on negative examples and does not learn enough from the positive ones.

Also, suppose we have a binary classification problem where 90% of the data belongs to one class and the remaining 10% to the other. A model that simply always predicts the majority class then reaches 90% accuracy without learning anything useful, so accuracy alone is a misleading metric on imbalanced data.
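As a minimal sketch of this trap (using scikit-learn's DummyClassifier on a synthetic 90/10 dataset), a model that always predicts the majority class scores roughly 90% accuracy without looking at the features at all:

# demonstrating the accuracy trap on a synthetic 90/10 class split
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# always predicts the most frequent class and ignores the features
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('accuracy:', dummy.score(X_test, y_test))  # roughly 0.9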

Different Methods to Handle Imbalanced Data in Machine Learning

In this section, we will look at various methods to handle imbalanced data in machine learning. For demonstration purposes, we will be using an imbalanced dataset from Kaggle.

Let us first import the dataset and get familiar with it before applying different methods to handle imbalanced data in Machine learning.

# importing pandas
import pandas as pd
# loading dataset
data = pd.read_csv('data.csv')
# heading
data.head()

Output:

[Image: the first five rows of the dataset]

As you can see, we have a dataset whose target column, Class, has two output values. Let us plot a bar chart to see whether the output categories are balanced.

# importing the seaborn module
import seaborn as sns
# plotting the bar chart of class counts
sns.countplot(data=data, x='Class')

Output:

[Image: bar chart of the two class counts, showing a heavy imbalance]

As you can see, the data is highly imbalanced. Now, let us jump into the different methods for handling it.

Resampling Techniques

Resampling is one of the most widely used techniques for handling imbalanced data in machine learning. It consists of removing samples from the majority class (undersampling) or adding more observations to the minority class (oversampling).

Let us first count how many samples belong to each output class.

# counting samples in each output class
class_count_0, class_count_1 = data['Class'].value_counts()
# printing the class counts
print(class_count_0)
print(class_count_1)

Output:

2581
492

In the next step, we will split the dataset into two variables, each containing a single output class.

# dataset containing only the majority class
class_0 = data[data['Class'] == 0]
# dataset containing only the minority class
class_1 = data[data['Class'] == 1]
# printing the shape of the classes
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

Output:

class 0: (2581, 32)
class 1: (492, 32)

Now, we have two datasets, each containing a different class as output.

The dataset is ready to apply different methods to handle the imbalanced data.

Method 1: Undersampling Technique

The random undersampling technique simply removes observations from the majority class at random. Observations are removed until the majority class is balanced with the minority one.

[Image: the random undersampling technique]

Now let us apply the undersampling method to our dataset and reduce the majority class to equalize it to the minority class.

# undersampling the majority class to the size of the minority class
class_0_under = class_0.sample(class_count_1)
# concatenating the undersampled majority class with the minority class
test_under = pd.concat([class_0_under, class_1], axis=0)
# printing the shape of each class in the balanced dataset
print('class 0:', test_under[test_under['Class'] == 0].shape)
print('class 1:', test_under[test_under['Class'] == 1].shape)

Output:

class 0: (492, 32)
class 1: (492, 32)

As you can see, now the data is perfectly balanced. Let us now visualize the output classes as well.

# plotting the bar chart of class counts
sns.countplot(data=test_under, x='Class')

Output:

[Image: bar chart of the class counts after undersampling]

As you can see, we have reduced the majority class and equalized it to the minority class by removing random data from the majority class.

Method 2: Oversampling Technique

The random oversampling technique simply adds copies of minority class samples (drawn with replacement) until the minority class matches the size of the majority class. This technique is useful when the minority class is small and the majority class is not enormous, since every minority sample ends up duplicated many times.

[Image: the random oversampling technique]

Let us now apply the technique in Python.

# counting samples in each class
class_count_0, class_count_1 = data['Class'].value_counts()
# separating the two classes
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]
# oversampling the minority class with replacement
class_1_over = class_1.sample(class_count_0, replace=True)
# concatenating the majority class with the oversampled minority class
test_over = pd.concat([class_1_over, class_0], axis=0)
# printing the shape of each class in the balanced dataset
print('class 0:', test_over[test_over['Class'] == 0].shape)
print('class 1:', test_over[test_over['Class'] == 1].shape)

Output:

class 0: (2581, 32)
class 1: (2581, 32)

As you can see, we have increased the minority class. Let us now visualize the output classes as well.

# plotting the bar chart of class counts
sns.countplot(data=test_over, x='Class')

Output:

[Image: bar chart of the class counts after oversampling]

As you can see, we have increased the minority class to make it equal to the majority class.

Method 3: Using Imblearn Module for Undersampling

Imbalanced-learn, imported as imblearn, is an open-source, MIT-licensed library that builds on scikit-learn and provides tools for classification with imbalanced classes. We can use this library for both oversampling and undersampling.

Let us first use this module for undersampling. For demonstration purposes, we will first create a random classification dataset and then use the imblearn module to undersample it.

# importing the modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
# creating a random classification dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1200, random_state=10)
# plotting a bar chart of the class counts
plt.bar([0, 1], [np.count_nonzero(y == 0), np.count_nonzero(y == 1)])

Output:

[Image: bar chart of the generated dataset's class counts]

Now let us apply the undersampling method and handle the imbalanced data.

# importing Counter and the random under-sampler
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
# initializing the random under-sampler
rus = RandomUnderSampler(random_state=42, replacement=True)
# resampling X and y
x_rus, y_rus = rus.fit_resample(X, y)
# checking the original and resampled class counts
print('original dataset shape:', Counter(y))
print('Resample dataset shape', Counter(y_rus))

Output:

original dataset shape: Counter({1: 1080, 0: 120})
Resample dataset shape Counter({0: 120, 1: 120})

As you can see, now the data is balanced.

Method 4: Using Imblearn Module for Oversampling

Similarly, we can use the imblearn module for oversampling. We will reuse the dataset created above.

# importing the random over-sampler
from imblearn.over_sampling import RandomOverSampler
# initializing the random over-sampler
ros = RandomOverSampler(random_state=0)
# fitting the data
x_ros, y_ros = ros.fit_resample(X, y)
print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

Output:

Original dataset shape Counter({1: 1080, 0: 120})
Resample dataset shape Counter({1: 1080, 0: 1080})

As you can see, the data is now oversampled and the classes are balanced.

Method 5: Using Tomek Links

Tomek links are pairs of instances that are very close to each other but belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which facilitates classification. Tomek's algorithm looks for such pairs and removes the majority instance of each one.
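Before using imblearn, it may help to see what counts as a Tomek link. The following is a rough sketch (not imblearn's implementation): two samples form a link if they are each other's nearest neighbor and carry different labels.

# a rough sketch of Tomek link detection, not imblearn's implementation
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_mask(X, y):
    # column 0 of the neighbor indices is the point itself, column 1 its nearest neighbor
    nearest = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X, return_distance=False)[:, 1]
    is_link = np.zeros(len(y), dtype=bool)
    for i, j in enumerate(nearest):
        # mutual nearest neighbors with different labels form a Tomek link
        if nearest[j] == i and y[i] != y[j]:
            is_link[i] = True
    return is_link

The majority-class member of each flagged pair is the one the undersampler removes.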

Let us apply the Tomek link method for undersampling.

# importing TomekLinks
from imblearn.under_sampling import TomekLinks
# initializing the Tomek links under-sampler
tl = TomekLinks(sampling_strategy='majority')
# fit predictor and target variable
x_tl, y_tl = tl.fit_resample(X, y)
print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_tl))

Output:

Original dataset shape Counter({1: 1080, 0: 120})
Resample dataset shape Counter({1: <slightly fewer than 1080>, 0: 120})

Note that Tomek links only remove the majority instances that sit in a link with a minority instance, so the result is a cleaner class boundary rather than a perfectly balanced dataset; the exact resampled count depends on how many pairs are found in the data.

Method 6: Synthetic Minority Oversampling Technique

The synthetic minority oversampling technique, better known as SMOTE, up-samples the minority class while avoiding the exact duplication (and resulting overfitting risk) of random oversampling. It does this by generating new synthetic examples close to existing minority-class points in the feature space.
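At its core, each synthetic point is a random interpolation between a minority sample and one of its nearest minority neighbors. Here is a minimal sketch of that single step (real SMOTE repeats it over the k nearest neighbors until the classes are balanced):

# a minimal sketch of SMOTE's interpolation step
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])    # a minority-class sample
x_nn = np.array([2.0, 3.0])   # one of its nearest minority-class neighbors
lam = rng.random()            # random weight in [0, 1)
# the synthetic sample lies on the line segment between the two points
x_new = x_i + lam * (x_nn - x_i)
print(x_new)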

Let us now apply the SMOTE method to handle imbalanced data in machine learning.

# importing SMOTE
from imblearn.over_sampling import SMOTE
# initializing the smote over-sampler
smote = SMOTE()
# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(X, y)
# printing the original and resampled class counts
print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_smote))

Output:

Original dataset shape Counter({1: 1080, 0: 120})
Resample dataset shape Counter({1: 1080, 0: 1080})

As you can see, the data is now balanced.

Method 7: NearMiss Technique

NearMiss is an under-sampling technique. Rather than eliminating majority class examples at random, it uses the distances between the classes: it keeps only those majority class examples that are closest to the minority class examples and discards the rest, until the class distribution is balanced.
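As a rough sketch of that selection rule (a simplification of NearMiss-1, not imblearn's code): rank each majority sample by its average distance to its k nearest minority samples and keep the closest ones.

# a simplified sketch of the NearMiss-1 selection rule
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearmiss_1(X_majority, X_minority, k=3):
    # average distance from each majority sample to its k nearest minority samples
    dist, _ = NearestNeighbors(n_neighbors=k).fit(X_minority).kneighbors(X_majority)
    avg_dist = dist.mean(axis=1)
    # keep the majority samples closest to the minority class, one per minority sample
    keep = np.argsort(avg_dist)[:len(X_minority)]
    return X_majority[keep]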

Let us now apply the NearMiss technique to handle imbalanced data in machine learning.

# importing the NearMiss under-sampler
from imblearn.under_sampling import NearMiss
# initializing nearmiss
nm = NearMiss()
# resampling the data
x_nm, y_nm = nm.fit_resample(X, y)
# printing the original and resampled class counts
print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_nm))

Output:

Original dataset shape: Counter({1: 1080, 0: 120})
Resample dataset shape: Counter({0: 120, 1: 120})

As you can see, the majority class has been reduced to the size of the minority class.

Summary

Balancing a dataset makes training a model easier because it helps prevent the model from becoming biased towards one class. In other words, the model will no longer favor the majority class just because it contains more data. So, it is important to have a balanced dataset.

In this article, we learned various methods through which we can balance the dataset. We covered 7 ways to handle imbalanced data in machine learning.
