Handling imbalanced data is a recurring concern when training machine learning models in Python. In this article, we will learn 7 different ways to handle imbalanced data in machine learning using Python: 1. random undersampling, 2. random oversampling, 3. undersampling with the imblearn module, 4. oversampling with the imblearn module, 5. Tomek links, 6. synthetic minority oversampling (SMOTE), and 7. NearMiss. Moreover, we will also learn why it is important to have balanced data when training a model.
Introduction to Imbalanced Data in Machine Learning
A dataset with skewed class proportions is called an imbalanced dataset. For example, if we have a dataset of dogs and cats with 10,000 images of dogs and only 100 images of cats, the data is considered imbalanced. Predictions on such datasets will be heavily biased because the model interacts far more with one class during training.

Why Do We Need Balanced Data in Machine Learning?
Imbalanced classifications pose a challenge for predictive modeling, as most machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class. This results in models with poor predictive performance, specifically for the minority class. That is a problem because, typically, the minority class is the more important one, so the problem is more sensitive to classification errors on the minority class than on the majority class.
Another issue is that, with so few positives relative to negatives, the model spends most of its training on negative examples and does not learn enough from the positive ones.
Also, consider a binary classification problem where 90% of the data belongs to one class and the remaining 10% to the other. A model that simply predicts the majority class every time will reach 90% accuracy without having learned anything useful.
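To make this concrete, here is a minimal sketch (assuming scikit-learn and a hypothetical 90/10 label array) showing that always predicting the majority class already scores 90% accuracy:

# a minimal sketch: accuracy of always predicting the majority class
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical labels: 900 samples of class 0, 100 samples of class 1
y_true = np.array([0] * 900 + [1] * 100)

# a "model" that ignores the input and always predicts class 0
y_pred = np.zeros_like(y_true)

# prints 0.9 even though the model never identifies a single positive
print(accuracy_score(y_true, y_pred))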
Different Methods to Handle Imbalanced Data in Machine Learning
In this section, we will look at various methods to handle imbalanced data in machine learning. For demonstration purposes, we will be using imbalanced data from Kaggle.
Let us first import the dataset and get familiar with it before applying different methods to handle imbalanced data in Machine learning.
# importing pandas
import pandas as pd

# loading the dataset
data = pd.read_csv('data.csv')

# viewing the first few rows
data.head()
Output:

As you can see, the dataset's target class has two output values. Let us plot a bar chart to see whether the output classes are balanced.
# importing the seaborn module
import seaborn as sns

# plotting the bar chart of class counts
sns.countplot(data=data, x='Class')
Output:

As you can see, the data is highly imbalanced. Now, let us jump into the different methods for handling it.
Resampling Techniques
Resampling is one of the most widely used techniques for handling imbalanced data in machine learning. It consists of removing samples from the majority class (undersampling) or adding more observations to the minority class (oversampling).
Let us first divide the dataset into different classes based on the output values.
# counting output classes
class_count_0, class_count_1 = data['Class'].value_counts()

# printing the counts
print(class_count_0)
print(class_count_1)
Output:
2581
492
In the next step, we will store the dataset in two different variables, each containing a single output class.
# dataset containing the majority class
class_0 = data[data['Class'] == 0]

# dataset containing the minority class
class_1 = data[data['Class'] == 1]

# printing the shape of the classes
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)
Output:
class 0: (2581, 32)
class 1: (492, 32)
Now, we have two datasets, each containing a different class as output.
The dataset is ready to apply different methods to handle the imbalanced data.
Method 1: Undersampling Technique
The random undersampling technique simply removes observations from the majority class at random until it is balanced with the minority class.

Now let us apply the undersampling method to our dataset and reduce the majority class to equalize it to the minority class.
# undersampling the majority class to the size of the minority class
class_0_under = class_0.sample(class_count_1)

# concatenating the undersampled majority class with the minority class
test_under = pd.concat([class_0_under, class_1], axis=0)

# separating the classes again to verify the result
class_0_under = test_under[test_under['Class'] == 0]
class_1_under = test_under[test_under['Class'] == 1]

# printing the shape of the classes
print('class 0:', class_0_under.shape)
print('class 1:', class_1_under.shape)
Output:
class 0: (492, 32)
class 1: (492, 32)
As you can see, now the data is perfectly balanced. Let us now visualize the output classes as well.
# plotting the bar chart of class counts
sns.countplot(data=test_under, x='Class')
Output:

As you can see, we have reduced the majority class and equalized it to the minority class by removing random data from the majority class.
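One practical note: because pd.concat simply stacks the two classes on top of each other, the rows of test_under are ordered by class. If you plan to train a model on it, it is usually a good idea to shuffle the rows first; a minimal sketch using pandas' sample:

# shuffling the balanced dataset so the classes are mixed
test_under = test_under.sample(frac=1, random_state=42).reset_index(drop=True)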
Method 2: Oversampling Technique
The oversampling technique simply adds copies of randomly selected minority-class observations until the minority class approaches the size of the majority class. This technique is useful when the minority class is small and the majority class is not extremely large.

Let us now apply the technique in Python.
# counting output classes
class_count_0, class_count_1 = data['Class'].value_counts()

# separating the classes
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]

# oversampling the minority class with replacement
class_1_over = class_1.sample(class_count_0, replace=True)

# concatenating the oversampled minority class with the majority class
test_over = pd.concat([class_1_over, class_0], axis=0)

# separating the classes again to verify the result
class_0_over = test_over[test_over['Class'] == 0]
class_1_over = test_over[test_over['Class'] == 1]

# printing the shape of the classes
print('class 0:', class_0_over.shape)
print('class 1:', class_1_over.shape)
Output:
class 0: (2581, 32)
class 1: (2581, 32)
As you can see, we have increased the minority class. Let us now visualize the output classes as well.
# plotting the bar chart of class counts
sns.countplot(data=test_over, x='Class')
Output:

As you can see, we have increased the minority class to make it equal to the majority class.
Method 3: Using Imblearn Module for Undersampling
Imbalanced-learn, also known as imblearn, is an open-source, MIT-licensed library that builds on scikit-learn and provides tools for classification with imbalanced classes. We can use this library for both oversampling and undersampling.
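Imblearn does not ship with Python; if it is not already installed, it can be installed from PyPI:

pip install imbalanced-learn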
Let us first use this module for undersampling. For demonstration purposes, we will create a random classification dataset and then use the imblearn module to undersample it.
# importing the modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# creating a random classification dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1200, random_state=10)

# plotting a bar chart of the class counts
plt.bar([0, 1], [np.count_nonzero(y == 0), np.count_nonzero(y == 1)])
plt.show()
Output:

Now let us apply the undersampling method and handle the imbalanced data.
# importing the random under sampler
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# initializing the random under sampler
rus = RandomUnderSampler(random_state=42, replacement=True)

# resampling X and y
x_rus, y_rus = rus.fit_resample(X, y)

# checking the original and resampled class counts
print('original dataset shape:', Counter(y))
print('Resample dataset shape', Counter(y_rus))
Output:
original dataset shape: Counter({1: 1080, 0: 120})
Resample dataset shape Counter({0: 120, 1: 120})
As you can see, now the data is balanced.
Method 4: Using Imblearn Module for Oversampling
Similarly, we can use the imblearn module for oversampling. We will use the same data that we created above.
# importing the random over sampler
from imblearn.over_sampling import RandomOverSampler

# initializing the random over sampler
ros = RandomOverSampler(random_state=0)

# resampling the data
x_ros, y_ros = ros.fit_resample(X, y)

# checking the original and resampled class counts
print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))
Output:
Original dataset shape Counter({1: 1080, 0: 120})
Resample dataset shape Counter({1: 1080, 0: 1080})
As you can see, the data is now oversampled and the classes are balanced.
Method 5: Using Tomek Links
Tomek links are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating the classification process. Tomek's algorithm looks for such pairs and removes the majority instance of each one.
Let us apply the Tomek link method for undersampling.
# importing the tomek links under sampler
from imblearn.under_sampling import TomekLinks

# initializing the model
tl = TomekLinks(sampling_strategy='majority')

# fitting predictor and target variable
x_tl, y_tl = tl.fit_resample(X, y)

# checking the original and resampled class counts
print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_tl))
Output:
Original dataset shape Counter({1: 1080, 0: 120})
Resample dataset shape Counter({1: 1080, 0: 120})
As you can see, Tomek links barely changes the class counts here: unlike random undersampling, it only removes majority-class samples that sit right on the class boundary, and our synthetic classes are well separated. The technique cleans the boundary rather than fully balancing the dataset.
Method 6: Synthetic Minority Oversampling Technique
The synthetic minority oversampling technique, better known as SMOTE, up-samples the minority class while reducing the risk of overfitting that comes with simply duplicating samples. It does this by generating new synthetic examples close to existing minority-class points in the feature space.
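Under the hood, SMOTE picks a minority-class point, chooses one of its nearest minority-class neighbors, and places a new synthetic point somewhere on the line segment between the two. Here is a minimal illustrative sketch of that interpolation (not the library's actual implementation, and the two points are made up):

# illustrative only: the interpolation at the heart of SMOTE
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])         # a minority-class sample (hypothetical)
x_neighbor = np.array([2.0, 3.0])  # one of its nearest minority-class neighbors

# the synthetic sample lies on the segment between the two points
gap = rng.random()
x_new = x_i + gap * (x_neighbor - x_i)
print(x_new)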
Let us now apply the SMOTE method to handle imbalanced data in machine learning.
# importing the smote over sampler
from imblearn.over_sampling import SMOTE

# initializing smote
smote = SMOTE()

# fitting predictor and target variable
x_smote, y_smote = smote.fit_resample(X, y)

# checking the original and resampled class counts
print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_smote))
Output:
Original dataset shape Counter({1: 1080, 0: 120})
Resample dataset shape Counter({1: 1080, 0: 1080})
As you can see, the data is now balanced.
Method 7: NearMiss Technique
NearMiss is an under-sampling technique. Rather than eliminating majority-class examples at random, it selects which majority-class examples to keep based on their distance to minority-class examples, removing the rest to balance the class distribution.
Let us now apply the NearMiss technique to handle imbalanced data in machine learning.
# importing the near miss under sampler
from imblearn.under_sampling import NearMiss

# initializing nearmiss
nm = NearMiss()

# resampling the data
x_nm, y_nm = nm.fit_resample(X, y)

# checking the original and resampled class counts
print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_nm))
Output:
Original dataset shape: Counter({1: 1080, 0: 120})
Resample dataset shape: Counter({0: 120, 1: 120})
As you can see, the majority class has been reduced to the size of the minority class.
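As a side note, imblearn's NearMiss supports three selection heuristics through its version parameter (1, 2, or 3); the example above used the default, version 1. A quick sketch of trying version 2 on the same data (assuming X, y, and Counter from the earlier cells):

# NearMiss-2 keeps the majority samples with the smallest average
# distance to the farthest minority samples
nm2 = NearMiss(version=2)
x_nm2, y_nm2 = nm2.fit_resample(X, y)
print('Resample dataset shape:', Counter(y_nm2))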
Summary
Balancing a dataset makes training a model easier because it helps prevent the model from becoming biased towards one class. In other words, the model will no longer favor the majority class just because it contains more data. So, it is important to have a balanced dataset.
In this article, we learned various methods through which we can balance the dataset. We covered 7 ways to handle imbalanced data in machine learning.