How to Plot Outliers in Python for Machine Learning Using Matplotlib

In this article, we discuss how to detect, handle, and plot outliers in Python for machine learning. Outliers are anomalies in a dataset: data points that deviate from the overall trend or from the other data points. Their presence can negatively affect both the training and the predictions of machine learning algorithms. We will use Python to implement the concepts covered.

Before we look at how to detect outliers, we assume that you already know the basics of machine learning algorithms and how to implement them in Python on various datasets.

What are Outliers in Machine Learning?

A data point that stands out from the others is called an outlier. Outliers can result from measurement mistakes or poor data collection, or they can reflect variables that were not taken into account during data gathering.

[Figure: scatter plot showing an outlier]

If we do not handle outliers, they can have a negative effect on the training and predictions of the model, because the model treats the outliers as normal data points and tries to fit them as well.

Any object that does not follow the trend is an anomaly or outlier. For example, if one person is riding a motorcycle on a footpath where everyone else is walking, that person is an outlier because he or she behaves differently from the other people.

Types of Outliers in Machine Learning

Based on their behavior, outliers can be divided into three main groups.

  • Global Outliers
  • Collective Outliers
  • Contextual Outliers

Global Outliers in Machine Learning

A data point is deemed a global outlier if its value lies far outside the range of the entire dataset in which it is found. The name echoes “global variables” in a computer program: the point is unusual with respect to the whole dataset, not just one part of it.

[Figure: global outliers]

Collective Outliers in Machine Learning

As the name implies, if a group of data points in a dataset considerably deviates from the rest of the dataset, they are said to be collective outliers. Individual data objects in this case might not act as outliers, but the group as a whole might. We may require baseline knowledge regarding the relationship between those data objects exhibiting the outlier behavior in order to discover these types of outliers.

[Figure: collective outliers]

Contextual Outliers in Machine Learning

Contextual outliers are also called conditional outliers. Here, a data point differs considerably from the other data points only under a particular condition or situation. A data point could exhibit typical behavior in one context but be an outlier in another: a temperature of 30 °C is normal in summer but unusual in winter. Therefore, in order to identify contextual outliers, the context must be included as part of the problem statement. Contextual outlier analysis gives users the freedom to look at outliers in various situations, which can be very useful in many applications. Each data point is described by contextual attributes (such as time or location), which define the context, and behavioral attributes, which are the values evaluated within that context.

[Figure: contextual outlier]
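As a minimal sketch of a contextual outlier, assume some hypothetical temperature readings labeled by season. The column names, the values, and the tolerance of 10 degrees are all made up for illustration. Each reading is compared with the median of its own season (the context) rather than with the whole dataset.

```python
import pandas as pd

# hypothetical temperature readings; "season" is the contextual attribute,
# "temp_c" is the behavioral attribute
readings = pd.DataFrame({
    "season": ["winter"] * 5 + ["summer"] * 5,
    "temp_c": [2, 4, 3, 30, 1, 29, 31, 30, 28, 32],
})

# median temperature of each season, aligned with the original rows
season_median = readings.groupby("season")["temp_c"].transform("median")

# assumed tolerance: flag readings more than 10 degrees away from their season's median
tolerance = 10
readings["contextual_outlier"] = (readings["temp_c"] - season_median).abs() > tolerance

flagged = readings[readings["contextual_outlier"]]
print(flagged)
```

The value 30 appears in both seasons, but it is flagged only in winter, where it is far from what is typical for that context.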

Plot Outliers in Python in Machine Learning

Python provides many amazing tools and modules for plotting which can help us to visualize the outliers through various plots.

For example, a box plot shows the distribution of the dataset and makes the outliers easy to spot.

We will first import the dataset which is about the price of houses. We will then use different visualization methods to visualize the outliers in the price of houses using Python.

import pandas as pd

# importing dataset 
data = pd.read_csv('house.csv')

# dropping the null values
data.dropna(inplace=True)

Now, let us first plot the box plot for the price of the houses.

# importing the module
import seaborn as sns


# box plot of the house prices
sns.boxplot(data=data['price'])

[Figure: box plot of house prices]

As you can see, all the points drawn beyond the whiskers of the box plot are outliers, which shows that the dataset contains many of them.
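To see which values a box plot marks as outliers, the sketch below uses synthetic prices, since the `house.csv` data is not reproduced here. All numbers are made up, and the plot is drawn with Matplotlib's own `boxplot` rather than seaborn, because its return value gives direct access to the plotted outlier points ("fliers").

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic prices: 200 typical values plus two deliberately extreme ones
rng = np.random.default_rng(0)
typical = rng.normal(loc=300_000, scale=50_000, size=200)
prices = np.append(typical, [900_000, 1_200_000])

fig, ax = plt.subplots()
box = ax.boxplot(prices)

# the points drawn beyond the whiskers are stored as the "fliers" of the box
fliers = box["fliers"][0].get_ydata()
print("points beyond the whiskers:", len(fliers))
print("largest flagged value:", fliers.max())
```

Besides the two extreme values, a few points from the tail of the typical prices may also be flagged, which is expected with the default whisker length.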

Another plot that helps to detect outliers is the violin plot, which shows the shape of the distribution as well as its spread.

# setting the theme of the violin plot
sns.set_theme(style="whitegrid")

# violin plot of the house prices
ax = sns.violinplot(x=data['price'])

[Figure: violin plot of house prices]

The long, thin tail shows that there are outliers in the dataset: only a few data points lie far from the bulk of the distribution.

Another simple way to detect outliers is to use scatter or line plots. Let us now visualize the outliers using a scatter plot.

# importing the module
import matplotlib.pyplot as plt

# scatter plot of price against row position
plt.scatter(range(len(data.price)), data.price, c='m')

[Figure: scatter plot of house prices]

As you can see, a few data points lie well above most of the others and appear to be outliers. Let us also plot the same information as a line plot, which can be useful for spotting outliers as well.

# importing the module
import matplotlib.pyplot as plt

# line plot of price against row position
plt.plot(range(len(data.price)), data.price, c='m')

[Figure: line plot of house prices]

As you can see, the tall spikes that do not follow the usual trend are outliers in our dataset.

How do Outliers Affect Machine Learning Models?

As we said earlier, outliers have a negative impact on the training and predictions of machine learning models. To understand this, we will use a simple linear regression model and see visually how outliers affect it.

Let us assume that we have the following data.

# sample data as [input, output] pairs
# (named points so that the house-price DataFrame in data is not overwritten)
points = [[10, 20], [20, 30], [30, 40], [40, 50], [50, 60]]

# importing pandas
import pandas as pd

# create the pandas DataFrame
df = pd.DataFrame(points, columns=['input', 'output'])

# first rows of the DataFrame
df.head()

[Figure: the resulting DataFrame]

Now, let us visualize the linear regression line.

# importing the numpy module
import numpy as np

# plotting the input
plt.plot(df.input, df.output, 'o')

# fitting a straight line (degree-1 polynomial) to the data
m, b = np.polyfit(df.input, df.output, 1)

# plotting the regression line
plt.plot(df.input, m*df.input + b)

[Figure: regression line fitted to the data]

Now, let us introduce one outlier in the dataset and visualize the regression line to see how it affects the best-fitted line.

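# importing the numpy module
import numpy as np

# adding one outlier to the data
# (an illustrative value; the exact point used for the figure is not stated)
df.loc[len(df)] = [60, 0]

# plotting the input
plt.plot(df.input, df.output, 'o')

# fitting a straight line to the data, now including the outlier
m, b = np.polyfit(df.input, df.output, 1)

# plotting the regression line
plt.plot(df.input, m*df.input + b)

[Figure: regression line after adding the outlier]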

As you can see, the regression line becomes horizontal because of the one outlier.
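The effect can also be checked numerically. The sketch below fits the same five points with and without one extra point. The outlier `(60, 0)` is an assumed value chosen for illustration, not necessarily the point used for the figure.

```python
import numpy as np

# the five points from the example above
x = np.array([10, 20, 30, 40, 50])
y = np.array([20, 30, 40, 50, 60])

# least-squares line without the outlier
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# the same data with one assumed outlier at (60, 0)
x_out = np.append(x, 60)
y_out = np.append(y, 0)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"without outlier: slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"with outlier:    slope={slope_out:.3f}, intercept={intercept_out:.3f}")
```

Without the outlier the slope is 1. With this particular outlier the slope drops to essentially 0, so one point out of six is enough to erase the trend.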

Different Ways to Detect Outliers in Machine Learning

As we have seen, outliers can affect a machine learning model, so it is better to handle them properly. But before we handle outliers, we need to know whether our dataset contains any. Fortunately, there are various algorithms that can help us detect outliers.

  • Isolation forest
  • Local outlier factor

Isolation Forest

Isolation Forest is an unsupervised learning approach that uses an ensemble of decision trees to isolate outliers. Each tree repeatedly selects a feature at random and then chooses a random split value between that feature's minimum and maximum values. Because anomalous points are few and different, this random partitioning isolates them in fewer splits, so they have shorter paths in the trees, which makes them stand out from the rest of the data.

[Figure: Isolation Forest]
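A minimal sketch of this idea with scikit-learn's `IsolationForest`, assuming scikit-learn is installed. The prices are synthetic, and the `contamination` value of 0.01 (the expected share of outliers) is an assumption made for this example, not a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic prices: 200 typical values plus one extreme value at the end
rng = np.random.default_rng(42)
typical = rng.normal(loc=300_000, scale=30_000, size=200)
prices = np.append(typical, [1_500_000]).reshape(-1, 1)  # one feature per row

# contamination is the assumed fraction of outliers in the data
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = model.fit_predict(prices)  # -1 marks outliers, 1 marks normal points

print("flagged values:", prices[labels == -1].ravel())
```

The extreme value is separated from the rest by almost any random split, so it ends up with a very short path and is flagged.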

Local Outlier Factor (LOF)

The local outlier factor (LOF) is an algorithm that identifies outliers based on their immediate surroundings. It compares the local density of each point with the density of its nearest neighbors, and a point whose density is much lower than that of its neighbors is flagged as an outlier. Because it works on local density, LOF performs well when the data density is not constant across the dataset.
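The following sketch uses scikit-learn's `LocalOutlierFactor` on synthetic two-dimensional data, assuming scikit-learn is installed. The two clusters, the isolated point, and the parameter values are all made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# a tight cluster, a loose cluster, and one point sitting just outside the tight one
dense = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
sparse = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
lonely = np.array([[3.0, 3.0]])
X = np.vstack([dense, sparse, lonely])

# n_neighbors and contamination are assumed values for this example
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)             # -1 marks outliers, 1 marks normal points
scores = lof.negative_outlier_factor_   # more negative means more anomalous

print("flagged points:")
print(X[labels == -1])
```

The isolated point is not extreme in absolute terms, since the loose cluster contains points much farther from the origin. It is flagged because its neighborhood is far less dense than the neighborhoods of its nearest neighbors.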

How to Handle Outliers in Machine Learning?

There are various techniques for handling outliers in machine learning. Some of them require a little statistical knowledge, because they rely on statistics to decide which values count as outliers.

So, we will discuss the following techniques:

  • Standard Deviation Method
  • Interquartile Range Method

Standard Deviation Method

To understand the Standard Deviation method, you should first know about the Gaussian distribution. A Gaussian (normal) distribution is one in which most of the data points lie around the mean, which coincides with the median. As we move away from the mean, the probability of observing data points decreases.

  • About 68% of the data lies within one standard deviation of the mean
  • About 95% of the data lies within two standard deviations of the mean
  • About 99.7% of the data lies within three standard deviations of the mean

So, any data point that is more than 3 standard deviations from the mean is considered an outlier.
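These percentages can be checked empirically. The sketch below draws samples from a standard normal distribution (synthetic data, with a sample size chosen arbitrarily) and measures the fraction that falls within one, two, and three standard deviations of the mean.

```python
import numpy as np

# synthetic samples from a standard normal distribution
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

sample_mean = samples.mean()
sample_std = samples.std()

# fraction of samples within k standard deviations of the mean
fractions = {
    k: float(np.mean(np.abs(samples - sample_mean) <= k * sample_std))
    for k in (1, 2, 3)
}

for k, fraction in fractions.items():
    print(f"within {k} standard deviation(s): {fraction:.3f}")
```

The printed fractions should come out close to 0.68, 0.95, and 0.997, which is why a cut-off of 3 standard deviations leaves only a tiny share of normally distributed data outside the range.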

Let us again use the dataset about the price of houses and find out the outliers in the price of houses. We will use 3 standard deviations as our cut-off point.

# importing modules
from numpy import mean
from numpy import std

# calculating mean and std
data_mean = mean(data.price)
data_std = std(data.price)

# cut-off distance: 3 standard deviations
cut_off = data_std * 3

# lower and upper bounds of the normal range
lower = data_mean - cut_off
upper = data_mean + cut_off

As you can see, we have specified the range for the normal values.

# identify outliers in the price
outliers = [x for x in data.price if x < lower or x > upper]

# printing outliers
print("total outliers are :", len(outliers))
total outliers are : 47

As you can see, the Standard deviation method has detected 47 outliers in our dataset. Now, let us draw a scatter plot of the dataset with the outliers highlighted.

# plotting whole dataset
plt.scatter(data.price.index, data.price)

# selecting the outliers with a boolean mask and plotting them in magenta
is_outlier = (data.price < lower) | (data.price > upper)
plt.scatter(data.price[is_outlier].index, data.price[is_outlier], c='m')

[Figure: scatter plot with the detected outliers highlighted]

As you can see, the purple points are the outliers detected by the standard deviation method. One of the simplest ways to handle outliers in machine learning is to remove them from our dataset, so let us remove the outliers from the dataset.

# using drop() to delete the rows whose price falls outside the normal range
data.drop(data.price[data.price < lower].index, inplace=True)
data.drop(data.price[data.price > upper].index, inplace=True)

As you can see, we have dropped the outliers and now our dataset only contains the normal values.
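The same removal can be written as a small reusable function. This is a sketch under stated assumptions: the function name `remove_outliers_std` and the tiny example DataFrame are hypothetical, and a boolean filter is used instead of `drop()` so that the original DataFrame is left unchanged.

```python
import pandas as pd

def remove_outliers_std(frame, column, n_std=3.0):
    """Return the rows whose value in `column` lies within n_std standard deviations of the mean."""
    col_mean = frame[column].mean()
    col_std = frame[column].std(ddof=0)  # ddof=0 matches the default of numpy.std
    lower = col_mean - n_std * col_std
    upper = col_mean + n_std * col_std
    return frame[frame[column].between(lower, upper)]

# hypothetical data: twenty ordinary prices and one extreme price
houses = pd.DataFrame({"price": list(range(100, 120)) + [10_000]})
cleaned = remove_outliers_std(houses, "price")

print("rows before:", len(houses))
print("rows after: ", len(cleaned))
```

Note that the standard deviation is itself inflated by the outliers, so on very small samples a 3 standard deviation cut-off may not flag anything.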

Interquartile Range Method

In real life, not every dataset is normally distributed, so we cannot apply the Standard deviation method everywhere. For data with a non-Gaussian distribution, we can use the Interquartile Range (IQR) method instead.

If you are familiar with the box plot, you have already seen the interquartile range. The IQR is the difference between the 75th and the 25th percentiles of the data and defines the box in a box-and-whisker plot. The whiskers extend up to 1.5 times the IQR beyond the box, so any value outside the whiskers is treated as an outlier.
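As a small worked example of this rule, the sketch below applies it to eight made-up values. The numbers are chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# made-up values with one obvious outlier
values = np.array([10, 12, 13, 14, 15, 16, 18, 100])

# 25th and 75th percentiles, and the interquartile range between them
q_25, q_75 = np.percentile(values, [25, 75])
iqr = q_75 - q_25

# values more than 1.5 * IQR outside the box count as outliers
lower = q_25 - 1.5 * iqr
upper = q_75 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("q_25:", q_25, "q_75:", q_75, "iqr:", iqr)
print("normal range:", lower, "to", upper)
print("outliers:", outliers)
```

Here the quartiles are 12.75 and 16.5, so the normal range runs from 7.125 to 22.125 and only the value 100 falls outside it.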

For example, see the box plot below:

# importing dataset 
data = pd.read_csv('house.csv')

# dropping the null values
data.dropna(inplace=True)

# box plot to visualize outliers
sns.boxplot(data=data['price'])

[Figure: box plot of house prices showing the outliers]

All the data points above the whiskers are outliers in our dataset.

Let us now calculate these outliers manually using the Interquartile Range Method.

# importing the module
from numpy import percentile

# calculating the 25th and 75th percentiles
q_25 = percentile(data.price, 25)
q_75 = percentile(data.price, 75)

# calculating the interquartile range
iqr = q_75 - q_25


# calculate the outlier cutoff
cut_off = iqr * 1.5

# lower range and upper range
lower = q_25 - cut_off
upper = q_75 + cut_off

# identify outliers
outliers = [x for x in data.price if x < lower or x > upper]

# printing
print("total outliers :", len(outliers))
total outliers : 218

As you can see, this method detects 218 outliers in our dataset. Let us now visualize these outliers.

# plotting whole dataset
plt.scatter(data.price.index, data.price)

# selecting the outliers with a boolean mask and plotting them in magenta
is_outlier = (data.price < lower) | (data.price > upper)
plt.scatter(data.price[is_outlier].index, data.price[is_outlier], c='m')

[Figure: scatter plot with the outliers detected by the Interquartile Range Method highlighted]

The purple points represent the outliers in our dataset.

We can simply remove these outliers in order to have a dataset without outliers. Let us remove the outliers.

# using drop() to delete the rows whose price falls outside the normal range
data.drop(data.price[data.price < lower].index, inplace=True)
data.drop(data.price[data.price > upper].index, inplace=True)

As you can see, we have removed the outliers from our dataset.

NOTE: You can get access to the source code and the dataset from my GitHub. Please don’t forget to follow and give me a star.

Summary

Outliers are data points that differ from the usual trends. The presence of such data points in the training dataset can affect the training process of the model, as the model will treat the outliers as normal values. However, there are various ways to handle outliers in machine learning. In this article, we covered some of the different ways to detect, handle, and plot outliers in Python for machine learning.
