Random Forest Using Python | Classification, Regression, and Evaluation

This article shows how to implement the Random Forest algorithm in Python for both classification and regression, and how to evaluate the resulting models.

Machine Learning (ML) is a method of data analysis that automates analytical model building. There are various types of Machine Learning, one of which is Supervised Machine Learning, in which the model is trained on historical data to make predictions. The Random Forest Algorithm is a type of Supervised Machine Learning algorithm that builds decision trees on different samples and takes their majority vote for classification and average in case of regression. In this article, we will discuss Random forests in more detail. We will solve both classification and regression problems using a random forest algorithm.

Before starting with the Random Forest algorithm, make sure that you have a solid understanding of decision trees, because a random forest is built from many of them.

What is a Random Forest?

The Random Forest Algorithm is an ensemble learning method. Ensemble learning is a general meta-approach in Machine Learning that seeks better predictive performance by combining the predictions from multiple models. In simple words, it involves fitting several models and combining their predictions, for example by voting or averaging. The Random Forest Algorithm combines the predictions of many decision trees into a single final prediction.

We can define Random Forest as a model that trains a number of decision trees on various subsets of the given dataset and aggregates their outputs to improve predictive accuracy. Instead of relying on one decision tree, the algorithm takes the prediction from each tree and forecasts the final output from the majority vote of those predictions for classification, or from their average for regression.
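
The two aggregation rules can be illustrated with a few lines of NumPy. This is only a sketch: the per-tree predictions below are made-up numbers, and the helper name majority_vote is hypothetical rather than part of any library.

```python
import numpy as np

# Hypothetical class predictions from five trees for three samples
# (one row per tree, one column per sample).
tree_class_votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
])


def majority_vote(votes):
    """Return the most frequent class label in each column."""
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Classification: the most common label per sample wins.
class_prediction = majority_vote(tree_class_votes)
print(class_prediction)  # [1 0 1]

# Hypothetical numeric predictions from three trees for two samples.
tree_value_predictions = np.array([
    [200.0, 310.0],
    [220.0, 290.0],
    [210.0, 300.0],
])

# Regression: the final prediction is the mean over the trees.
value_prediction = tree_value_predictions.mean(axis=0)
print(value_prediction)  # [210. 300.]
```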


Advantages of Random Forest Algorithm

There are many advantages of using the Random Forest Algorithm. Some of these are listed below:

  • It is relatively fast to train, because the trees are independent of each other and can be built in parallel.
  • It often makes accurate predictions with little hyperparameter tuning.
  • It scales to large datasets while maintaining good accuracy.
  • It can often maintain accuracy when part of the data is missing, although whether missing values are handled directly depends on the implementation.
  • It is less prone to overfitting than a single decision tree, because averaging the predictions of many trees reduces variance.
  • The algorithm can be used in both classification and regression problems.
  • We can get the relative feature importance using Random Forest Algorithm, which helps in selecting the most contributing features for the classifier.
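
As a rough illustration of the last point, scikit-learn exposes the relative importances through the feature_importances_ attribute of a fitted forest. The sketch below uses a synthetic dataset from make_classification, not the article's data, and the variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): 4 features, 2 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# One score per feature; the scores are normalized so that they sum to 1.
importances = forest.feature_importances_
for index, score in enumerate(importances):
    print(f"feature {index}: {score:.3f}")
```

Features with a higher score contributed more to the splits made by the trees.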

How Does Random Forest Work?

The Random Forest Algorithm builds different decision trees on randomly selected samples of the dataset and combines their predictions by majority voting (or by averaging, in the case of regression). The Random Forest Algorithm consists of the following steps:

  • Random data selection – the algorithm draws random samples (with replacement) from the provided dataset.
  • Building decision trees – the algorithm creates a decision tree for each selected sample.
  • Get a prediction result from each of the created decision trees.
  • Perform voting for every predicted result.
  • Select the most voted prediction result as the final prediction.
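
The steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is a simplified illustration on synthetic data, not how RandomForestClassifier is implemented internally; the number of trees (25) and all variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)

trees = []
for seed in range(25):
    # Step 1: draw a random sample of rows with replacement.
    rows = rng.integers(0, len(X_toy), size=len(X_toy))
    # Step 2: build a decision tree on that sample; max_features="sqrt"
    # makes each split consider only a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X_toy[rows], y_toy[rows])
    trees.append(tree)

# Step 3: get a prediction from each tree (one row per tree).
all_votes = np.array([tree.predict(X_toy) for tree in trees]).astype(int)

# Steps 4 and 5: count the votes per sample and keep the most voted label.
final_prediction = np.array([np.bincount(column).argmax() for column in all_votes.T])

print("Training accuracy of the hand-built forest:", (final_prediction == y_toy).mean())
```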

Solving a Classification Problem Using Random Forest Using Python

Now, we will use random forests to make predictions for a classification dataset. A dataset having categorical values as output is known as a classification dataset. You can access the dataset and the source code from my GitHub link. Here we will explain each part of the code in detail.

Before going to the implementation part, make sure that you have installed the following Python modules as we will be using them for the implementation.

  • scikit-learn (imported as sklearn)
  • pandas
  • NumPy
  • matplotlib
  • seaborn
  • plotly

You can use the pip command to install the required modules on your system, for example: pip install scikit-learn pandas numpy matplotlib seaborn plotly
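
If you are not sure whether everything is installed, a small check like the following can help. It is only a sketch: the helper name missing_modules is hypothetical, and the list uses the import names of the modules, which can differ from the pip package names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names of the modules used in this article.
required = ["sklearn", "pandas", "numpy", "matplotlib", "seaborn", "plotly"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(required))
```

An empty list means that all of the modules can be imported.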

Importing and Exploring the Dataset

Let us first import the dataset and print a few rows using the pandas module.

# importing pandas
import pandas as pd

# importing dataset
data = pd.read_csv('dataset.csv')

# heading of dataset
data.head()

Output:

[Image: first five rows of the dataset]

As you can see, we have two input columns and one output column. The input columns are about the age and salary of a person and the target class is whether that person purchased the product or not.
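
Beyond head(), a few other pandas calls give a quick overview of a dataset like this one. The sketch below builds a tiny stand-in DataFrame, because the real file is not reproduced here; the column names Age, Salary and Purchased follow the description above, but the values are invented.

```python
import pandas as pd

# Tiny stand-in for the real dataset (values are made up for illustration).
demo = pd.DataFrame({
    "Age": [19, 35, 26, 47, 52],
    "Salary": [19000, 20000, 43000, 25000, 90000],
    "Purchased": [0, 0, 0, 1, 1],
})

print(demo.shape)           # number of rows and columns
print(demo.dtypes)          # data type of each column
print(demo.isnull().sum())  # number of missing values per column
print(demo.describe())      # summary statistics of the numeric columns
```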

Let us now see the total number of people who purchased and who didn’t by plotting a bar plot.

# importing the required module for data visualization
import plotly.graph_objects as go

# counting the rows in each class of the Purchased column
# (value_counts() sorts by count, so the larger class comes first)
target_balance = data['Purchased'].value_counts()

# creating one bar per output class
target_class = go.Bar(
    name = 'Target Balance',
    x = ['Not-Purchased', 'Purchased'],
    y = target_balance.values
)

# plotting the output classes
fig = go.Figure(target_class)
fig.show()

Output:

[Figure: bar plot of the number of people who did not purchase and who purchased the item]

As you can see, nearly 180 people didn’t purchase the item and nearly 145 people purchased it.

Now, let us also use the box plot to find the distribution of the input variables. A box plot is a simple way of representing statistical data in which a rectangle is drawn from the first quartile to the third quartile, with a line inside to indicate the median value. Lines called whiskers extend from either side of the rectangle towards the smallest and largest values, and points beyond the whiskers are drawn individually as outliers.
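
The quartiles that a box plot draws can also be computed directly. The following sketch uses NumPy on a made-up list of ages (not taken from the dataset) to show the numbers behind the rectangle.

```python
import numpy as np

# Hypothetical ages, used only to illustrate the calculation.
ages = np.array([22, 25, 29, 31, 35, 38, 41, 47, 52, 60])

# First quartile, median and third quartile.
q1, median, q3 = np.percentile(ages, [25, 50, 75])

# The interquartile range is the height of the box.
iqr = q3 - q1

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
```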

The two box plots below show the distribution of age for the people who purchased the item and for those who didn’t.

# importing the plotly express module
import plotly.express as px

# plotting box plots of age, one per target class
px.box(data, y="Age", color="Purchased")

Output:

[Figure: box plots of age for each target class]

As you can see, older people are more likely to buy the item than younger people.

Let us also plot the box plot of the salary based on the target variable.

# importing the plotly express module
import plotly.express as px

# plotting box plots of salary, one per target class
px.box(data, y="Salary", color="Purchased")

Output:

[Figure: box plots of salary for each target class]

We can clearly see that the median salary of people who purchased the item is much higher than the median salary of people who didn’t.

Splitting the Dataset

Before splitting the dataset into testing and training parts, we will divide the dataset into input and output values.

# dividing the dataset
X = data.drop('Purchased', axis=1)
y = data['Purchased']

Now, we will split the dataset into testing and training parts. We will also assign a value of 0 to the random state so that the split is reproducible.

# importing the train_test_split method from sklearn
from sklearn.model_selection import train_test_split

# splitting the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
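
To see what test_size and random_state do, the following sketch runs the same split twice on a small synthetic array. The data and the variable names are placeholders, not the article's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with two columns, plus a binary target.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# Same arguments as in the article: 30% for testing, fixed random state.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

# Splitting again with the same random state returns the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

print(X_tr.shape, X_te.shape)       # (70, 2) (30, 2)
print(np.array_equal(X_te, X_te2))  # True
```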

Once the splitting is complete, we can then go to the training of the model.

Training and Testing the Model

Let us first import the random forest classifier from the sklearn module and then train it on the training dataset.

# importing the random forest classifier
from sklearn.ensemble import RandomForestClassifier

# instantiating the classifier with default settings
classifier = RandomForestClassifier()

# fitting the classifier on the training data
classifier.fit(X_train, y_train)

Once the training is complete, we can then move to make predictions by using the testing dataset.

# making predictions on the testing data
y_pred = classifier.predict(X_test)
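
Besides predict, the classifier also offers predict_proba, which returns the class probabilities averaged over the trees. In scikit-learn, the predicted label is the class with the highest averaged probability. The sketch below checks this on synthetic data; it does not use the article's dataset, and the variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
probabilities = forest.predict_proba(X_demo[:5])
labels = forest.predict(X_demo[:5])

print(probabilities)
print(labels)

# The predicted label is the class with the highest probability.
labels_from_probabilities = forest.classes_[probabilities.argmax(axis=1)]
print(np.array_equal(labels, labels_from_probabilities))  # True
```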

The predictions are now stored in a variable named y_pred, but we don’t know how accurate these predictions are. So, we will use various evaluation metrics to evaluate the performance of the random forest classifier.

Evaluating Random Forest Classifier

Let us first use the confusion matrix to see how accurate the predictions are. Every value in the main diagonal of the confusion matrix shows the correctly classified categories.

# importing seaborn
import seaborn as sns

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix

# providing actual and predicted values
cm = confusion_matrix(y_test, y_pred)

# plotting the matrix; annot=True writes the count in each cell
sns.heatmap(cm, annot=True, fmt='d')

Output:

[Figure: heatmap of the confusion matrix]

As you can see, most of the values are in the main diagonal which shows the correct classifications.
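
The cells of a binary confusion matrix can also be turned into metrics by hand. The following sketch uses ten made-up labels, not the model's real predictions, to show how accuracy, precision and recall follow from the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for ten samples.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 0, 1, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() returns the cells in this order.
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # correct among predicted positives
recall = tp / (tp + fn)                     # found among actual positives

print(cm_demo)
print(accuracy, precision, recall)  # 0.8 0.8 0.8
```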

Now, we will also print out the accuracy score of the model.

# importing accuracy score
from sklearn.metrics import accuracy_score

# printing the accuracy score
accuracy_score(y_test,y_pred)

Output:

[Image: accuracy score output]

This shows that our model was able to classify 82% of the testing data correctly.

Random Forest Using Python For Regression Dataset

Now, we will use the random forest to predict a continuous target, which is a regression problem. We will use a dataset about house prices based on the number of rooms, floors, area, and location. You can download the dataset and the Jupyter Notebook from my GitHub link.

Importing and Exploring a Dataset

Let us now use the pandas module to import the dataset and print out a few rows.

# importing the pandas module
import pandas as pd

# importing dataset
dataset = pd.read_csv('house.csv')

# printing
dataset.head()

Output:

[Image: first five rows of the house dataset]

As you can see, there are null values, so let us first remove those null values.

# removing null values
dataset.dropna(inplace=True)
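
Before dropping rows, it can be useful to check how many values are missing and how many rows will be lost. The sketch below shows this on a tiny made-up DataFrame; the column names and values are placeholders, not the real house dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in DataFrame with two missing values (made up for illustration).
houses = pd.DataFrame({
    "rooms": [3, np.nan, 4, 2],
    "price": [250000, 180000, np.nan, 120000],
})

# Number of missing values in each column.
print(houses.isnull().sum())

# dropna() removes every row that contains at least one missing value.
cleaned = houses.dropna()
print(len(houses), "->", len(cleaned))  # 4 -> 2
```

If too many rows would be lost this way, filling the missing values is an alternative worth considering.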

Now, we will plot a three-dimensional plot of the location vs. the prices of the houses.

# importing the module
import plotly.express as px

# creating 3-d graph
fig = px.scatter_3d(
    dataset, x='latitude', y='longitude', z='price', color=dataset['price'],
)
fig.show()

Output:

[Figure: 3D scatter plot of latitude, longitude, and price]

As you can see, there is a trend in the price and the price is high at a specific location.

Splitting Dataset

Now, we will split the dataset. First, we will divide the dataset into input values and the target variable.

# dividing the dataset
X = dataset.drop('price', axis=1)
y = dataset['price']

We will now split the dataset into testing and training parts so that we can use the training part to train the model and then the testing part to evaluate the performance of the model. We will use sklearn module to split the dataset.

# importing the train_test_split method from sklearn
from sklearn.model_selection import train_test_split
# splitting the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

As you can see, we have assigned 30% of the data to the testing and 70% to the training part.

Training and Testing the Regressor

Let us now import the random forest regressor from the sklearn module and train the model.

# importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# instantiating the regressor with default settings
regressor = RandomForestRegressor()

# fitting the regressor on the training data
regressor.fit(X_train, y_train)

Once the training is complete, we can then use the testing data to make predictions.

# making predictions on the testing data
y_pred = regressor.predict(X_test)
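
For regression, the prediction of the forest is the average of the predictions of its individual trees. The sketch below verifies this on synthetic data through the estimators_ attribute of a fitted RandomForestRegressor; the data and the variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration).
X_reg, y_reg = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_reg, y_reg)

# Prediction of every single tree for the first five samples (one row per tree).
per_tree = np.array([tree.predict(X_reg[:5]) for tree in forest.estimators_])

# Averaging over the trees reproduces the prediction of the forest.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_reg[:5])

print(np.allclose(manual_average, forest_prediction))  # True
```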

Now, we can go to the evaluation part to evaluate the model’s predictions.

Evaluation of Random Forest Regressor

Let us first plot the actual values and predicted values to see how close they are. This time, we will use the matplotlib module to plot the graph.

# importing the module
import matplotlib.pyplot as plt

# fitting the size of the plot
plt.figure(figsize=(20, 8))

# plotting the actual and predicted values against the sample position
plt.plot(range(len(y_test)), y_test, label="Actual values")
plt.plot(range(len(y_test)), y_pred, label="Predicted values")

# showing the plotting
plt.legend()
plt.show()

Output:

[Figure: line plot of actual values vs. predicted values]

As you can see, the predictions appear to follow the actual values, but a plot alone does not tell us how large the errors are.

Let us also find the R-squared (R²) score of the model.

# Importing the required module
from sklearn.metrics import r2_score

# Evaluating the model
print('R2 score is :', r2_score(y_test, y_pred))

Output:

0.313

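As you can see, we get an R² score of 0.313, which means that the model explains roughly 31% of the variance in house prices on the testing data. A score of 1 would indicate perfect predictions, so there is still room for improvement.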
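
To make the meaning of this score more concrete, R² can be computed by hand as 1 minus the ratio of the squared prediction errors to the squared deviations from the mean. The sketch below uses four made-up values, not the house dataset, and compares the result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
actual = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

# Sum of squared prediction errors.
ss_res = np.sum((actual - predicted) ** 2)

# Sum of squared deviations of the actual values from their mean.
ss_tot = np.sum((actual - actual.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)
print(np.isclose(manual_r2, r2_score(actual, predicted)))  # True
```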

Summary

Random Forest is a commonly-used Machine Learning algorithm that combines the output of multiple decision trees to reach a single result. In this article, we learned how to use random forest on regression and classification datasets and evaluate the models.
