How do you apply Random Forest using Python? This tutorial walks through implementing the random forest algorithm in Python.
Machine Learning (ML) is a method of data analysis that automates analytical model building. There are various types of Machine Learning, one of which is Supervised Machine Learning, in which the model is trained on historical data to make predictions. The Random Forest Algorithm is a type of Supervised Machine Learning algorithm that builds decision trees on different samples and takes their majority vote for classification and average in case of regression. In this article, we will discuss Random forests in more detail. We will solve both classification and regression problems using a random forest algorithm.
Before starting with the Random Forest Algorithm, make sure that you have a solid understanding of decision trees, because a random forest is built from them.
What is a Random Forest?
The Random Forest Algorithm is an example of ensemble learning. Ensemble learning is a general meta-approach in Machine Learning that seeks better predictive performance by combining the predictions from multiple models. In simple words, it involves fitting several models and then combining their predictions, for example by voting or averaging. So, the Random Forest Algorithm combines the predictions from many decision trees into a single, final prediction.
We can define Random Forest as a model that fits a number of decision trees on various subsets of the given dataset and combines their outputs to improve predictive accuracy. Instead of relying on one decision tree, the algorithm takes the prediction from each tree and forecasts the final output based on the majority vote (for classification) or the average (for regression) of those predictions.
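The two ways of combining tree outputs can be sketched with a few lines of NumPy. The arrays below are made-up tree predictions, used only for illustration. They are not produced by a trained model.

```python
import numpy as np

# Hypothetical class predictions from five trees for three samples
# (one row per tree, one column per sample).
tree_class_votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
])

def majority_vote(votes):
    """Return the most frequent class label in each column."""
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Classification: each sample gets the label that most trees agreed on.
forest_classes = majority_vote(tree_class_votes)

# Hypothetical regression outputs from three trees for two samples.
tree_values = np.array([
    [10.0, 20.0],
    [12.0, 18.0],
    [14.0, 22.0],
])

# Regression: each sample gets the mean of the tree outputs.
forest_values = tree_values.mean(axis=0)

print(forest_classes)  # [1 0 1]
print(forest_values)   # [12. 20.]
```

Note that scikit-learn's random forest classifier averages the class probabilities predicted by the trees rather than counting hard votes. The two approaches usually agree, but not always.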

Advantages of Random Forest Algorithm
There are many advantages of using the Random Forest Algorithm. Some of these are listed below:
- Its trees are independent of one another and can be trained in parallel, which can keep training time manageable.
- It usually makes accurate predictions and runs efficiently.
- It can predict with high accuracy even on large datasets.
- It can often maintain accuracy when part of the data is missing, although, depending on the implementation, missing values may first need to be removed or imputed.
- It is less prone to overfitting than a single decision tree because averaging the predictions of many trees reduces the variance of the final prediction.
- The algorithm can be used in both classification and regression problems.
- We can get the relative feature importance using Random Forest Algorithm, which helps in selecting the most contributing features for the classifier.
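As a minimal sketch of the last point, the snippet below reads the feature_importances_ attribute of a fitted scikit-learn forest. The data is synthetic and the feature names f0 to f3 are hypothetical. They only stand in for a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 4 features, of which only the first 2 are informative
# (shuffle=False keeps the informative columns first).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# feature_importances_ holds one non-negative score per feature; the scores sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Features with a score close to zero contribute little to the forest's splits and are candidates for removal.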
How Does Random Forest Work?
The Random Forest Algorithm builds different decision trees on randomly selected samples of the dataset and combines their predictions by majority voting. The Random Forest Algorithm consists of the following steps:
- Random data selection – the algorithm selects random samples from the provided dataset.
- Building decision trees – the algorithm creates a decision tree for each selected sample.
- Get a prediction result from each of the created decision trees.
- Perform voting for every predicted result.
- Select the most voted prediction result as the final prediction.
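The steps above can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not scikit-learn's internal implementation. The number of trees and the use of max_features="sqrt" are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset with binary labels 0 and 1.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25  # assumption for this sketch

trees = []
for _ in range(n_trees):
    # Step 1: draw a bootstrap sample (row indices sampled with replacement).
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Step 2: fit one decision tree on that sample; max_features="sqrt" makes
    # every split consider only a random subset of the features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Step 3: get a prediction from every tree (shape: n_trees x n_samples).
all_votes = np.array([tree.predict(X_demo) for tree in trees])

# Steps 4 and 5: count the votes per sample and keep the most voted label.
final_pred = np.array([np.bincount(col).argmax() for col in all_votes.T])

print("training accuracy of the hand-built forest:", (final_pred == y_demo).mean())
```

In practice, RandomForestClassifier performs all of these steps internally, so the loop is only useful for understanding the idea.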
Solving a Classification Problem Using Random Forest Using Python
Now, we will use a random forest to make predictions on a classification dataset. A dataset having categorical values as output is known as a classification dataset. You can access the dataset and the source code from my GitHub link. Here, we will explain each part of the code in detail.
Before going to the implementation part, make sure that you have installed the following Python modules as we will be using them for the implementation.
- sklearn
- pandas
- NumPy
- matplotlib
- seaborn
- plotly
You can use the pip command to install the required modules on your system. Note that sklearn is published on PyPI under the name scikit-learn.
Importing and Exploring the Dataset
Let us first import the dataset and print a few rows using the pandas module.
# importing pandas
import pandas as pd
# importing dataset
data = pd.read_csv('dataset.csv')
# heading of dataset
data.head()
Output:

As you can see, we have two input columns and one output column. The input columns are about the age and salary of a person and the target class is whether that person purchased the product or not.
Let us now see the total number of people who purchased and who didn’t by plotting a bar plot.
# importing the required modules for data visualization
import plotly.graph_objects as go
import plotly.offline as pyoff
# counting the rows in each class of the Purchased column
# (sort_index orders the classes as 0, 1; this assumes 0 = not purchased, 1 = purchased)
target_balance = data['Purchased'].value_counts().sort_index()
# building one bar per output class
target_class = go.Bar(
    name = 'Target Balance',
    x = ['Not-Purchased', 'Purchased'],
    y = target_balance.values
)
# plotting the output classes
fig = go.Figure(target_class)
pyoff.iplot(fig)
Output:

As you can see, nearly 180 people didn’t purchase the item and nearly 145 people purchased it.
Now, let us also use box plots to look at the distribution of the input variables. A box plot is a simple way of representing statistical data in which a rectangle is drawn from the first quartile to the third quartile, with a line inside to indicate the median value. Lines (whiskers) extend from either side of the rectangle to show the spread of the remaining data.
For example, below are two box plots showing the age distribution of the people who purchased the item and of those who didn’t.
# Importing the plotly module
import plotly.express as px
# plotting box graph (age by purchase outcome)
px.box(data, y="Age", color="Purchased")
Output:

As you can see, older people are more likely to buy the item than younger people.
Let us also plot the box plot of the salary based on the target variable.
# Importing the plotly module
import plotly.express as px
# plotting box graph (salary by purchase outcome)
px.box(data, y="Salary", color="Purchased")
Output:

We can clearly see that the median salary of people who purchased the item is much higher than the median salary of people who didn’t.
Splitting the Dataset
Before splitting the dataset into testing and training parts, we will divide the dataset into input and output values.
# dividing the dataset
X = data.drop('Purchased', axis=1)
y = data['Purchased']
Now, we will split the dataset into testing and training parts. We will also assign a value of 0 to the random state.
# importing the train_test_split method from sklearn
from sklearn.model_selection import train_test_split
# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
Once the splitting is complete, we can then go to the training of the model.
Training and Testing the Model
Let us first import the random forest classifier from the sklearn module and then train it on the training dataset.
# importing the random forest classifier
from sklearn.ensemble import RandomForestClassifier
# creating the classifier with default hyperparameters
classifier = RandomForestClassifier()
# training the classifier on the training data
classifier.fit(X_train, y_train)
Once the training is complete, we can then move to make predictions by using the testing dataset.
# making predictions on the testing data
y_pred = classifier.predict(X_test)
The predictions are now stored in a variable named y_pred, but we don’t know how accurate these predictions are. So, we will use various evaluation metrics to evaluate the performance of the random forest classifier.
Evaluating Random Forest Classifier
Let us first use the confusion matrix to see how accurate the predictions are. The values on the main diagonal of the confusion matrix count the correctly classified samples of each class.
# importing seaborn
import seaborn as sns
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
# providing actual and predicted values
cm = confusion_matrix(y_test, y_pred)
# If True, write the data value in each cell
sns.heatmap(cm,annot=True)
Output:

As you can see, most of the values are in the main diagonal which shows the correct classifications.
Now, we will also print out the accuracy score of the model.
# importing accuracy score
from sklearn.metrics import accuracy_score
# printing the accuracy score
accuracy_score(y_test,y_pred)
Output:

This shows that our model was able to classify 82% of the testing data correctly.
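The accuracy score and the confusion matrix are directly related: accuracy is the sum of the main diagonal divided by the sum of all cells. The short sketch below checks this on made-up labels. These labels are hypothetical and are not the article's test data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, used only to illustrate the arithmetic.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Correct predictions sit on the main diagonal, so
# accuracy = (sum of the diagonal) / (sum of all cells).
accuracy_from_cm = np.trace(cm_demo) / cm_demo.sum()

print(cm_demo)
print(accuracy_from_cm)                          # 0.8
print(accuracy_score(y_true_demo, y_pred_demo))  # 0.8
```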
Random Forest Using Python For Regression Dataset
Now, we will use a random forest to predict a continuous target, which is a regression problem. We will use a dataset of house prices with features such as the number of rooms, floors, area, and location. You can download the dataset and the Jupyter Notebook from my GitHub link.
Importing and Exploring a Dataset
Let us now use the pandas module to import the dataset and print out a few rows.
# importing the pandas module
import pandas as pd
# importing dataset
dataset = pd.read_csv('house.csv')
# printing
dataset.head()
Output:

As you can see, there are null values, so let us first remove those null values.
# removing null values
dataset.dropna(inplace=True)
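Before dropping rows, it can help to check how many values are missing in each column, since dropna removes every row that contains at least one missing value. The sketch below uses a tiny made-up frame. Its column names and values are assumptions and are not taken from house.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a housing table; names and values are assumptions.
demo = pd.DataFrame({
    "rooms": [3, 4, np.nan, 2],
    "area": [120.0, np.nan, 95.0, 60.0],
    "price": [250000, 320000, 210000, 150000],
})

# Count the missing values per column before deciding how to handle them.
print(demo.isna().sum())

# dropna() removes every row that has at least one missing value.
cleaned = demo.dropna()
print(f"rows before: {len(demo)}, rows after: {len(cleaned)}")
```

If too many rows would be lost this way, imputing the missing values is a common alternative.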
Now, we will draw a three-dimensional plot of the location vs. the prices of the houses.
# importing the module
import plotly.express as px
# creating 3-d graph
fig = px.scatter_3d(
    dataset, x='latitude', y='longitude', z='price', color='price',
)
fig.show()
Output:

As you can see, the price follows a trend and is highest around a specific location.
Splitting Dataset
Now, we will split the dataset. First, we will divide the dataset into input values and the target variable.
# dividing the dataset
X = dataset.drop('price', axis=1)
y = dataset['price']
We will now split the dataset into testing and training parts so that we can use the training part to train the model and then the testing part to evaluate the performance of the model. We will use the sklearn module to split the dataset.
# importing the train_test_split method from sklearn
from sklearn.model_selection import train_test_split
# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
As you can see, we have assigned 30% of the data to the testing and 70% to the training part.
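A quick way to confirm the split proportions, and to see what random_state does, is to run train_test_split on a small synthetic array. The arrays below are placeholders and are not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)
# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

print(len(X_tr), len(X_te))         # 70 30
print(np.array_equal(X_te, X_te2))  # True
```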
Training and Testing the Regressor
Let us now import the random forest regressor from the sklearn module and train the model.
# importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor
# creating the regressor with default hyperparameters
regressor = RandomForestRegressor()
# training the regressor on the training data
regressor.fit(X_train, y_train)
Once the training is complete, we can then use the testing data to make predictions.
# making predictions on the testing data
y_pred = regressor.predict(X_test)
Now, we can go to the evaluation part to evaluate the model’s predictions.
Evaluation of Random Forest Regressor
Let us first plot the actual values and predicted values to see how close they are. This time, we will use the matplotlib module to plot the graph.
# importing the module
import matplotlib.pyplot as plt
# setting the size of the plot
plt.figure(figsize=(20, 8))
# plotting the graphs
plt.plot(range(len(y_test)), y_test.values, label="Actual values")
plt.plot(range(len(y_test)), y_pred, label="Predicted values")
# showing the plotting
plt.legend()
plt.show()
Output:

As you can see, the predictions seem to be close enough to the actual values.
Let us also find the R-squared (R²) score of the model.
# Importing the required module
from sklearn.metrics import r2_score
# Evaluating the model
print('R-squared score is :', r2_score(y_test, y_pred))
Output:
0.313
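As you can see, we get an R-squared value of 0.313, which means that the model explains roughly 31% of the variance in house prices on the testing data.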
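To make the metric less of a black box, the R-squared score can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up for illustration and are not the house price predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: squared errors of the model.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total sum of squares: squared errors of always predicting the mean.
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                           # 0.925
print(r2_score(y_true_demo, y_pred_demo))  # 0.925
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the target.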
Summary
Random Forest is a commonly used Machine Learning algorithm that combines the output of multiple decision trees to reach a single result. In this article, we learned how to use random forest on regression and classification datasets and evaluate the models.