The extra trees classifier is used for classification problems in machine learning, and the extra trees regressor is used for regression problems. Both work much like the random forest algorithm, so it is highly recommended that you go through the decision tree and random forest algorithms first in order to understand the concepts fully.
In simple words, the extra trees classifier and regressor build many decision trees. At each node they consider a random subset of features and pick the split values at random instead of searching for the best one. The trees are not pruned; the final prediction combines the predictions of all trees, by voting for classification and by averaging for regression. In this article, we will discuss how the extra trees classifier and regressor work. We will also explain the difference between the extra trees algorithm, decision trees, and random forests. Moreover, we will implement the extra trees classifier and regressor on classification and regression problems respectively.
How Does the Extra Trees Algorithm Work?
The extra trees algorithm is short for Extremely Randomized Trees. It generates predictive models for classification and regression problems. It is similar to random forests in that it combines many decision trees, but it adds extra randomness when the trees are built. Because it does not search for the best cut point at every node, it is often quicker to train than a random forest. As a result, it can be an effective tool for predictive modeling and data mining.
Like the random forest technique, the extra trees algorithm generates a large number of decision trees. By default, however, each tree is trained on the whole training set rather than on a bootstrap sample (a sample drawn with replacement). At each node, a predetermined number of features is randomly chosen from the entire set of features. The most significant and distinctive aspect of extra trees is how the splitting value is selected: rather than searching for the locally optimal cut point for each feature, the algorithm draws a split value at random for each candidate feature and then keeps the best of these random splits according to a criterion such as Gini or entropy. As a result, the trees are diverse and less correlated with one another.
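To make the random split idea concrete, here is a minimal sketch on a toy feature. It is not how scikit-learn implements the algorithm internally; the helper names (gini, weighted_gini, best_threshold, random_threshold) and the toy data are assumptions made up for illustration. It contrasts an exhaustive search for the best cut point with a single cut point drawn uniformly between the minimum and maximum of the feature.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)


def weighted_gini(x, y, threshold):
    """Weighted Gini impurity of the two children produced by x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size


def best_threshold(x, y):
    """Exhaustive search: try every midpoint between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    scores = [weighted_gini(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]


def random_threshold(x, rng):
    """Extra-trees style: draw one cut point uniformly between min and max."""
    return rng.uniform(x.min(), x.max())


# Toy feature and labels (hypothetical data for illustration only)
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print("exhaustive threshold:", t_best, "impurity:", weighted_gini(x, y, t_best))
print("random threshold:", t_rand, "impurity:", weighted_gini(x, y, t_rand))
```

The exhaustive search evaluates every candidate cut point, while the random version evaluates only one per feature. That saves work at each node, at the cost of individual splits that are usually less pure.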
Extra Trees vs Random Forest Algorithm
Although the extra trees algorithm is very similar to the random forest algorithm, the two differ in how the decision trees are constructed. The following are the main differences between the extra trees and random forest algorithms:
- By default, the extra trees algorithm trains each tree on the whole original dataset, while the random forest trains each tree on a bootstrap replica.
- The second difference is the selection of cut points to split the nodes. The random forest chooses the optimal cut point, while the extra trees algorithm selects it at random.
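Both differences are visible in scikit-learn's default parameters. The short check below only reads the defaults of the estimators; the variable names are chosen here for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

et = ExtraTreesClassifier()
rf = RandomForestClassifier()

# Difference 1: bootstrap sampling of the training set
print("ExtraTrees bootstrap:", et.bootstrap)    # False: whole dataset per tree
print("RandomForest bootstrap:", rf.bootstrap)  # True: bootstrap replica per tree

# Difference 2: how a single tree picks its cut points
single_extra_tree = ExtraTreeClassifier()
single_decision_tree = DecisionTreeClassifier()
print("ExtraTree splitter:", single_extra_tree.splitter)        # "random"
print("DecisionTree splitter:", single_decision_tree.splitter)  # "best"
```

Setting bootstrap=True on an extra trees model is possible, so the first difference is a default rather than a hard rule.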
Why Choose the Extra Trees Classifier and Regressor Over the Random Forest?
Here are some of the features that can make the extra trees algorithm preferable.
- By default, the extra trees algorithm trains every tree on the whole original sample of data instead of on bootstrap samples.
- It chooses the split values randomly, which makes the trees less correlated and reduces the variance of the ensemble.
- It is usually faster to train than the random forest algorithm, as it does not spend time searching for the optimal cut point at each node.
- The added randomness lowers the risk of overfitting because it reduces variance, although it can slightly increase bias.
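The speed claim is easy to check on your own data. The sketch below times the fit of both models on a synthetic dataset. The dataset size, the number of trees, and the helper name fit_seconds are arbitrary assumptions, and the measured times depend on the machine, so no expected output is shown.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic classification data (sizes chosen arbitrarily for illustration)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def fit_seconds(model):
    """Return the wall-clock time needed to fit the given model on X, y."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


et_time = fit_seconds(ExtraTreesClassifier(n_estimators=50, random_state=0))
rf_time = fit_seconds(RandomForestClassifier(n_estimators=50, random_state=0))

print(f"extra trees fit time:   {et_time:.3f} s")
print(f"random forest fit time: {rf_time:.3f} s")
```

A single timing is noisy, so repeat the measurement a few times before drawing conclusions.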
Extra Trees Classifier Using Python
Now we will use the extra trees classifier to predict the flower type. In this section, we will use the well-known iris dataset, which contains information about three different types of flowers. The data is available in the sklearn.datasets submodule, so we just need to load it from there.
Let us first load the dataset.
    # importing dataset
    from sklearn.datasets import load_iris

    # loading the data
    data = load_iris()
Training Extra Trees Classifier Using Python
Before training the extra trees classifier on the given dataset, we have to split the dataset into testing and training parts so that we can use the testing data later to evaluate the model. We will also set random_state to 1 so that the split is the same on every run.
    # splitting the data into inputs and outputs
    Input, output = load_iris(return_X_y=True)

    # importing the module
    from sklearn.model_selection import train_test_split

    # splitting the dataset
    X_train, X_test, y_train, y_test = train_test_split(Input, output, test_size=0.25, random_state=1)
As you can see, we have assigned 25% of the total dataset to the testing part, and the remaining 75% to the training dataset.
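As a quick sanity check, the shapes of the two parts can be printed. This self-contained sketch repeats the split with the same parameters; the names X and y are used here instead of Input and output.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# The iris dataset has 150 samples with 4 features each
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```

With 150 samples, the 25% test part is rounded up to 38 samples, which leaves 112 samples for training.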
Let us now initialize the extra trees classifier and train the model on the training dataset.
    # importing the module
    from sklearn.ensemble import ExtraTreesClassifier

    # initializing the model
    extra_classifier = ExtraTreesClassifier()

    # training the model
    extra_classifier.fit(X_train, y_train)
Once the training is complete, we can then use the model to predict the output class using the testing dataset.
    # making predictions
    y_pred = extra_classifier.predict(X_test)
Now we have the predictions, but we don't know how good they are. In order to evaluate the model, we will use several evaluation metrics.
Evaluating the Extra Trees Classifier
We will use the confusion matrix to evaluate the performance of the extra trees classifier. A simple way to understand the confusion matrix is that every value on the main diagonal counts correct classifications, while every value off the diagonal counts misclassifications.
    # importing seaborn
    import seaborn as sns

    # making the confusion matrix
    from sklearn.metrics import confusion_matrix

    # providing actual and predicted values
    cm = confusion_matrix(y_test, y_pred)

    # annot=True writes the data value in each cell
    sns.heatmap(cm, annot=True)
As you can see, only two values have been incorrectly classified by the model while the rest have been correctly classified.
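Because the diagonal holds the correct classifications, the accuracy can be read directly from the confusion matrix. The sketch below retrains a model so that it runs on its own. It fixes random_state=0 on the classifier, which the article's code does not do, so the exact counts may differ from the ones reported above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# random_state on the model is an assumption added for reproducibility
model = ExtraTreesClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
correct = np.trace(cm)  # sum of the main diagonal
total = cm.sum()        # number of test samples

print("correct:", correct, "of", total)
print("accuracy from matrix:", correct / total)
print("accuracy_score:      ", accuracy_score(y_test, y_pred))
```

The last two lines print the same number, since accuracy is the share of test samples that land on the diagonal.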
Let us also calculate the classification report of the model, which contains the accuracy, precision, recall, and f1-score. All of these metrics can be calculated from the confusion matrix.
    # importing the classification report
    from sklearn.metrics import classification_report

    # printing the classification report
    print(classification_report(y_test, y_pred))
As you can see, we get an accuracy score of 95% which means only 5% of the testing data has been incorrectly classified while the rest has been correctly classified by the model.
Extra Trees Classifier vs Random Forest Classifier
Let us now train the random forest classifier and extra trees classifier on the same dataset with default parameter values and see which one will perform better.
First, we will initialize the random forest classifier and make predictions.
    # import Random Forest classifier
    from sklearn.ensemble import RandomForestClassifier

    # instantiate the classifier
    random_classifier = RandomForestClassifier()

    # fit the model
    random_classifier.fit(X_train, y_train)

    # testing the model
    random_pred = random_classifier.predict(X_test)
As you can see, we have trained the random forest model and then made predictions. Let us now calculate the accuracies of both models.
    # importing accuracy score
    from sklearn.metrics import accuracy_score

    # accuracy score
    print("Accuracy of extra trees algorithm: ", accuracy_score(y_test, y_pred))
    print("Accuracy of random forest algorithm: ", accuracy_score(y_test, random_pred))

    Accuracy of extra trees algorithm: 0.9473
    Accuracy of random forest algorithm: 0.9736
As you can see, on the given dataset the random forest classifier has performed slightly better than the extra trees classifier. With 38 test samples, the gap corresponds to a single additional correct prediction, and because neither model fixes its random_state, the scores can change from run to run.
Extra Trees Classifier vs Decision Trees Classifier
Let us now use the decision trees classifier model and train it on the same training dataset and make predictions to compare it with the extra trees classifier.
    # importing decision tree algorithm
    from sklearn.tree import DecisionTreeClassifier

    # default parameters (the default split criterion is Gini)
    decision_classifier = DecisionTreeClassifier()

    # providing the training dataset
    decision_classifier.fit(X_train, y_train)

    # making predictions
    decision_pred = decision_classifier.predict(X_test)
Once the training is complete, let us calculate the accuracy score and compare it with the extra trees classifier.
    # accuracy score
    print("Accuracy of extra trees algorithm: ", accuracy_score(y_test, y_pred))
    print("Accuracy of decision tree algorithm: ", accuracy_score(y_test, decision_pred))

    Accuracy of extra trees algorithm: 0.9473
    Accuracy of decision tree algorithm: 0.97
As you can see, the decision tree classifier also scored slightly higher than the extra trees classifier on the given dataset. On a test set this small, the difference again comes down to about one sample.
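Since a single train/test split of 38 samples can swing by one prediction, a more stable comparison is to average the accuracy over several splits. The following is a minimal sketch using 5-fold cross-validation. The fold count and the fixed random_state values are assumptions made here and are not part of the experiment above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# random_state is fixed only so that repeated runs give the same numbers
models = {
    "extra trees": ExtraTreesClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, model in models.items():
    # accuracy on each of the 5 held-out folds
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the mean scores are within one standard deviation of each other, the models are hard to tell apart on this dataset.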
Extra Trees Regressor Using Python
Now let us use the extra trees regressor on a regression dataset. This time, we will use a dataset about Bitcoin. Let us first import the dataset and print a few rows.
    # importing pandas
    import pandas as pd

    # reading data file
    data = pd.read_csv("BTC-USD.csv")

    # reading the head
    data.head()
As you can see, there are a number of columns, and we don't need all of them. We will use the opening and closing prices to predict the Volume value. So, let us remove all other columns.
    # dropping the columns we do not need
    data.drop(columns=["Date", "High", "Low", "Adj Close"], inplace=True)
Our data is now ready, so let us move on to training the extra trees regressor.
Training the Extra Trees Regressor Model
Before going to the training of the model, let us first split the dataset into testing and training parts.
    # splitting the data into inputs and outputs
    Input = data.drop("Volume", axis=1)
    output = data['Volume']

    # importing the module
    from sklearn.model_selection import train_test_split

    # splitting the dataset
    X_train, X_test, y_train, y_test = train_test_split(Input, output, test_size=0.25)
Once the splitting is complete, we can then initialize the extra trees regressor and train the model on the training dataset.
    # importing the module
    from sklearn.ensemble import ExtraTreesRegressor

    # initializing the model
    regressor = ExtraTreesRegressor()

    # training the model
    regressor.fit(X_train, y_train)
Once the training is complete, we can then use the testing dataset to make predictions using the trained model.
    # making predictions
    y_pred = regressor.predict(X_test)
As you can see, our model has made the predictions, but we don't know how good they are. So, let us jump into the evaluation part.
Evaluating the Extra Trees Regressor Model
Let us first visualize the actual and predicted values using a line graph.
    # importing module
    import matplotlib.pyplot as plt

    # setting the size of the plot
    plt.figure(figsize=(15, 8))

    # plotting actual and predicted values against the sample index
    x_axis = range(len(y_test))
    plt.plot(x_axis, y_test, color='green', label="Actual values")
    plt.plot(x_axis, y_pred, color='red', label="Predicted values")

    # showing the plot
    plt.legend()
    plt.show()
As you can see, the green line shows the actual values while the red line shows the predicted values. Let us also calculate the R-square score of the model.
    # importing the required module
    from sklearn.metrics import r2_score

    # evaluating model performance
    print('R-square score is :', r2_score(y_test, y_pred))

    R-square score is : 0.1221059
As you can see, we get an R-square score of 0.122, which means the model explains only about 12% of the variance in Volume. The opening and closing prices alone are weak predictors of the trading volume.
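To see what the R-square score measures, it helps to compute it by hand. It compares the squared errors of the model with the squared errors of a baseline that always predicts the mean. The sketch below uses four made-up values (not taken from the Bitcoin dataset) and checks the result against r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Residual sum of squares: errors of the model
ss_res = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0.25 + 0 + 1 = 1.5
# Total sum of squares: errors of always predicting the mean (6.0)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20

r2_manual = 1 - ss_res / ss_tot
print("manual R-square: ", r2_manual)
print("sklearn R-square:", r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, a score of 0 means the model is no better than predicting the mean, and a negative score means it is worse than that baseline.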
Extra Trees Regressor vs Random Forest Regressor
Let us now train the random forest regressor and evaluate it to compare the results with the extra trees regressor.
First, we need to initialize the random forest regressor, then train the model, and finally make predictions.
    # importing the random forest regressor
    from sklearn.ensemble import RandomForestRegressor

    # initializing the model
    rf_regressor = RandomForestRegressor()

    # training the model
    rf_regressor.fit(X_train, y_train)

    # making predictions
    random_pred = rf_regressor.predict(X_test)
Once the model has completed the predictions, we can then compare the R-square score of the random forest regressor with that of the extra trees regressor.
    # evaluating model performance
    print('R-square score of extra trees is :', r2_score(y_test, y_pred))
    print('R-square score of random forest is :', r2_score(y_test, random_pred))

    R-square score of extra trees is : 0.12210
    R-square score of random forest is : 0.1152
As you can see, the extra tree regressor performed better than the random forest regressor on the given dataset.
Extra Trees Regressor vs Decision Trees
Now, we will compare the results of the extra trees regressor with the decision trees. Let us first initialize the decision tree regressor and then train the model to make predictions.
    # importing the decision tree regressor
    from sklearn.tree import DecisionTreeRegressor

    # initializing the model
    dt_regressor = DecisionTreeRegressor()

    # training the model
    dt_regressor.fit(X_train, y_train)

    # making predictions
    decision_pred = dt_regressor.predict(X_test)
Once the training and prediction are complete, we can compare the R-square scores.
    # evaluating model performance
    print('R-square score of extra trees is :', r2_score(y_test, y_pred))
    print('R-square score of decision tree is :', r2_score(y_test, decision_pred))

    R-square score of extra trees is : 0.12210
    R-square score of decision tree is : -0.449
As you can see, the extra trees regressor performed better than the decision tree. The negative R-square score means that the single decision tree did worse than simply predicting the mean Volume for every sample.
NOTE: You can access the source code and dataset from my GitHub account.
The extra trees algorithm is short for extremely randomized trees. It is similar to the random forest algorithm, but in extra trees the split values of the nodes are drawn at random instead of being optimized. The extra trees algorithm can be used for classification and regression problems. In this article, we discussed how we can use the extra trees algorithm for classification and regression problems. Moreover, we compared the results with the random forest and decision tree algorithms.