Sklearn Feature Selection With Examples

Sklearn feature selection is the process of selecting the features of a dataset that are most relevant to the target output. Feature selection is very useful as it reduces the dimensionality of the data, improves the performance of the model, and makes the dataset less complicated. In this article, we will learn about sklearn feature selection with examples. We will create a sample classification dataset and use sklearn feature selection methods to pick out the most important features.

Introduction to Sklearn Feature Selection

Sklearn provides various methods for feature selection. In this section, we will go through these sklearn feature selection methods one by one with examples. We will learn about:

  1. Univariate feature selection
  2. Using models as feature selectors
  3. The sequential method for feature selection
  4. The recursive feature elimination (RFE) method
  5. Feature engineering with selection

Feature Selection Using the Univariate Method in Sklearn

Univariate feature selection returns the most relevant features from a large set of features based on the statistical relationship between each individual feature and the target. In this method, each feature from the data is evaluated separately, the features are ranked based on statistical tests, and only the top-ranked features are selected.

Some of the common statistical tests are listed below, followed by a short sketch of how their scores can be computed:

  1. F-test
  2. Chi-Squared test
  3. Mutual information
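
If you want to see these scores before plugging them into a selector, sklearn exposes the score functions themselves (f_classif, mutual_info_classif, chi2). Below is a minimal sketch that simply computes the per-feature scores on a small made-up dataset; the dataset and variable names here are only for illustration:

# minimal sketch: computing the univariate score of each feature directly
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

# small illustrative dataset (not the one used later in this article)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# ANOVA F-test score and p-value for every feature
f_scores, p_values = f_classif(X_demo, y_demo)

# mutual information between every feature and the target
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print("F-test scores:", f_scores)
print("Mutual information scores:", mi_scores)
# chi2 from sklearn.feature_selection works the same way,
# but it only accepts non-negative feature values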

In sklearn, the SelectKBest class is used to select the k best features using univariate statistical tests. Here is the basic syntax for using SelectKBest for feature selection:

# importing the required modules
from sklearn.feature_selection import SelectKBest, f_regression

# your dataset goes here
X = ...
y = ...

# Create the selector object - we select the 10 best features
# (f_regression suits a regression target; for classification use f_classif or chi2)
selector = SelectKBest(f_regression, k=10)

# sklearn feature selection
X_new = selector.fit_transform(X, y)

# Get the names of the selected features (assuming X is a pandas DataFrame)
selected_feature_names = [X.columns[i] for i in selector.get_support(indices=True)]

Now let us create a sample dataset and then we will use the same method to extract the features.

# importing the required modules
from sklearn.datasets import make_classification
import numpy as np

# setting a random seed for reproducibility
np.random.seed(37)

# creating a classification dataset
def get_classification_data():
    return make_classification(**{
        'n_samples': 2000,
        'n_features': 20,
        'n_informative': 2,
        'n_redundant': 2,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 2,
        'random_state': 37
    })

# generating the classification dataset
x, y = get_classification_data()

Once we have created the dataset, it is time to apply the SelectKBest method and select features with the univariate approach. As you can see, there are 20 features in total in our dataset, and we will select the 10 most relevant ones. Because this is a classification problem, we use the f_classif score function instead of f_regression.

# importing the required modules
from sklearn.feature_selection import SelectKBest, f_classif

# Create the selector object - we have selected 10 features
selector = SelectKBest(f_classif, k=10)

# sklearn feature selection
X_new = selector.fit_transform(x, y)

# comparing the shapes of the original and reduced data
print("Shape of original data: ",x.shape)
print("Shape of extracted data: ",X_new.shape)

Output:

Shape of original data:  (2000, 20)
Shape of extracted data:  (2000, 10)

As you can see, the dataset now contains only 10 features because we kept just the 10 most relevant ones.
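
If you also want to know which of the original columns were kept, the fitted selector exposes get_support() along with the per-feature scores. Here is a short follow-up sketch that uses the selector object fitted above:

# indices of the columns that SelectKBest kept
selected_indices = selector.get_support(indices=True)
print("Selected feature indices:", selected_indices)

# univariate score of every feature (higher means more relevant)
print("Feature scores:", selector.scores_)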

Using Different Models for Sklearn Feature Selection

In sklearn, several models can be used for feature selection. In this section, we will discuss two commonly used approaches: the wrapper method and the embedded method.

The first approach is the wrapper method. Similar to other sklearn feature selectors, it returns the most relevant features from the dataset. To use the wrapper method, we import RFE from sklearn's feature selection module. The basic syntax is given below; we will use a random forest classifier as the underlying model:

#importing the modules
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Load the data
X = ...
y = ...

# Creating the model
model = RandomForestClassifier()

# Create the RFE object and fit it to the data
rfe = RFE(model, n_features_to_select=10)
X_new = rfe.fit_transform(X, y)

# Get the names of the selected features (assuming X is a pandas DataFrame)
selected_feature_names = [X.columns[i] for i in rfe.get_support(indices=True)]

Now let us apply the same method to our dataset and select the top 10 most relevant features.

# importing the required modules
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Create the model
model = RandomForestClassifier()

# RFE in sklearn feature selector
rfe = RFE(model, n_features_to_select=10)
X_new = rfe.fit_transform(x, y)

# comparing the shapes of the original and reduced data
print("Shape of original data: ",x.shape)
print("Shape of extracted data: ",X_new.shape)

Output:

Shape of original data:  (2000, 20)
Shape of extracted data:  (2000, 10)

As you can see, the number of features in the dataset has been reduced to 10.
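
RFE also records how it ranked every feature: after fitting, the ranking_ attribute assigns rank 1 to the selected features, while larger ranks belong to features that were eliminated earlier. A short sketch using the rfe object fitted above:

# rank 1 = selected; features with larger ranks were eliminated earlier
print("Feature ranking:", rfe.ranking_)
print("Selected feature indices:", rfe.get_support(indices=True))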

Another approach is the embedded method. It involves training a machine learning model on all the features and then selecting the important ones from the fitted model. Some models in sklearn expose a feature_importances_ attribute, which can be used to rank the features by importance. For example, in the syntax below, we use a random forest and its feature_importances_ to select the 10 most relevant features.

# importing random forest
from sklearn.ensemble import RandomForestClassifier

# Load the data
X = ...
y = ...

# Create the model and fit it to the data
model = RandomForestClassifier()
model.fit(X, y)

# Get the feature importances
importances = model.feature_importances_

# Sort the features by importance
sorted_indexes = importances.argsort()[::-1]

# Select the top 10 features (X here is assumed to be a NumPy array;
# for a pandas DataFrame use X.iloc[:, sorted_indexes[:10]] instead)
X_new = X[:, sorted_indexes[:10]]

# Get the names of the selected features (assuming X is a pandas DataFrame)
selected_feature_names = [X.columns[i] for i in sorted_indexes[:10]]

As you can see, we first sorted the feature indices by importance and then kept the 10 most important features. Here is the implementation on our dataset:

from sklearn.ensemble import RandomForestClassifier

# Create the model and fit it to the data
model = RandomForestClassifier()
model.fit(x, y)

# Get the feature importances
importances = model.feature_importances_

# Sort the features by importance
sorted_indexes = importances.argsort()[::-1]

# Select the top 10 features
X_new = x[:, sorted_indexes[:10]]

# comparing the shapes of the original and reduced data
print("Shape of original data: ",x.shape)
print("Shape of extracted data: ",X_new.shape)

Output:

Shape of original data:  (2000, 20)
Shape of extracted data:  (2000, 10)

As you can see, there are only 10 features left.

Similarly, there are several other utilities available in sklearn's feature selection module that can be used to extract the most relevant features from a dataset; one of them is sketched below.
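
One such utility is SelectFromModel, sklearn's meta-transformer for the embedded approach shown above: it fits any estimator that exposes coef_ or feature_importances_ and keeps only the features whose importance clears a threshold. A minimal sketch on our dataset, using the median importance as the threshold:

# embedded selection with SelectFromModel: keep the features whose importance
# is above the median importance across all features
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(RandomForestClassifier(random_state=37), threshold="median")
X_sfm = sfm.fit_transform(x, y)

print("Shape of original data: ", x.shape)
print("Shape of selected data: ", X_sfm.shape)  # roughly half of the 20 features remain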

Sequential Method for Sklearn Feature Selection

Like the other sklearn feature selectors, the sequential method identifies the most relevant features. It fits a model on the given dataset and then iteratively adds (forward selection) or removes (backward selection) features based on cross-validated model performance.

In sklearn, the SequentialFeatureSelector class performs this sequential search for the most relevant features. Let us now apply the sequential method to our dataset.

# Importing the required modules
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# the model used to evaluate each candidate feature subset
knn_model = KNeighborsClassifier(n_neighbors=3)

# number of features to select
sfs = SequentialFeatureSelector(knn_model, n_features_to_select=10)

# fitting the selector
sfs.fit(x, y)

# transforming the data to keep only the selected features
print(sfs.transform(x).shape)

Output:

(2000, 10)

As you can see, only 10 features are left in the transformed dataset.
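
By default, SequentialFeatureSelector adds features one at a time (forward selection); its direction parameter also supports backward elimination, which starts from all 20 features and removes them one per step. A small sketch of the backward variant on the same dataset (it is slower because it starts from the full feature set):

# backward sequential selection: start with every feature and drop one per step
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

sfs_backward = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=10,
    direction="backward",
)
sfs_backward.fit(x, y)
print(sfs_backward.transform(x).shape)  # (2000, 10)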

Recursive Feature Elimination Method for Feature Selection

Recursive Feature Elimination, also known as RFE, is a feature selection process that recursively removes the least relevant features. It repeatedly fits the model and prunes the weakest features, judged by the model's coefficients or feature importances, until the desired number of features remains.

Let us take an example and apply feature selection using RFE. We will use an SVM model as the estimator.

# Importing the modules
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Define the model
model = SVC(kernel='linear')

# Create the RFE object and fit it to the data
selector = RFE(model, n_features_to_select=10, verbose=2)

# fitting the selector and transforming the data
X_new = selector.fit_transform(x, y)

# comparing the shapes of the original and reduced data
print("Shape of original data: ",x.shape)
print("Shape of extracted data: ",X_new.shape)

Output:

Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Shape of original data:  (2000, 20)
Shape of extracted data:  (2000, 10)

As you can see, after many iterations, the RFE algorithm was able to reduce the number of features in the dataset.
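
If you are unsure how many features to keep, sklearn also provides RFECV, a variant of RFE that chooses the number of features by cross-validation instead of requiring it up front. A minimal sketch on the same dataset (it takes longer because each step is cross-validated):

# RFECV chooses the number of features via cross-validated model performance
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

rfecv = RFECV(SVC(kernel='linear'), min_features_to_select=5, cv=5)
rfecv.fit(x, y)

print("Optimal number of features:", rfecv.n_features_)
print("Shape of reduced data:", rfecv.transform(x).shape)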

Feature Engineering With Selection in Sklearn

Feature engineering is the process of creating new features from the existing ones or transforming the existing features to get a more accurate model. This process also helps to identify the most relevant features from the dataset.

In sklearn, we can use pipelines to combine feature engineering and feature selection steps. Let us create a pipeline that scales the data, selects the most relevant features, and fits a model.

# importing the required modules
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline


# Define the steps in the pipeline
steps = [
    ('scaler', StandardScaler()),  # Feature scaling
    ('selector', SelectKBest(f_classif, k=10)),  # Feature selection
    ('model', SVC(kernel='linear'))  # Model fitting
]

# Create the pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the data
pipeline.fit(x, y)

Output:

Pipeline(steps=[('scaler', StandardScaler()),
                ('selector', SelectKBest(k=10)),
                ('model', SVC(kernel='linear'))])

As you can see, we have created a pipeline that scales the data, selects the 10 most relevant features, and then fits the model.
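
Because the selector lives inside the pipeline, you can still inspect which features it kept after fitting. Here is a short follow-up sketch using the pipeline object above; the printed accuracy is computed on the training data, so treat it as a sanity check rather than a real evaluation:

# which columns the pipeline's SelectKBest step kept
selector_step = pipeline.named_steps['selector']
print("Selected feature indices:", selector_step.get_support(indices=True))

# training-set accuracy of the full fitted pipeline (sanity check only)
print("Training accuracy:", pipeline.score(x, y))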

Summary

In this article, we learned about various sklearn feature selectors with examples. We discussed five different feature selection methods available in sklearn and implemented each of them in Python to find the most relevant features.
