One-hot encoding is the process of converting categorical values in a dataset into numeric values so that a machine learning model can work with them. The sklearn one hot encoder (the OneHotEncoder class) is the scikit-learn tool that performs this conversion. This step is part of data preprocessing. In this article, we will learn how to use the sklearn one hot encoder to convert categorical values to numeric values by working through several examples. By the end of this article, you will learn:
- What is the sklearn one hot encoder and why is encoding important in machine learning?
- How to use sklearn one hot encoder to encode categorical values?
- How to use sklearn one hot encoder to encode multiple columns?
What is Sklearn Module?
Sklearn, also known as scikit-learn, is one of the most widely used libraries for machine learning in Python. It contains many efficient tools for machine learning and statistical modeling, including classification (KNN, SVM, decision trees), regression (linear regression, random forest), anomaly detection (isolation forest), clustering (k-means clustering), and dimensionality reduction (PCA). It is built on top of Python's numerical and scientific libraries, NumPy and SciPy.
More importantly for this article, it has various tools for data preprocessing, including data splitting, data encoding, and many more. In this article, we will focus on only one encoding method, which is one-hot encoding.
How Does Sklearn One Hot Encoder Work?
As we discussed earlier, the sklearn one hot encoder converts categorical values into numeric values. The one-hot encoding technique creates one additional feature for each unique value in the categorical feature. Each new feature is a binary column that marks whether a row belongs to that category. This is why one-hot encoding is also known as the process of creating dummy variables.
For example, let us assume that we have two categorical values (Male and Female). When we apply the sklearn one hot encoder, it will create two new columns, one for Male and one for Female. If the person is male, the Male column gets the value 1 and the Female column gets 0. If the person is female, the Female column gets 1 and the Male column gets 0.

As you can see, the sklearn one hot encoder will create new columns depending on the number of categories and fill the columns with ones and zeros. Now, these values are easy for machine learning algorithms to interpret.
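The Male/Female case described above can be sketched in a few lines. The toy array below is a hypothetical stand-in for a real column, not the article's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with two categories (one value per row)
gender = np.array([["Male"], ["Female"], ["Female"], ["Male"]])

# fit_transform returns a sparse matrix by default, so convert it to a dense array
encoder = OneHotEncoder()
encoded = encoder.fit_transform(gender).toarray()

# Categories are stored in sorted order, so column 0 is Female and column 1 is Male
categories = encoder.categories_[0].tolist()
print(categories)
print(encoded)
```

Each row contains exactly one 1, placed in the column that matches its category.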
Why Use Sklearn One Hot Encoder?
There are two main reasons to use the sklearn one hot encoder to convert categorical values to numeric values. To understand the first reason, let us import a dataset that has categorical values in the output column:
# importing pandas
import pandas as pd

# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')

# heading of data
data.head()
Output:

As you can see, the Marrige_Status column has categorical values. If we try to train a machine learning model on this data as it is, we will typically get an error because the model expects numeric values. For example, let us apply the xgboost algorithm to the given dataset.
# dividing the dataset
X = data.drop('Marrige_Status', axis=1)
y = data['Marrige_Status']

# importing the xgboost module
import xgboost as xgb

# Default parameters
xgboost_clf = xgb.XGBClassifier()

# training the model
xgboost_clf.fit(X, y)
Output:

As you can see, we get an error because the model was unable to recognize the categorical values. That is why it is necessary to encode the categorical values before applying machine learning models.
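Since the Excel file is not reproduced here, the same kind of failure can be sketched with a small hypothetical DataFrame. As assumptions for this sketch, it uses scikit-learn's LogisticRegression instead of xgboost and places the categorical values in a feature column rather than in the target.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one categorical feature and a binary target
X = pd.DataFrame({"Marrige_Status": ["Married", "Single", "Single", "Married"]})
y = [1, 0, 0, 1]

model = LogisticRegression()
try:
    # The estimator tries to convert the strings to floats and fails
    model.fit(X, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("Training failed:", err)
```

The exact error message depends on the model and library version, but the cause is the same: the estimator expects numbers and receives strings.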
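The second reason for choosing the sklearn one hot encoder is that it treats every category equally. Encoding methods that map categories to integers (for example 0, 1, 2) imply an order and a magnitude that the categories do not really have. One-hot encoding gives each category its own column of ones and zeros, so there is no weight difference between the categories.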
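To illustrate the point about weight differences between categories, the sketch below compares integer (ordinal) encoding with one-hot encoding. The color column is a hypothetical example and is not part of the article's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical column with three unordered categories
colors = np.array([["red"], ["green"], ["blue"]])

# Ordinal encoding maps the sorted categories to 0, 1, 2 (blue=0, green=1, red=2),
# which suggests that red is "larger" than blue
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot encoding gives every category its own column instead
one_hot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.ravel())
print(one_hot)
```

With the ordinal codes, a model may treat red as two steps away from blue and green as one step away. With the one-hot rows, every pair of categories differs in the same way.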
Examples of Sklearn One Hot Encoder
Now we will work through several examples and learn how to apply the sklearn one hot encoder in Python to convert categorical values into numeric values. First, make sure that scikit-learn is installed on your system. You can install it with pip (pip install scikit-learn) and then check the installed version:
# importing sklearn module
import sklearn

# version checking
print(sklearn.__version__)
Output:
1.1.2
In my case, I have sklearn version 1.1.2 installed on my system.
Example-1: Sklearn One Hot Encoder
Let us first import the dataset and then print the first few rows to get familiar with it.
# importing pandas
import pandas as pd

# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')

# heading of data
data.head()
Output:

As you can see, we have categorical values. Now let us import the sklearn one hot encoder and encode the categorical values into numeric ones.
# importing sklearn one hot encoding
from sklearn.preprocessing import OneHotEncoder

# initializing one hot encoding
encoding = OneHotEncoder()

# applying one hot encoding in python
transformed_data = encoding.fit_transform(data[['Marrige_Status']])

# printing the encoded values as a dense array
print(transformed_data.toarray())
Output:

As you can see, the Marrige_Status column has been converted into binary values by the sklearn one hot encoder. The reason why we have only two columns is that there are only two categories in the original column, as shown below:
# Getting one hot encoding categories
print(encoding.categories_)
Output:

Now, let us add the encoded part back to our dataset and print it.
# adding the encoded values
data[encoding.categories_[0]] = transformed_data.toarray()

# deleting the unencoded one
data.drop('Marrige_Status', axis=1, inplace=True)

# data heading
data.head()
Output:

As you can see, the data has been converted into numeric values.
Example-2: ColumnTransformer With OneHotEncoder
As you have seen in the first example, we first converted the categories into numeric values and then added the numeric values back to the dataset. However, this process can be handled in one step by a column transformer. Let us see how it works by solving an example.
# importing modules
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')

# creating a transformer with one hot encoding
transformer = make_column_transformer(
    (OneHotEncoder(), ['Marrige_Status']),
    remainder='passthrough')

# applying transformation on the dataset directly
transformed = transformer.fit_transform(data)

# converting back to a DataFrame
# (get_feature_names() was removed in newer sklearn releases, so use get_feature_names_out())
transformed_df = pd.DataFrame(
    transformed,
    columns=transformer.get_feature_names_out()
)

# head
transformed_df.head()
Output:

As you can see, we have successfully transformed the categories into numeric values.
Example-3: One Hot Encoding on Multiple Columns
So far, we have used the one-hot encoding method to convert the categorical values of only one column. Now let us use the sklearn one hot encoder to convert multiple columns of a dataset.
We will use a built-in dataset from the Seaborn module.
# importing the dataset loader
from seaborn import load_dataset

# loading dataset
data = load_dataset('penguins')

# heading
data.head()
Output:

As you can see, we have multiple columns with different categories. Now we will apply the one-hot encoding method to multiple columns.
# importing modules
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# taking only the columns we need
data = data[['island', 'sex', 'body_mass_g']]

# dropping any null values
data = data.dropna()

# encoding multiple columns
transformer = make_column_transformer(
    (OneHotEncoder(), ['island', 'sex']),
    remainder='passthrough')

# transforming
transformed = transformer.fit_transform(data)

# converting back to a DataFrame
# (get_feature_names() was removed in newer sklearn releases, so use get_feature_names_out())
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())

# head
transformed_df.head()
Output:

As you can see, we have successfully encoded the multiple columns using sklearn one hot encoder method.
Summary
One-hot encoding in machine learning is the conversion of categorical information into a numeric format that can be fed into machine learning algorithms. It is a common method for processing categorical data. In this short article, we learned how to use the sklearn one hot encoder to convert categorical values into numeric values, both on a single column and on multiple columns with a column transformer.