Do you want to know how the random state in the sklearn module works and how it affects the formation of clusters? Well, here we go!
Machine learning starts with data. To evaluate a model, we usually divide the data into training and testing parts. To keep the evaluation fair, this split should be random, and in sklearn the randomness is controlled by the random_state parameter. It sets the seed of the random number generator so that the results we get can be reproduced. Because the split is randomized, different rows end up in the training and testing parts on every run unless we fix that seed. In this article, we will see how the random state works on a small example dataset. The random state also affects how clusters form in clustering algorithms, and we will discuss that as well.
What is the Random State in Sklearn Module?
Scikit-learn is an open-source data analysis library and is widely regarded as the gold standard for Machine Learning (ML) in the Python ecosystem. It provides ready-made implementations of many algorithms, for example classification, which identifies and categorizes data based on patterns. Because these models are already implemented, we can just import them and train them on a dataset. To know how well a model will perform, we usually divide the dataset into training and testing parts, train the model on the first, and measure its performance on the second. By default this split is random, which means that every time we split the data, different rows land in the testing part, and that is not what we want when results need to be comparable. Fixing the random state sends the same rows to the training and testing parts on every run.
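To make the idea of a seed concrete, here is a minimal sketch using NumPy's RandomState, which is the kind of generator sklearn creates when random_state is an integer. The variable names are only illustrative and are not part of the original example.

```python
import numpy as np

# Two independent generators created with the same seed (illustrative names)
rng_a = np.random.RandomState(1)
rng_b = np.random.RandomState(1)

# Each generator shuffles the numbers 0..9
perm_a = rng_a.permutation(10)
perm_b = rng_b.permutation(10)

# Because the seeds match, both shuffles come out in the same order
same_seed_matches = np.array_equal(perm_a, perm_b)
print(perm_a)
print(perm_b)
print("Same seed, same order:", same_seed_matches)
```

The order is still random in the sense that it is shuffled, but it is fully determined by the seed, which is what makes it repeatable.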
The following example figure shows how the random state works while splitting the dataset.

As you can see, the figure compares three settings over repeated runs. In the first column, no random state is set, so each run returns different data. In the second column, random_state is fixed to 1: the data is still selected randomly, but it does not change when we run the code again. The third column confirms this: with random_state set to 2, we get different data from the second column, but again the same data on every run.
In conclusion, any non-negative integer (from 0 up to 2**32 - 1) can be assigned to random_state. A different number generally selects the data differently, while the same number reproduces the same selection.
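The behaviour described above can be checked with a short sketch. It assumes a toy array of the integers 0 to 9 instead of a real dataset, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

values = np.arange(10)  # toy data: the integers 0..9

# Same seed twice: the testing parts are identical
_, test_first = train_test_split(values, test_size=0.3, random_state=1)
_, test_again = train_test_split(values, test_size=0.3, random_state=1)

# A different seed: another selection, which is just as repeatable
_, test_other = train_test_split(values, test_size=0.3, random_state=2)

# No seed at all: the selection may change on every run
_, test_unseeded = train_test_split(values, test_size=0.3)

print("random_state=1      :", test_first)
print("random_state=1 again:", test_again)
print("random_state=2      :", test_other)
print("no random_state     :", test_unseeded)
```

Running the script several times should leave the first three lines unchanged, while the last line is free to vary.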
How Does the Random State Work in Sklearn Module?
By now, you have a basic idea of what the random state is and how it works. In this section, we will go deeper and implement it in Python. For simplicity, we will create a small dataset by hand and then split it into training and testing parts using the random state in sklearn.
Let us first create a dataset.
import pandas as pd
# Assign the data as a dictionary of lists.
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 'Age': [20, 21, 19, 18, 2, 23, 56]}
# Create the DataFrame.
df = pd.DataFrame(data)
# Print the output.
print(df)
Output:

Now, let us split the dataset into inputs and outputs. Then we will split the dataset into testing and training parts and we will fix the random state to 1.
# splitting the data into input and output
X = df['Name']
y = df['Age']
# importing the module
from sklearn.model_selection import train_test_split
# fixing the random state in sklearn to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# printing the training and testing parts
print("Training part is :\n", X_train)
print("Testing part is :\n", X_test)
Output:

As you can see, the data has been randomly assigned to the training and testing parts, but if you run the same code again, the values will not change. Because the random state is fixed, every run produces the same random split. If we change the value of the random state, we get a different split, as shown below:
# random state is fixed to 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
# printing the training and testing parts
print("Training part is :\n", X_train)
print("Testing part is :\n", X_test)
Output:

As you can see, this time we get different values for the testing and training parts because we changed the value of the random state.
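Since a different seed puts different rows in the testing part, the score measured on that part can change too. The sketch below is only an illustration under assumptions that are not in the original article: it uses synthetic data from make_classification and a LogisticRegression model, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, itself generated with a fixed seed so the example is repeatable
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

scores = {}
for seed in range(5):
    # Only the seed of the split changes between iterations
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.30, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores[seed] = model.score(X_te, y_te)  # accuracy on the testing part

for seed, accuracy in scores.items():
    print(f"random_state={seed}: accuracy={accuracy:.3f}")
```

Any spread between the printed accuracies comes from the split alone, which is a reason to report the seed alongside a result.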
Random State in k-means Clustering
An important thing to note is that the random state also has a strong effect on the formation of clusters in clustering algorithms. In this section, we will take the k-means clustering algorithm and see how the formation of clusters is affected by changing the random state in sklearn.
Let us create a random dataset and visualize it using a scatter plot.
import numpy as np
# get a random matrix of shape (1000, 2) with values in the range [0, 100)
matrix = np.random.random((1000, 2)) * 100
# importing the module
import matplotlib.pyplot as plt
# image size
plt.figure(figsize=(10, 5))
# plotting the scatter graph
plt.scatter(x=matrix[:, 0], y=matrix[:, 1], c='m')
plt.show()
Output:

As you can see, we have a fully randomly created dataset. Now, we will apply the k-means clustering with different random states to see how the formation of clusters changes.
# importing the k-means
from sklearn.cluster import KMeans
# creating three side-by-side subplots
fig, ax = plt.subplots(1, 3, gridspec_kw={'wspace': 0.3}, figsize=(15, 5))
# k-means clustering with a different random state for each subplot
for i in range(3):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=i)
    km.fit(matrix)
    ax[i].scatter(x=matrix[:, 0], y=matrix[:, 1], c=km.labels_)
plt.show()
Output:

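As you can see, each time we change the random state, we get different clusters, because k-means starts from different randomly chosen initial centroids. So it is important to fix the random state when the clusters need to be reproducible, and to keep in mind that a single random initialization may not give the best clustering. This is why KMeans can run several initializations (the n_init parameter) and keep the best one.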
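One way to compare the clusterings produced by different seeds is the inertia_ attribute of a fitted KMeans model (the sum of squared distances of the points to their closest centroid, where lower is better). The following is a minimal sketch on seeded toy data; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Seeded toy data so the example itself is repeatable
rng = np.random.RandomState(42)
points = rng.random_sample((300, 2)) * 100

# One random initialization per seed: the final inertia can depend on the seed
single_run = {
    seed: KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    .fit(points)
    .inertia_
    for seed in range(3)
}

# Ten initializations: KMeans keeps the run with the lowest inertia
best_of_ten = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(points)

# Refitting with the same settings and seed reproduces the same result
refit = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(points)

for seed, inertia in single_run.items():
    print(f"n_init=1,  random_state={seed}: inertia={inertia:.1f}")
print(f"n_init=10, random_state=0: inertia={best_of_ten.inertia_:.1f}")
print("Refit matches:", np.isclose(best_of_ten.inertia_, refit.inertia_))
```

Using several initializations with a fixed random_state keeps the result reproducible without relying on one particular seed happening to work well.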
Summary
The random state is simply the seed of a random operation, something like the lot number of the set it generates. We can specify this number whenever we want the same set again. In sklearn, the random state decides which rows go to the training and testing parts, so it can change the measured performance of a model. In clustering models such as k-means, it decides the initial centroids and therefore the clusters that form. In this article, we discussed how the random state in the sklearn module works. Furthermore, we saw how it affects the formation of clusters in clustering models.