One of the important steps before training a model in Machine Learning is splitting the dataset. We usually split the data into training and testing parts. The training data is the part of the dataset that is used in the training process. In supervised learning, training data is also known as labeled data because it contains both inputs and outputs. The testing data is often described as an unlabeled dataset because the model is given only its input values when making predictions; the corresponding output values are held back and used later for evaluation. Let us dive into the details and understand what testing data is in machine learning.
Splitting Data Set in Machine Learning
Splitting the dataset is the process of dividing the data into training and testing parts before training the model. Let us assume that we have a sample dataset and we are splitting the dataset into testing and training parts. Here, for the sake of practice, we will use the Sklearn module to split the dataset.
# importing the module
from sklearn.model_selection import train_test_split

# splitting into testing and training parts
# (Input and Output are the input and output values of the sample dataset)
X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size=0.25)
Here we are splitting the dataset (Input, Output) into training and testing parts. X_train and y_train contain 75% of the dataset, while the remaining 25% is assigned to the testing part (X_test and y_test). You can change the split proportion through the test_size parameter. A test size between 20% and 30% is a common choice.
Another important parameter is random_state, which controls the shuffling applied to the data before it is split. Setting it to a fixed integer makes the split reproducible, so the same rows end up in the training and testing parts on every run.
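To make the effect of test_size and random_state concrete, here is a minimal sketch. The arrays Input and Output are made-up sample data (an assumption, since the article does not show its dataset), and the value 42 is an arbitrary seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical sample data: 100 rows, 2 input features, 1 output per row
Input = np.arange(200).reshape(100, 2)
Output = np.arange(100)

# Split twice with the same random_state (42 is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    Input, Output, test_size=0.25, random_state=42
)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    Input, Output, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)      # (75, 2) (25, 2)
print(np.array_equal(X_test, X_test2))  # True: both splits are identical
```

If random_state is left unset, each run shuffles the rows differently, so results can vary slightly from one run to the next.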
What is Training Data in Machine Learning?
The training data in machine learning, also known as the labeled dataset, is a dataset that contains both input and output values and is used to train the model. The ML model goes through this data and tries to learn the relation between the input and the output values so that later it can use this information to make predictions on the testing data or on new inputs.
In the training part, we use both X_train and y_train. X_train contains the input values and y_train contains the corresponding output values. In Sklearn, and in libraries that follow its API such as LightGBM, we use the fit method to train the model.
# importing the LightGBM module
import lightgbm as lgb

# initializing the model
model_reg = lgb.LGBMRegressor()

# train the model
model_reg.fit(X_train, y_train)
Here we are using the LightGBM model and training it on the training dataset. Notice that the training function (fit) takes both the input and the output values.
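As a rough illustration of what "learning the relation between input and output" means, the sketch below fits a model on made-up data that follows the rule y = 3x + 2. It uses Sklearn's LinearRegression as a simple stand-in for the LightGBM model (an assumption made so that the learned relation can be read off directly), and the data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data that follows the exact rule y = 3x + 2
X_train = np.arange(10, dtype=float).reshape(-1, 1)  # inputs, one feature
y_train = 3 * X_train.ravel() + 2                    # matching outputs

# fit() receives both the inputs and the outputs
model = LinearRegression()
model.fit(X_train, y_train)

# The learned slope and intercept should be close to 3 and 2
print(model.coef_[0], model.intercept_)
```

The model never sees the rule itself, only the input and output pairs, yet it recovers the slope and intercept from them.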
What is Testing Data in Machine Learning?
The testing data is also known as an unlabeled dataset because only its input values are given to the model; the output values are withheld. The model is expected to use the testing data to make predictions. The predict() method, part of the Sklearn estimator API, is used to make predictions on the testing dataset.
# Making predictions
reg_pred = model_reg.predict(X_test)
As you can see, the predict function only takes the input value, and the output is expected from the model.
What is the Role of y_test in Machine Learning?
Now we know the role of X_train, y_train, and X_test. X_train and y_train are used in the training process of the model, and X_test is used in the prediction process. Then what is the role of y_test? It contains the actual output values for X_test. In the prediction part, we passed only X_test and expected the model to give us predictions. How are we going to know whether these predictions are accurate or not? To evaluate the predictions, we need to compare them with the actual output values, which are stored in y_test.
That is why the y_test and the predictions are used in evaluation metrics so that we can know the performance of our model.
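The comparison between predictions and y_test can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, LinearRegression stands in for the LightGBM model, and mean absolute error and R^2 are example metrics, since the article does not name a specific one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: a linear relation plus a little random noise
rng = np.random.default_rng(0)
Input = rng.uniform(0, 10, size=(200, 1))
Output = 3 * Input.ravel() + 2 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    Input, Output, test_size=0.25, random_state=0
)

# Train on the training part, predict from the testing inputs only
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# y_test is used only here, to score the predictions
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:.3f}, R^2: {r2:.3f}")
```

A low error and an R^2 close to 1 suggest that the predictions track the actual outputs well on data the model did not see during training.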
Testing data is very important in machine learning because it helps us evaluate the performance of the model. Without a testing dataset, we would not be able to know whether the model is giving accurate results or not. In this short article, we went through the concept of splitting the dataset and then explained the training and testing datasets.