Semi-supervised learning combines labeled and unlabeled examples to expand the pool of data available for model training. Machine learning is usually divided into three main branches: supervised learning, unsupervised learning, and reinforcement learning. Semi-supervised learning sits between the first two: it trains a model on a small labeled dataset together with a larger unlabeled one. In this article, we will discuss how semi-supervised learning works and how a model is trained partly on labeled and partly on unlabeled data. We will then implement semi-supervised learning in Python, first on a generated dataset and then on a real one.
What is Semi-supervised Learning in Machine Learning?
Semi-supervised learning combines labeled and unlabeled data to increase the amount of data available for model training. Because we no longer need to label thousands of samples by hand, we can often improve model performance while saving a lot of time. If you are familiar with a few supervised algorithms, semi-supervised learning will be easy to pick up.
Semi-supervised learning may sound complex, but the core idea is simple. One common approach is self-training, in which the model trains itself, starting from the labeled data available to it. It uses a large amount of unlabeled data together with a small amount of labeled data, combining the advantages of supervised and unsupervised learning without the difficulty of collecting many labels. As a result, you need far less labeled training data to build a model that can label the rest.
A text document classifier is a typical illustration of a semi-supervised learning application. Finding a large number of labeled text documents is difficult and expensive, because someone would have to read through complete documents just to categorize them. Semi-supervised learning lets the algorithm learn from a small number of labeled text documents while still making use of the large number of unlabeled documents in the training set.
How Does Self-training Work in Semi-supervised Learning in Python?
You might think that self-training is a highly complicated process or includes some sort of magic. However, the concept of self-training is actually quite simple and can be described by the steps listed below:
- We first collect both labeled and unlabeled data, but we train our first supervised model only using labeled observations.
- After that, we apply this model to predict the category of unlabeled data.
- The third stage involves choosing the predictions that meet our predetermined criteria (for example, a predicted probability above a chosen threshold) and merging these pseudo-labels with the labeled data.
- We repeat the process by using the labeled and pseudo-labeled data to train a new supervised model. Then, we recalculate our predictions and add fresh selections of observations to the pool of pseudo-labeled data.
- We repeat these stages until all of the data is labeled, no remaining prediction meets the criteria, or a maximum number of iterations is reached.
These are the simple steps that a semi-supervised model follows to train itself.
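The steps above can be sketched as a short loop. This is only a minimal illustration under stated assumptions, not a reference implementation: the synthetic data, the choice of `LogisticRegression` as the supervised model, the 0.9 confidence threshold, and the cap of 10 rounds are all choices made here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 500 points, of which only the first 50 keep their labels.
X, y_true = make_classification(n_samples=500, n_features=2, n_informative=2,
                                n_redundant=0, random_state=1)
y = y_true.copy()
y[50:] = -1                      # -1 marks "unlabeled"

threshold = 0.9                  # assumed confidence cut-off (the "criteria")
for iteration in range(10):      # assumed cap on the number of rounds
    labeled = y != -1
    if labeled.all():
        break                    # everything is labeled, so stop
    # Train a supervised model on everything that currently has a (pseudo-)label
    model = LogisticRegression().fit(X[labeled], y[labeled])
    # Predict class probabilities for the points that are still unlabeled
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break                    # no prediction meets the criteria, so stop
    # Accept the confident predictions as pseudo-labels
    unlabeled_idx = np.flatnonzero(~labeled)
    y[unlabeled_idx[confident]] = model.classes_[proba.argmax(axis=1)[confident]]

print("labeled points after self-training:", int((y != -1).sum()))
```

The original 50 labels are never overwritten; only rows marked -1 can receive a pseudo-label, and only when the model is confident enough.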

The figure above shows these steps. First, the model is trained on the small amount of labeled data and then used to predict outputs for part of the unlabeled data (not all of the unlabeled data is used at once). In the next step, the labeled data and the newly pseudo-labeled data are used together to retrain the model, which then predicts outputs for more of the unlabeled data. This process repeats until all of the unlabeled data has received predicted labels or no further predictions are accepted.
Label Propagation Algorithm in Semi-supervised Learning Python
Social media networks are expanding daily and have a global reach. Think of a social media network where you can target marketing efforts by knowing the interests of some users and predicting the interests of others. A graph-based semi-supervised machine learning technique known as label propagation can be used for this purpose. Using the iterative Label Propagation Algorithm (LPA), we assign labels to unlabeled nodes by spreading labels through the dataset. Xiaojin Zhu and Zoubin Ghahramani first proposed this algorithm in 2002.
To understand the working of the label propagation algorithm, let us assume we have a network of individuals with the label classes “like swimming” and “not like swimming” as shown below. Can we, therefore, predict whether the remaining individuals will be interested in swimming or not?

For LPA to work in this case, we have to make an assumption: an edge connecting two nodes carries a notion of similarity. In other words, if two people are connected, they are likely to have similar interests. This assumption is reasonable because people tend to connect with those who share their interests.
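To make the idea concrete, here is a small sketch of label spreading over a graph. It is a simplified illustration, not the exact algorithm from the paper or from any library: the six-person friendship graph, the two known labels, and the fixed 100 iterations are all assumptions made for the example. Each person repeatedly takes the average of their neighbours' label distributions, while the known labels are reset ("clamped") after every pass.

```python
import numpy as np

# Hypothetical friendship graph with 6 people (nodes 0..5).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0      # undirected edge = "these two are similar"

# Known labels: person 0 likes swimming (class 0), person 5 does not (class 1).
known = {0: 0, 5: 1}

# Row-normalise so each node averages its neighbours' label distributions.
T = A / A.sum(axis=1, keepdims=True)

# One row per node, one column per class; unknown nodes start undecided.
Y = np.full((n, 2), 0.5)
for node, cls in known.items():
    Y[node] = np.eye(2)[cls]

for _ in range(100):             # assumed fixed number of passes
    Y = T @ Y                    # spread labels along the edges
    for node, cls in known.items():
        Y[node] = np.eye(2)[cls] # clamp the known labels

predicted = Y.argmax(axis=1)
print(predicted)                 # prints [0 0 0 1 1 1]
```

People 1 and 2 are tightly connected to person 0 and end up in the "like swimming" class, while people 3 and 4 sit closer to person 5 and end up in the other class.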
Semi-supervised Learning in Python
Now we will apply semi-supervised learning in Python and train models on labeled and unlabeled data. In this section, we will first create a dataset for semi-supervised learning and then see how to fit a model to it. After that, we will use a real dataset from Kaggle to explore self-training.
Before going to the implementation part, make sure that you have installed the following Python libraries.
- scikit-learn (imported as sklearn)
- pandas
- numpy
- matplotlib
- seaborn
How to Create a Dataset for Semi-supervised Learning in Python
Let us now create a random dataset for semi-supervised learning. We will use the make_classification() function to create a binary classification dataset with two input variables and 2000 rows.
# importing the make_classification function
from sklearn.datasets import make_classification
# creating a random dataset with two input variables
Input, output = make_classification(n_samples=2000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
As you can see, we have created input and output variables that contain the corresponding data. The next step is to split the dataset into testing and training parts. Here, we will assign 50% to the training part and the remaining 50% to the testing part.
# importing the train_test_split function
from sklearn.model_selection import train_test_split
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(Input, output, test_size=0.50, random_state=1, stratify=output)
As you can see, we split the dataset into training and testing parts.
In the next step, we split the training data in half again: one portion keeps its labels, and the other is the portion we will pretend is unlabeled. We again set random_state to 1. Note that, despite their names, X_test_unlab and y_test_unlab come from the training data; they hold the unlabeled portion, not the test set.
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
Label Propagation in Sklearn Module
Now that our data is ready, we will implement the label propagation algorithm using the sklearn module. Training a label propagation model is similar to training any other machine learning model. Before training, however, we have to prepare the training dataset by concatenating the labeled and unlabeled training inputs into a single array.
# importing module
from numpy import concatenate
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
In the next step, we set the output of every unlabeled value to -1, which is the marker sklearn uses for a missing label.
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
Now, we can concatenate the above list with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset as shown below:
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
Now our dataset is fully ready to train the label propagation model. Let us import the label propagation model from sklearn module and train the model on the given training dataset.
# importing the module
from sklearn.semi_supervised import LabelPropagation
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
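After fitting, it can be useful to look at the labels the model inferred for the rows that were marked -1. The sketch below is a self-contained illustration on a smaller generated dataset (600 rows, an assumption made here to keep it quick). It reads the fitted model's `transduction_` attribute, which holds one label per training row, and compares the inferred labels with the true labels we pretended not to know.

```python
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Smaller stand-in dataset for this illustration
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)

# Labeled rows first, then the rows we pretend are unlabeled (-1)
X_mixed = concatenate((X_lab, X_unlab))
y_mixed = concatenate((y_lab, [-1] * len(y_unlab)))

model = LabelPropagation().fit(X_mixed, y_mixed)

# transduction_ has one entry per training row; slice off the unlabeled part
inferred = model.transduction_[len(y_lab):]
agreement = (inferred == y_unlab).mean()
print(f"inferred labels matching the hidden true labels: {agreement:.1%}")
```

In a real project the hidden labels would not be available, so this check only works in experiments like this one where the labels were masked on purpose.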
Once the training is complete, we will use the model to make predictions.
# make predictions on hold out test set
y_pred = model.predict(X_test)
Now let us evaluate the performance of the model using different evaluation metrics.
Evaluating label propagation model
Now we will use the confusion matrix to evaluate the performance of the model. A confusion matrix is one of the standard evaluation tools for classification models: it counts how often each actual class was predicted as each class.
# importing seaborn
import seaborn as sns
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
# providing actual and predicted values
cm = confusion_matrix(y_test, y_pred)
# annot=True writes the count in each cell; fmt='d' prints it as an integer
sns.heatmap(cm, annot=True, fmt='d')
Output:

The simplest way to understand the confusion matrix is that all the values that are in the main diagonal show the number of correctly classified items.
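As a small worked example of that reading, the accuracy can be computed directly from the diagonal. The matrix below is made up for illustration and is not the output of the model above.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix (rows: actual class, columns: predicted class)
cm = np.array([[480, 20],
               [30, 470]])

correct = np.trace(cm)           # sum of the main diagonal = correctly classified items
accuracy = correct / cm.sum()    # divided by the total number of items
print(correct, accuracy)         # prints 950 0.95
```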
Let us also calculate the classification report of the model.
#importing the classification report
from sklearn.metrics import classification_report
# printing the classification report
print(classification_report(y_test, y_pred))
Output:

As you can see, we get an accuracy score of 94% on the held-out test set, even though only half of the training examples were labeled.
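A number like this is easier to judge next to a purely supervised baseline trained on the labeled portion alone. The sketch below rebuilds the same generated dataset and splits, then compares the two. The choice of `LogisticRegression` as the baseline is an assumption made here for illustration; any supervised classifier could take its place.

```python
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Same generated data and splits as in the walkthrough
X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=1, stratify=y)
X_lab, X_unlab, y_lab, _ = train_test_split(
    X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

# Baseline: supervised model that sees only the labeled half
baseline = LogisticRegression().fit(X_lab, y_lab)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Semi-supervised: label propagation on labeled + unlabeled rows
X_mixed = concatenate((X_lab, X_unlab))
y_mixed = concatenate((y_lab, [-1] * len(X_unlab)))
semi = LabelPropagation().fit(X_mixed, y_mixed)
semi_acc = accuracy_score(y_test, semi.predict(X_test))

print(f"supervised baseline: {baseline_acc:.3f}")
print(f"label propagation:   {semi_acc:.3f}")
```

If the semi-supervised score is not clearly better than the baseline, the unlabeled data is adding little for that particular dataset and model.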
Self-training in Python – Semi-supervised Learning
Now we will use a real dataset and the self-training approach to semi-supervised learning to learn from labeled and unlabeled data. We will use a marketing campaign dataset from Kaggle.
Let us import the dataset first.
# importing pandas
import pandas as pd
# Read in data
data = pd.read_csv('marketing_campaign.csv',
encoding='utf-8', delimiter=';',
usecols=['ID', 'Year_Birth', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'MntWines', 'MntMeatProducts']
)
# Create a flag to denote whether the person has any dependents at home (either kids or teens)
data['Dependents_Flag'] = ((data['Kidhome'] + data['Teenhome']) > 0).astype(int)
# Print dataframe
data.head()
Output:

Dependents_Flag is going to be the output class of our model. It shows whether a supermarket shopper has any kids or teens at home. So, based on what a person spends on meat products and wine, we will predict whether they have dependents at home.
Before training the model, we first need to split the dataset. We will assign 75% of the data to the training part and the remaining 25% to the testing part.
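# splitting dataset into testing and training parts
df_train, df_test = train_test_split(data, test_size=0.25)
# work on an explicit copy so that adding columns later does not trigger pandas' SettingWithCopyWarning
df_train = df_train.copy()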
# Put test data into an array
X_test=df_test[['MntMeatProducts', 'MntWines']]
y_test=df_test['Dependents_Flag'].values
Let’s now mask 95% of the training data’s labels and make a target variable that uses the value “-1” to indicate unlabeled (masked) data as we did in the above section.
# Create a flag for label masking
df_train['Random_Mask'] = True
df_train.loc[df_train.sample(frac=0.05, random_state=0).index, 'Random_Mask'] = False
# Create a new target column: keep the real label where Random_Mask is False, otherwise use -1
df_train['Dependents_Target'] = df_train['Dependents_Flag'].where(~df_train['Random_Mask'], -1)
# Show target value distribution
print('Target Value Distribution:')
print(df_train['Dependents_Target'].value_counts())
Output:

Self-training Classifier in Python
Now we will initialize the self-training classifier in Python. We will use SVM as a base estimator in this section.
# importing sklearn
from sklearn.svm import SVC
# Select data for modeling
X_train=df_train[['MntMeatProducts', 'MntWines']]
y_train=df_train['Dependents_Target'].values
# Specify SVC model parameters (probability=True enables predict_proba, which self-training requires)
model_svc = SVC(probability=True)
Let us now initialize the self-training classifier and fit the model.
# importing selftraining classifier
from sklearn.semi_supervised import SelfTrainingClassifier
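# Specify Self-Training model parameters
# (the estimator is passed positionally because its keyword name differs across
# scikit-learn versions: base_estimator in older releases, estimator in newer ones)
self_training_model = SelfTrainingClassifier(model_svc)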
# Fit the model
clf_ST = self_training_model.fit(X_train, y_train)
Once the training is complete, we will then move toward the evaluation part. We will use the accuracy score to evaluate the performance of the model.
# calculating the accuracy score
accuracy_score_ST = clf_ST.score(X_test, y_test)
print('Accuracy Score: ', accuracy_score_ST)
Output:

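As you can see, we get an accuracy score of about 80%. Because the train/test split above is not seeded, your exact number may differ slightly from run to run.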
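Beyond the accuracy score, the fitted self-training classifier records how the pseudo-labeling went. The sketch below uses generated data rather than the marketing dataset so that it runs on its own; the 0.8 threshold and the share of hidden labels are assumptions chosen for illustration. It reads `labeled_iter_` (the round in which each row received its label, with 0 for rows labeled from the start and -1 for rows never labeled) and `termination_condition_` (why the loop stopped).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Generated stand-in data for this illustration
X, y_true = make_classification(n_samples=400, n_features=2, n_informative=2,
                                n_redundant=0, random_state=1)
y = y_true.copy()
rng = np.random.RandomState(0)
y[rng.rand(len(y)) < 0.9] = -1           # hide roughly 90% of the labels

# threshold is the minimum predicted probability needed to accept a pseudo-label
clf = SelfTrainingClassifier(SVC(probability=True, random_state=0), threshold=0.8)
clf.fit(X, y)

n_initial = int((clf.labeled_iter_ == 0).sum())    # labeled from the start
n_pseudo = int((clf.labeled_iter_ > 0).sum())      # pseudo-labeled during training
n_left = int((clf.labeled_iter_ == -1).sum())      # never confidently labeled
print("initially labeled:", n_initial)
print("pseudo-labeled:   ", n_pseudo)
print("left unlabeled:   ", n_left)
print("stopped because:  ", clf.termination_condition_)
```

A large "left unlabeled" count suggests the threshold is too strict for the base estimator, while a very low threshold risks training on wrong pseudo-labels.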
Summary
Semi-supervised learning (SSL) is a machine learning technique that uses a small portion of labeled data and a large amount of unlabeled data to train a predictive model. In this article, we discussed how semi-supervised learning, self-training, and label propagation work. We also implemented both approaches in Python, on a generated dataset and on a real one.