The 2022 FIFA World Cup is the 22nd edition of the international association football tournament contested by the men’s national teams of FIFA’s member associations. Football fans love to predict the winner, and the 2022 tournament is no exception. Such predictions are never certain, because a match is ultimately decided by how the teams perform over 90 minutes, but a data-driven model can make them more likely to be right. In this short article, we present the predicted winner of the FIFA World Cup 2022 in Qatar according to a trained machine learning model. The model also identifies the teams expected to reach the semi-finals and the final. Even if you are new to machine learning, we will try to explain each part of the code step by step.
Predicting the winner of FIFA world cup 2022 using machine learning
Machine learning is a rapidly evolving technology that allows computers to learn automatically from past data. It employs a variety of algorithms to build mathematical models that make predictions based on that data. Common use cases include image and video analysis, speech recognition, email filtering, and recommendation and forecasting systems.
Machine learning models need preprocessed datasets to find hidden patterns and come up with predictions. Here we walk through the prediction of the FIFA World Cup 2022 winner made by Sergio Pessoa, who published his machine learning project on Kaggle.
Exploring the dataset for the FIFA 2022 winner
The idea is to simulate the FIFA World Cup 2022 games with machine learning in order to predict the competition’s winner. The project uses two datasets: international football results from 1872 to 2022 and the FIFA World Ranking from 1992 to 2022. The task is framed as binary classification because that makes the model’s results easy to analyze: the model predicts either a win for the home team or a draw/win for the away team. Because there is no home advantage in the World Cup, the project then removes the bias introduced by the home and away labels: each game is predicted twice, with the two teams swapped between home and away, and the mean of the two predictions is used as the probability.
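The home/away averaging can be sketched as follows. This is a minimal illustration rather than the project's own code: `predict_home_win` and `toy_model` are hypothetical stand-ins for a trained classifier, and the strength values are made up. Note that in the binary setup the complement of a home win also contains draws, so the second term is an approximation.

```python
from typing import Callable, Tuple


def symmetric_win_probs(
    predict_home_win: Callable[[str, str], float],
    team_a: str,
    team_b: str,
) -> Tuple[float, float]:
    """Average two predictions, one with each team labelled 'home'.

    predict_home_win(home, away) is assumed to return the probability
    that the team labelled 'home' wins.
    """
    # Prediction 1: team_a is labelled home.
    p_a_as_home = predict_home_win(team_a, team_b)
    # Prediction 2: team_b is labelled home; team_a gets the complement
    # (which, in the binary setup, lumps draws together with a team_a win).
    p_a_as_away = 1.0 - predict_home_win(team_b, team_a)
    p_a = (p_a_as_home + p_a_as_away) / 2.0
    return p_a, 1.0 - p_a


# Hypothetical model: made-up strengths plus a fixed bonus for the 'home' label.
STRENGTH = {"Team A": 0.60, "Team B": 0.55}
HOME_BONUS = 0.10


def toy_model(home: str, away: str) -> float:
    raw = 0.5 + (STRENGTH[home] - STRENGTH[away]) + HOME_BONUS
    return min(1.0, max(0.0, raw))


p_team_a, p_team_b = symmetric_win_probs(toy_model, "Team A", "Team B")
print(round(p_team_a, 3), round(p_team_b, 3))
```

In this toy setup the home bonus cancels out of the average, which is exactly the effect the swap is meant to have.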
The very first step is to load the dataset, inspect it, and make it suitable for training the model.
Next, the project checks for null values, drops any rows that contain them, and sorts the data by date.
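These loading and cleaning steps might look like the sketch below. The column names and the inline CSV are assumptions standing in for the real results file, which would normally be read from disk with the same `pd.read_csv` call.

```python
import io

import pandas as pd

# Tiny inline stand-in for the results file; the column names are assumed.
RESULTS_CSV = """date,home_team,away_team,home_score,away_score,tournament
2022-06-05,Portugal,Switzerland,4,0,UEFA Nations League
2018-07-15,France,Croatia,4,2,FIFA World Cup
2022-11-20,Qatar,Ecuador,,,FIFA World Cup
"""

df = pd.read_csv(io.StringIO(RESULTS_CSV), parse_dates=["date"])

# Count missing values per column before dropping anything.
print(df.isna().sum())

# Drop rows with any missing value, then sort chronologically.
df = df.dropna().sort_values("date").reset_index(drop=True)
print(df[["date", "home_team", "away_team"]])
```

The row without scores is dropped, and the remaining games end up in date order.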
Now it is time to import the second dataset, the FIFA rankings.
Next, the project adjusts the names of some World Cup 2022 teams, because they appear under different names in the two datasets.
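A name adjustment of this kind can be done with a simple mapping. The sketch below assumes a ranking column called `country_full` and shows three plausible renames; the exact list used in the project is not reproduced here.

```python
import pandas as pd

# Assumed ranking table with FIFA-style country names.
rank = pd.DataFrame(
    {"country_full": ["IR Iran", "Korea Republic", "USA", "Brazil"]}
)

# Illustrative mapping from ranking names to the names used in the results data.
NAME_MAP = {
    "IR Iran": "Iran",
    "Korea Republic": "South Korea",
    "USA": "United States",
}

# Series.replace with a dict swaps exact matches and leaves other values alone.
rank["country_full"] = rank["country_full"].replace(NAME_MAP)
print(rank["country_full"].tolist())
```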
Now the two datasets need to be merged into a single dataset for the FIFA World Cup 2022 prediction.
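One way to attach the most recent ranking to each game is `pandas.merge_asof`, sketched below. This is an assumption about how the merge could be done, not necessarily how the project does it; the team names, column names, and rank values are invented for the example.

```python
import pandas as pd

games = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-03-24", "2022-06-05"]),
        "home_team": ["Team A", "Team A"],
        "away_team": ["Team B", "Team C"],
    }
)

# Made-up ranking snapshots published on two dates.
rank = pd.DataFrame(
    {
        "rank_date": pd.to_datetime(["2022-02-10"] * 3 + ["2022-03-31"] * 3),
        "country_full": ["Team A", "Team B", "Team C"] * 2,
        "rank": [8, 39, 14, 9, 43, 13],
    }
)


def attach_rank(games: pd.DataFrame, rank: pd.DataFrame, side: str) -> pd.DataFrame:
    """Add rank_<side>: the latest ranking on or before each game date."""
    side_rank = rank.rename(
        columns={
            "rank_date": "date",
            "country_full": f"{side}_team",
            "rank": f"rank_{side}",
        }
    ).sort_values("date")
    # merge_asof needs both frames sorted by the 'on' key.
    return pd.merge_asof(
        games.sort_values("date"), side_rank, on="date", by=f"{side}_team"
    )


merged = attach_rank(attach_rank(games, rank, "home"), rank, "away")
print(merged)
```

The default backward search means a game never sees a ranking published after it was played.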
Now the data is ready and needs some feature engineering to yield more insight.
Feature engineering of data for FIFA world cup 2022
Feature engineering refers to the manipulation (addition, deletion, combination, mutation) of your data set to improve machine learning model training, leading to better performance and greater accuracy. Effective feature engineering is based on sound knowledge of the business problem and the available data sources.
In our case, the features that may influence a game are the following:
- Past game points made
- Past goals scored and suffered
- The importance of the game (friendly or not)
- The rank of the teams
- Rank increase of the teams
- Goals made and suffered by ranking faced
Now the project creates a feature that records which team won and how many points each team earned in the game.
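A feature of this kind could be built as in the sketch below. The result codes (0 for a home win, 1 for an away win, 2 for a draw) and the column names are assumptions made for the example; the points follow the standard 3/1/0 football convention.

```python
from typing import Tuple

import pandas as pd


def result_and_points(home_score: int, away_score: int) -> Tuple[int, int, int]:
    """Return (result_code, home_points, away_points) for one game."""
    if home_score > away_score:
        return 0, 3, 0  # home win
    if home_score < away_score:
        return 1, 0, 3  # away win
    return 2, 1, 1  # draw


games = pd.DataFrame({"home_score": [2, 0, 1], "away_score": [1, 0, 3]})

outcomes = [
    result_and_points(h, a)
    for h, a in zip(games["home_score"], games["away_score"])
]
# Unzip the list of 3-tuples into three new columns.
games["result"], games["home_points"], games["away_points"] = zip(*outcomes)
print(games)
```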
Points follow the usual football convention: 3 for a win, 1 for a draw, and 0 for a loss. Also, a team's FIFA ranking points and its FIFA rank are expected to be negatively correlated (more points means a numerically lower rank), so only one of the two should be used to create new features.
A confusion matrix is a table used to evaluate the performance of a classification model, that is, a model that predicts discrete output values. The scikit-learn confusion matrix compares actual values with predicted values, which helps us understand how good the model is at predicting.
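As a small illustration, the scikit-learn function can be called on two lists of labels. The labels below are invented for the demo (1 for a home win, 0 for a draw or away win).

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for the demo: 1 = home win, 0 = draw or away win.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0]
#  [1 2]]
```

Here both actual 0s were predicted correctly, while one of the three actual 1s was misclassified as 0.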
Next, the project creates columns that help build the features: the ranking difference, the points won in the game relative to the rank of the team faced, and the goal difference in the game. All features that are not differences should be created for both teams (home and away).
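These helper columns are plain column arithmetic. The sketch below is one possible reading of the description: apart from `rank_dif`, the column names are hypothetical, and dividing points by the opponent's rank is an assumed interpretation of "points vs. team faced rank".

```python
import pandas as pd

# Invented example games with ranks and points already attached.
games = pd.DataFrame(
    {
        "home_score": [2, 0],
        "away_score": [1, 0],
        "home_points": [3, 1],
        "away_points": [0, 1],
        "rank_home": [5, 30],
        "rank_away": [20, 10],
    }
)

# Difference features: one value per game.
games["rank_dif"] = games["rank_home"] - games["rank_away"]
games["goal_dif"] = games["home_score"] - games["away_score"]

# Non-difference features: one column per side.
games["points_home_by_rank"] = games["home_points"] / games["rank_away"]
games["points_away_by_rank"] = games["away_points"] / games["rank_home"]
print(games)
```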
Next, the project separates the home teams from the away teams.
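One way to do this separation is to build one table per side and stack them into a single per-team history, as in the sketch below. The column names (`team`, `goals`, `goals_suf`) and the sample games are assumptions.

```python
import pandas as pd

games = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-01", "2022-01-08"]),
        "home_team": ["Team A", "Team B"],
        "away_team": ["Team B", "Team A"],
        "home_score": [2, 1],
        "away_score": [0, 1],
    }
)

COLUMNS = ["date", "team", "goals", "goals_suf"]

# View each game from the home side ...
home = games.rename(
    columns={"home_team": "team", "home_score": "goals", "away_score": "goals_suf"}
)[COLUMNS]
# ... and from the away side, where scored and suffered goals swap roles.
away = games.rename(
    columns={"away_team": "team", "away_score": "goals", "home_score": "goals_suf"}
)[COLUMNS]

team_stats = (
    pd.concat([home, away], ignore_index=True)
    .sort_values(["team", "date"])
    .reset_index(drop=True)
)
print(team_stats)
```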
The predictive features that this project will consider are as follows:
The last step of feature engineering is to remove the rows for which the means could not be calculated.
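A sketch of how such means and the final clean-up could work is shown below. It assumes a per-team history table and a hypothetical `goals_mean_l5` column holding the mean of a team's previous games (up to five); a team's first game has no history, so its mean is missing and the row is dropped.

```python
import pandas as pd

# Invented per-team history.
team_stats = pd.DataFrame(
    {
        "team": ["Team A"] * 3 + ["Team B"] * 2,
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-08", "2022-01-15", "2022-01-01", "2022-01-08"]
        ),
        "goals": [1, 3, 2, 0, 2],
    }
).sort_values(["team", "date"])

# shift(1) excludes the current game, so the mean only uses past games.
team_stats["goals_mean_l5"] = team_stats.groupby("team")["goals"].transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).mean()
)

rows_before = len(team_stats)
team_stats = team_stats.dropna(subset=["goals_mean_l5"]).reset_index(drop=True)
print(rows_before, "->", len(team_stats))
```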
Data analysis of FIFA world cup 2022
This section is all about analyzing data and checking if the given features are strong enough to predict the FIFA world cup 2022 using a machine learning model.
As the first part of the data analysis, the project uses violin plots to check the distribution of the features. A violin plot is a hybrid of a box plot and a kernel density plot, so it shows peaks in the data. It is used to visualize the distribution of numerical data: unlike a box plot, which shows only summary statistics, a violin plot depicts both the summary statistics and the density of each variable.
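For readers who have not drawn one before, here is a minimal violin plot using Matplotlib. The data is synthetic and only mimics a feature such as `rank_dif` split by outcome; the project's own plots may have been made with a different library.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic rank differences for the two target classes (not real data).
rng = np.random.default_rng(0)
rank_dif_home_win = rng.normal(loc=-10, scale=15, size=200)
rank_dif_other = rng.normal(loc=8, scale=15, size=200)

fig, ax = plt.subplots(figsize=(5, 4))
parts = ax.violinplot([rank_dif_home_win, rank_dif_other], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["home win", "draw / away win"])
ax.set_ylabel("rank_dif")

# Render to an in-memory PNG; use fig.savefig("plot.png") to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

A feature separates the classes well when the two violins sit at clearly different heights.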
The project finds that the rank difference is the only feature that separates the data well on its own. However, we can create features that capture the differences between the home and away teams and check whether they separate the data well. You can also use your own analytical skills and come up with a different approach.
Based on the violin plots, the project has chosen the following features.
- rank_dif
- goals_dif
- goals_dif_l5
- goals_suf_dif
- goals_suf_dif_l5
The project also produced various other plots to analyze the data for the FIFA World Cup 2022.
Based on this analysis, the project settled on the following final features.
- rank_dif
- goals_dif
- goals_dif_l5
- goals_suf_dif
- goals_suf_dif_l5
- dif_rank_agst
- dif_rank_agst_l5
- goals_per_ranking_dif
- dif_points_rank
- dif_points_rank_l5
- is_friendly
Building a Machine Learning model for FIFA world cup 2022
Now it is time to train the model on the data. The very first step is to split the dataset into inputs and outputs.
Next, the dataset is split into training and testing parts.
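Both splits can be sketched with scikit-learn as follows. The data is random, the `target` column name and the 80/20 ratio are assumptions, and only a few of the final features are used to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# A subset of the final feature names; the values below are random.
FEATURES = ["rank_dif", "goals_dif", "goals_dif_l5", "is_friendly"]

rng = np.random.default_rng(42)
n_games = 100
model_db = pd.DataFrame(rng.normal(size=(n_games, 3)), columns=FEATURES[:3])
model_db["is_friendly"] = rng.integers(0, 2, size=n_games)
model_db["target"] = rng.integers(0, 2, size=n_games)  # assumed label column

# Inputs (X) and output (y).
X = model_db[FEATURES]
y = model_db["target"]

# Hold out 20% of the games for testing (ratio chosen for illustration).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)
```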
This project uses a random forest classifier and a gradient boosting classifier to make predictions. You can use other machine learning algorithms as well (CatBoost and LightGBM, for example, may give better results).
First, the gradient boosting model is trained.
Now it is time to train a random forest classifier.
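Training the two models with scikit-learn could look like the sketch below. The data is synthetic and the hyperparameters are placeholder values, not the ones tuned in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the match features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Placeholder hyperparameters, chosen only for illustration.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=5)
gb.fit(X_train, y_train)

rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=1)
rf.fit(X_train, y_train)

# Probability of the positive class for each held-out sample.
gb_proba = gb.predict_proba(X_test)[:, 1]
rf_proba = rf.predict_proba(X_test)[:, 1]
print(gb_proba[:3], rf_proba[:3])
```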
Evaluating the models to predict world cup 2022
Once the training is complete, it is time to evaluate the models.
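A simple way to compare classifiers is to score them on the held-out data. The sketch below uses accuracy and ROC AUC as example metrics (an assumption; the project may report others) on synthetic data, so its numbers say nothing about the actual football models.

```python
from typing import Dict

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test) -> Dict[str, float]:
    """Score a fitted binary classifier on held-out data."""
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, proba),
    }


# Synthetic data; the scores have no bearing on the real project.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=5),
    "random_forest": RandomForestClassifier(max_depth=5, random_state=1),
}
scores = {
    name: evaluate(model.fit(X_train, y_train), X_test, y_test)
    for name, model in models.items()
}
for name, metrics in scores.items():
    print(name, metrics)
```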
This shows that the random forest classifier has performed better than gradient boosting.
Predicting the winner of the world cup 2022 using ML
Now it is time to create the groups of teams.
Next, the project defines a couple of helper functions for the simulation.
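To give an idea of what such helpers can look like, here is a sketch of a group-stage simulator. Everything in it is hypothetical: `win_prob` stands in for the averaged model prediction, the ratings are made up, and calling a game a draw when the probability is within `draw_margin` of 0.5 is an assumed rule.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Made-up ratings standing in for the model's view of each team.
RATINGS: Dict[str, float] = {
    "Team A": 0.90,
    "Team B": 0.55,
    "Team C": 0.60,
    "Team D": 0.30,
}


def win_prob(team_a: str, team_b: str) -> float:
    """Hypothetical probability that team_a beats team_b."""
    return RATINGS[team_a] / (RATINGS[team_a] + RATINGS[team_b])


def simulate_group(
    teams: List[str],
    prob: Callable[[str, str], float],
    draw_margin: float = 0.05,
) -> List[Tuple[str, int]]:
    """Play every pairing once and return the table sorted by points."""
    table = {team: 0 for team in teams}
    for a, b in combinations(teams, 2):
        p_a = prob(a, b)
        if abs(p_a - 0.5) <= draw_margin:  # too close to call: draw
            table[a] += 1
            table[b] += 1
        elif p_a > 0.5:
            table[a] += 3
        else:
            table[b] += 3
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


standings = simulate_group(list(RATINGS), win_prob)
print(standings)
```

The top two teams in the returned table would advance to the knockout stage.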
Now come the predictions for the group stage:
Then come the predictions for the playoffs.
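The playoff rounds can be simulated by repeatedly pairing neighbours in a bracket and advancing the more likely winner. The sketch below is a hypothetical illustration with made-up ratings; `win_prob` again stands in for the averaged model prediction.

```python
from typing import Callable, Dict, List

# Made-up ratings standing in for the model's view of each team.
RATINGS: Dict[str, float] = {
    "Team A": 0.9,
    "Team B": 0.4,
    "Team C": 0.7,
    "Team D": 0.6,
}


def win_prob(team_a: str, team_b: str) -> float:
    """Hypothetical probability that team_a beats team_b."""
    return RATINGS[team_a] / (RATINGS[team_a] + RATINGS[team_b])


def play_knockout(
    bracket: List[str], prob: Callable[[str, str], float]
) -> List[List[str]]:
    """Return the list of teams alive in each round, ending with the winner."""
    n = len(bracket)
    if n < 1 or n & (n - 1):
        raise ValueError("bracket size must be a power of two")
    rounds = [list(bracket)]
    while len(rounds[-1]) > 1:
        current = rounds[-1]
        # Pair neighbours (0 vs 1, 2 vs 3, ...) and keep the favourite.
        winners = [
            a if prob(a, b) >= 0.5 else b
            for a, b in zip(current[0::2], current[1::2])
        ]
        rounds.append(winners)
    return rounds


rounds = play_knockout(["Team A", "Team B", "Team C", "Team D"], win_prob)
print(rounds)
```

The last entry of `rounds` holds the predicted champion, and the one before it holds the two finalists.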
This shows that the final will be between:
Portugal and Brazil
And the winner will be Brazil with a 52% winning probability.
One of the best things that Mr. Sergio Pessoa has done is to visualize the games as well.
Congratulations to Brazil. The machine learning model predicts that it will become champion for the 6th time.

Summary
Special thanks to Mr. Sergio Pessoa for working so hard to come up with the predictive models. You can access the original code from the link.