XGBoost for World Cup Winner prediction¶

This notebook demonstrates how to use XGBoost to predict the winner of the 2018 FIFA World Cup. The data used for this notebook is from this Kaggle competition.

XGBoost¶

XGBoost (for eXtreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. The primary principle behind boosting is to focus on the hard-to-classify instances and get them right.
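
The idea of adding models that correct earlier errors can be illustrated with a toy loop. This is only a sketch of plain gradient boosting for squared error with one-split trees (stumps), not XGBoost's actual algorithm; the data, `fit_stump`, `predict_stump` and the learning rate are all made up for the illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())   # start from a constant model
errors = []
for _ in range(20):
    residual = y - prediction            # what the current ensemble still gets wrong
    stump = fit_stump(x, residual)       # the new model is fitted to those errors
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(float(np.mean((y - prediction) ** 2)))

print(errors[0], errors[-1])
```

The training error shrinks from round to round because each stump is fitted to the residuals left by the previous ones.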

General idea of XGBoost¶

Objective Function¶

The objective function in XGBoost is given by: $$Obj(\theta)=L(\theta)+ \Omega(\theta)$$

Where:

$L(\theta)$ is the training loss, which measures how well the model fits the data.
$\Omega(\theta)$ is the regularization term, which penalizes model complexity.
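
As a rough illustration, the two terms can be computed by hand. For a single tree, the XGBoost documentation writes the complexity penalty as $\gamma T + \frac{1}{2}\lambda \sum_j w_j^2$, where $T$ is the number of leaves and $w_j$ are the leaf weights. The sketch below assumes a logistic loss; the function names and the numbers are hypothetical, chosen only to make the arithmetic visible.

```python
import math

def log_loss(y_true, y_prob):
    """Summed logistic loss over all samples."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

def tree_complexity(leaf_weights, gamma=1.0, lam=1.0):
    """gamma * (number of leaves) + 0.5 * lambda * (sum of squared leaf weights)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_prob, trees, gamma=1.0, lam=1.0):
    """Loss term plus the regularization term summed over all trees."""
    return log_loss(y_true, y_prob) + sum(tree_complexity(t, gamma, lam) for t in trees)

# One tree with two leaves, weights 1.0 and -2.0 (made-up values)
penalty = tree_complexity([1.0, -2.0])
total = objective([1, 0], [0.5, 0.5], trees=[[1.0, -2.0]])
print(penalty, total)
```

Raising `gamma` or `lam` increases the penalty for the same tree, which is how the regularization term pushes the model toward smaller trees and smaller leaf weights.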

Important Features of XGBoost¶

Handling Missing Data¶

XGBoost has an in-built routine to handle missing values. During training, the model learns the best direction to take for missing values.
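
One way to picture "learning the best direction" is the sketch below: for a given split, missing values are sent to the left and then to the right, and the direction with the larger gain is kept. The gain formula follows the structure score used in the XGBoost documentation (without the constant $\gamma$ term); the helper names and the toy gradients are assumptions for illustration, not XGBoost's internal code.

```python
import math

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Loss reduction of a split, from summed gradients (g) and hessians (h)."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent)

def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Try missing values on each side of the split and keep the better side."""
    sums = {"left": [0.0, 0.0], "right": [0.0, 0.0], "missing": [0.0, 0.0]}
    for v, g, h in zip(values, grads, hess):
        if math.isnan(v):
            side = "missing"
        elif v < threshold:
            side = "left"
        else:
            side = "right"
        sums[side][0] += g
        sums[side][1] += h
    (gl, hl), (gr, hr), (gm, hm) = sums["left"], sums["right"], sums["missing"]
    gain_if_left = split_gain(gl + gm, hl + hm, gr, hr, lam)
    gain_if_right = split_gain(gl, hl, gr + gm, hr + hm, lam)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right

# Toy data: the two missing rows have gradients that look like the right-hand rows
values = [1.0, 2.0, float("nan"), 8.0, 9.0, float("nan")]
grads = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 6
direction, gain = best_default_direction(values, grads, hess, threshold=5.0)
print(direction, round(gain, 4))
```

Here the missing rows resemble the right-hand rows, so sending them right gives the larger gain.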

Parallel Processing¶

XGBoost parallelizes the search for splits within each tree across CPU cores and also supports distributed training, which often makes it faster than single-threaded implementations of gradient boosting.

Tree Pruning¶

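Many gradient boosting implementations stop splitting a node as soon as a split brings no improvement. XGBoost instead grows each tree up to the max_depth parameter and then prunes it backward, removing splits whose loss reduction falls below the gamma (min_split_loss) threshold.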

Built-in Cross-Validation¶

XGBoost includes a cross-validation routine (xgboost.cv) that reports the evaluation metric at every boosting round, which helps in selecting the best number of boosting rounds.

Regularization¶

XGBoost includes L1 (Lasso-style) and L2 (Ridge-style) regularization on the leaf weights, set through reg_alpha and reg_lambda, which helps prevent overfitting.

Hyperparameters¶

Some of the important hyperparameters in XGBoost include:

learning_rate or eta: Step size shrinkage used to prevent overfitting.
max_depth: Maximum depth of the tree.
subsample: Fraction of samples used per tree.
colsample_bytree: Fraction of features used per tree.
n_estimators: Number of boosting rounds.
objective: Specifies the learning task and the corresponding objective function.

Advantages and Disadvantages¶

Advantages¶

Computationally efficient with parallel processing.
Regularization helps avoid overfitting.
Handles missing values.
Can be used for classification, regression, ranking, and user-defined prediction tasks.

Disadvantages¶

Might require careful tuning of parameters.
Can be memory-intensive.
Less interpretable than simpler models.

XGBoost in Python¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

matches = pd.read_csv("./Datasets/results.csv")
rankings = pd.read_csv("./Datasets/fifa_ranking.csv")
world_cup_matches = pd.read_csv("./Datasets/World Cup 2018 Dataset.csv")
players = pd.read_csv("./Datasets/FullData.csv")
all_time_stats = pd.read_csv("./Datasets/all_time_fifa_statistics.csv")

We don't need all the data in every file, and some country names differ depending on the year (for example, Germany was counted as two countries before the fall of the Berlin Wall in 1989). So we start with a first phase of cleaning.

In [3]:

rankings = rankings.loc[:,['rank',
                           'country_full',
                           'country_abrv',
                           'cur_year_avg_weighted',
                           'rank_date',
                           'two_year_ago_weighted',
                           'three_year_ago_weighted']]
rankings = rankings.replace({"IR Iran": "Iran"})
rankings['weighted_points'] =  rankings['cur_year_avg_weighted'] + rankings['two_year_ago_weighted'] + rankings['three_year_ago_weighted']
rankings["rank_date"] = pd.to_datetime(rankings["rank_date"])

In [11]:

rankings.describe()

Out[11]:

	rank	cur_year_avg_weighted	two_year_ago_weighted	three_year_ago_weighted	weighted_points
count	57793.000000	57793.000000	57793.000000	57793.000000	57793.000000
mean	101.628086	61.798602	17.933277	11.834811	91.566691
std	58.618424	138.014883	40.888849	27.106675	197.891852
min	1.000000	0.000000	0.000000	0.000000	0.000000
25%	51.000000	0.000000	0.000000	0.000000	0.000000
50%	101.000000	0.000000	0.000000	0.000000	0.000000
75%	152.000000	32.250000	6.450000	4.250000	64.810000
max	209.000000	1158.660000	347.910000	240.150000	1511.500000

In [12]:

matches = matches.replace({"Germany DR": "Germany", "China": "China PR"})
matches["date"] = pd.to_datetime(matches["date"])

world_cup_matches = world_cup_matches.loc[:, ['Team',
                                              'Group',
                                              'First match \nagainst',
                                              'Second match\n against',
                                              'Third match\n against']]
world_cup_matches = world_cup_matches.dropna(how='all')
world_cup_matches = world_cup_matches.replace({"IRAN": "Iran",
                               "Costarica": "Costa Rica",
                               "Porugal": "Portugal",
                               "Columbia": "Colombia",
                               "Korea" : "Korea Republic"})
world_cup_matches = world_cup_matches.set_index('Team')
world_cup_matches.head()

Out[12]:

	Group	First match against	Second match against	Third match against
Team
Russia	A	Saudi Arabia	Egypt	Uruguay
Saudi Arabia	A	Russia	Uruguay	Egypt
Egypt	A	Uruguay	Russia	Saudi Arabia
Uruguay	A	Egypt	Saudi Arabia	Russia
Portugal	B	Spain	Morocco	Iran

In [13]:

matches.head()

Out[13]:

	date	home_team	away_team	home_score	away_score	tournament	city	country	neutral
0	1872-11-30	Scotland	England	0	0	Friendly	Glasgow	Scotland	False
1	1873-03-08	England	Scotland	4	2	Friendly	London	England	False
2	1874-03-07	Scotland	England	2	1	Friendly	Glasgow	Scotland	False
3	1875-03-06	England	Scotland	2	2	Friendly	London	England	False
4	1876-03-04	Scotland	England	3	0	Friendly	Glasgow	Scotland	False

Because we have a large amount of data and very little of it is missing, we simply drop the rows with missing data. Let's finish importing the player stats.

In [15]:

players = players.loc[:, ["Nationality",
                            "Rating",
                            "Age",
                            "Weak_foot",
                            "Skill_Moves",
                            "Ball_Control",
                            "Dribbling",
                            "Marking",
                            "Sliding_Tackle",
                            "Standing_Tackle",
                            "Aggression",
                            "Reactions",
                            "Attacking_Position",
                            "Interceptions",
                            "Vision",
                            "Composure",
                            "Crossing",
                             "Short_Pass",
                             "Long_Pass",
                             "Acceleration",
                             "Speed",
                             "Stamina",
                             "Strength",
                             "Balance",
                             "Agility",
                             "Jumping",
                             "Heading",
                             "Shot_Power",
                             "Finishing",
                             "Long_Shots",
                             "Curve",
                             "Freekick_Accuracy",
                             "Penalties",
                             "Volleys"]]
players.describe()

Out[15]:

	Rating	Age	Weak_foot	Skill_Moves	Ball_Control	Dribbling	Marking	Sliding_Tackle	Standing_Tackle	Aggression	...	Agility	Jumping	Heading	Shot_Power	Finishing	Long_Shots	Curve	Freekick_Accuracy	Penalties	Volleys
count	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	...	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000	17588.000000
mean	66.166193	25.460314	2.934103	2.303161	57.972766	54.802877	44.230327	45.565499	47.441096	55.920173	...	63.206732	64.918524	52.393109	55.581192	45.157607	47.403173	47.181146	43.383443	49.165738	43.275586
std	7.083012	4.680217	0.655927	0.746156	16.834779	18.913857	21.561703	21.515179	21.827815	17.445464	...	14.618163	11.430807	17.473703	17.600155	19.374428	19.211887	18.464396	17.701903	15.871735	17.710839
min	45.000000	17.000000	1.000000	1.000000	5.000000	4.000000	3.000000	5.000000	3.000000	2.000000	...	11.000000	15.000000	4.000000	3.000000	2.000000	4.000000	6.000000	4.000000	7.000000	3.000000
25%	62.000000	22.000000	3.000000	2.000000	53.000000	47.000000	22.000000	23.000000	26.000000	44.000000	...	55.000000	58.000000	45.000000	45.000000	29.000000	32.000000	34.000000	31.000000	39.000000	30.000000
50%	66.000000	25.000000	3.000000	2.000000	63.000000	60.000000	48.000000	51.000000	54.000000	59.000000	...	65.000000	65.000000	56.000000	59.000000	48.000000	52.000000	48.000000	42.000000	50.000000	44.000000
75%	71.000000	29.000000	3.000000	3.000000	69.000000	68.000000	64.000000	64.000000	66.000000	70.000000	...	74.000000	73.000000	65.000000	69.000000	61.000000	63.000000	62.000000	57.000000	61.000000	57.000000
max	94.000000	47.000000	5.000000	5.000000	95.000000	97.000000	92.000000	95.000000	92.000000	96.000000	...	96.000000	95.000000	94.000000	93.000000	95.000000	91.000000	92.000000	93.000000	96.000000	93.000000

8 rows × 33 columns

In [16]:

players = players.dropna(how="all")
grouped = players.groupby(["Nationality"], as_index = False)
players = grouped.aggregate(np.mean)

The last part of this code averages the player statistics for each national team, so that we can then use them when comparing countries.
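
A toy version of these two steps may make their behaviour clearer. Note that `dropna(how="all")` only removes rows in which every value is missing, while `dropna(how="any")` removes rows with at least one missing value. The small table below is invented for the illustration and is not taken from FullData.csv.

```python
import numpy as np
import pandas as pd

# Made-up players; the last row is entirely empty
toy_players = pd.DataFrame({
    "Nationality": ["France", "France", "Brazil", None],
    "Rating": [85.0, 81.0, 88.0, np.nan],
    "Age": [24.0, 30.0, np.nan, np.nan],
})

kept_all = toy_players.dropna(how="all")   # drops only the fully empty row
kept_any = toy_players.dropna(how="any")   # also drops the Brazil row (missing Age)

# One row per nationality, each statistic averaged over that country's players
team_means = kept_all.groupby("Nationality", as_index=False).mean()
print(len(kept_all), len(kept_any))
print(team_means)
```

Which of the two `how` options is appropriate depends on whether partially filled rows should still contribute to the team averages.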

Merge data¶

Our data is now imported but we will need to merge it so that our algorithm can learn from the different statistics. It will have to be done in several steps.

First, the ranking dates and the match dates do not line up: rankings are published roughly once a month, while every match has an exact date. We therefore expand the rankings to one row per country per day, carrying the latest ranking forward, so that the two tables can be merged on date and team.

Once this is done, we do a first merge.

In [20]:

rankings.head()

Out[20]:

	rank	country_full	country_abrv	rank_date
0	1	Germany	GER	1993-08-08
1	2	Italy	ITA	1993-08-08
2	3	Switzerland	SUI	1993-08-08
3	4	Sweden	SWE	1993-08-08
4	5	Argentina	ARG	1993-08-08

In [21]:

# One row per country per day; forward-fill carries the latest published ranking forward
rankings = rankings.set_index(['rank_date'])\
            .groupby(['country_full'], group_keys=False)\
            .resample('D').first()\
            .ffill()\
            .reset_index()


rankings.head()

Out[21]:

	rank_date	rank	country_full	country_abrv
0	2003-01-15	204.0	Afghanistan	AFG
1	2003-01-16	204.0	Afghanistan	AFG
2	2003-01-17	204.0	Afghanistan	AFG
3	2003-01-18	204.0	Afghanistan	AFG
4	2003-01-19	204.0	Afghanistan	AFG
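
The daily expansion produces a very large table. An alternative that this notebook does not use is `pandas.merge_asof`, which attaches to each match the most recent ranking published on or before the match date. The sketch below uses invented ranks and dates; `toy_rankings` and `toy_matches` are hypothetical names.

```python
import pandas as pd

# Made-up monthly rankings for one country
toy_rankings = pd.DataFrame({
    "rank_date": pd.to_datetime(["2018-05-17", "2018-06-07"]),
    "country_full": ["France", "France"],
    "rank": [7, 6],
})

# Made-up matches with exact dates
toy_matches = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-21"]),
    "home_team": ["France", "France"],
})

# Both frames must be sorted on their date key before an as-of merge
merged = pd.merge_asof(
    toy_matches.sort_values("date"),
    toy_rankings.sort_values("rank_date"),
    left_on="date",
    right_on="rank_date",
    left_by="home_team",
    right_by="country_full",
    direction="backward",   # latest ranking on or before the match date
)
print(merged[["date", "home_team", "rank_date", "rank"]])
```

The match on 2018-06-01 picks up the May ranking and the match on 2018-06-21 picks up the June one, without materialising one row per day.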

In [35]:

rankings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1830658 entries, 0 to 1830657
Data columns (total 8 columns):
rank_date                  datetime64[ns]
rank                       float64
country_full               object
country_abrv               object
cur_year_avg_weighted      float64
two_year_ago_weighted      float64
three_year_ago_weighted    float64
weighted_points            float64
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 111.7+ MB

In [36]:

matches = matches.merge(rankings,
                        left_on=['date', 'home_team'],
                        right_on=['rank_date', 'country_full'])
matches.head()
matches = matches.merge(rankings,
                        left_on=['date', 'away_team'],
                        right_on=['rank_date', 'country_full'],
                        suffixes=('_home', '_away'))

In [39]:

matches = matches.merge(players,
                       left_on =["home_team"],
                       right_on = ["Nationality"])

matches = matches.merge(players,
                        left_on = ['away_team'],
                        right_on = ["Nationality"],
                        suffixes = ('_home', "_away"))

matches = matches.merge(all_time_stats,
                       left_on = ["home_team"],
                       right_on = ["Country"])

matches = matches.merge(all_time_stats,
                       left_on = ["away_team"],
                        right_on = ["Country"],
                       suffixes = ("_home", "_away"))

In [40]:

matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6057 entries, 0 to 6056
Columns: 119 entries, date to Best_finish_away
dtypes: bool(1), datetime64[ns](3), float64(78), int64(22), object(15)
memory usage: 5.5+ MB

How will we compare the teams that face each other? A simple way is to take the difference between the two teams for each statistic: for example, the difference in FIFA ranking position, the difference in average player age, and so on. This is a bit tedious because every feature is written out by hand, but here it is:

In [41]:

matches['rank_difference'] = matches['rank_home'] - matches['rank_away']
matches['average_rank'] = (matches['rank_home'] + matches['rank_away'])/2
matches['score_difference'] = matches['home_score'] - matches['away_score']
matches["point_difference"] = matches['weighted_points_home'] - matches['weighted_points_away']
matches["rating_difference"] = matches["Rating_home"] - matches["Rating_away"]
matches["Age_difference"] = matches["Age_home"] - matches["Age_away"]
matches["Weak_foot_difference"] = matches["Weak_foot_home"] - matches["Weak_foot_away"]
matches["Skill_Moves_difference"] = matches["Skill_Moves_home"] - matches["Skill_Moves_away"]
matches["Ball_Control_difference"] = matches["Ball_Control_home"] - matches["Ball_Control_away"]
matches["Dribbling_difference"] = matches["Dribbling_home"] - matches["Dribbling_away"]
matches["Marking_difference"] = matches["Marking_home"] - matches["Marking_away"]
matches["Sliding_Tackle_difference"] = matches["Sliding_Tackle_home"] - matches["Sliding_Tackle_away"]
matches["Standing_Tackle_difference"] = matches["Standing_Tackle_home"] - matches["Standing_Tackle_away"]
matches["Aggression_difference"] = matches["Aggression_home"] - matches["Aggression_away"]
matches["Reactions_difference"] = matches["Reactions_home"] - matches["Reactions_away"]
matches["Attacking_Position_difference"] = matches["Attacking_Position_home"] - matches["Attacking_Position_away"]
matches["Interceptions_difference"] = matches["Interceptions_home"] - matches["Interceptions_away"]
matches["Vision_difference"] = matches["Vision_home"] - matches["Vision_away"]
matches["Composure_difference"] = matches["Composure_home"] - matches["Composure_away"]
matches["Crossing_difference"] = matches["Crossing_home"] - matches["Crossing_away"]
matches["Short_Pass_difference"] = matches["Short_Pass_home"] - matches["Short_Pass_away"]
matches["Long_Pass_difference"] = matches["Long_Pass_home"] - matches["Long_Pass_away"]
matches["Stamina_difference"] = matches["Stamina_home"] - matches["Stamina_away"]
matches["Penalties_difference"] = matches["Penalties_home"] - matches["Penalties_away"]
matches["Acceleration_difference"] = matches["Acceleration_home"] - matches["Acceleration_away"]
matches["Speed_difference"] = matches["Speed_home"] - matches["Speed_away"]
matches["Strength_difference"] = matches["Strength_home"] - matches["Strength_away"]
matches["Balance_difference"] = matches["Balance_home"] - matches["Balance_away"]
matches["Agility_difference"] = matches["Agility_home"] - matches["Agility_away"]
matches["Jumping_difference"] = matches["Jumping_home"] - matches["Jumping_away"]
matches["Heading_difference"] = matches["Heading_home"] - matches["Heading_away"]
matches["Shot_Power_difference"] = matches["Shot_Power_home"] - matches["Shot_Power_away"]
matches["Finishing_difference"] = matches["Finishing_home"] - matches["Finishing_away"]
matches["Long_Shots_difference"] = matches["Long_Shots_home"] - matches["Long_Shots_away"]
matches["Curve_difference"] = matches["Curve_home"] - matches["Curve_away"]
matches["Freekick_Accuracy_difference"] = matches["Freekick_Accuracy_home"] - matches["Freekick_Accuracy_away"]
matches["Volleys_difference"] = matches["Volleys_home"] - matches["Volleys_away"]
matches["Part's_difference"] = matches["Part's_home"] - matches["Part's_away"]
matches["Played_difference"] = matches["Played_home"] - matches["Played_away"]
matches["Won_difference"] = matches["Won_home"] - matches["Won_away"]
matches["Drawn_difference"] = matches["Drawn_home"] - matches["Drawn_away"]
matches["Lost_difference"] = matches["Lost_home"] - matches["Lost_away"]
matches["Goal_Difference_difference"] = matches["Goal Difference_home"] - matches["Goal Difference_away"]
matches["Points_difference"] = matches["Points_home"] - matches["Points_away"]
matches["Average_points_difference"] = matches["Average_points_home"] - matches["Average_points_away"]
matches['is_won'] = matches['score_difference'] > 0 # take draw as lost
matches['is_stake'] = matches['tournament'] != 'Friendly'

Handling each variable individually in the following steps is also somewhat long-winded. There are certainly cleaner ways to do this but, due to time constraints, we proceeded this way.
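
One such way, sketched here on a toy table, is to loop over the statistic names instead of writing each line by hand. `add_difference_features` is a hypothetical helper, not part of the notebook, and it assumes every statistic has matching `_home` and `_away` columns. A few features above (`rank_difference`, `point_difference`, `rating_difference`, `Goal_Difference_difference`) use names that do not follow this pattern and would still need separate handling.

```python
import pandas as pd

def add_difference_features(frame, stats):
    """Add '<stat>_difference' = '<stat>_home' - '<stat>_away' for each stat."""
    out = frame.copy()
    for stat in stats:
        out[f"{stat}_difference"] = out[f"{stat}_home"] - out[f"{stat}_away"]
    return out

# Made-up values, for illustration only
toy = pd.DataFrame({
    "Age_home": [26.0, 24.5], "Age_away": [25.0, 27.0],
    "Speed_home": [70.0, 65.0], "Speed_away": [68.0, 71.0],
})
toy = add_difference_features(toy, ["Age", "Speed"])
print(toy[["Age_difference", "Speed_difference"]])
```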

Building the model¶

In [42]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

X = matches.loc[:,['average_rank',
                    'rank_difference',
                    "point_difference",
                    'is_stake',
                    "rating_difference",
                     "Age_difference",
                    "Weak_foot_difference",
                     "Skill_Moves_difference",
                    "Ball_Control_difference",
                     "Dribbling_difference",
                     "Marking_difference",
                     "Sliding_Tackle_difference",
                     "Standing_Tackle_difference",
                     "Aggression_difference",
                     "Reactions_difference",
                     "Interceptions_difference",
                     "Vision_difference",
                   "Crossing_difference",
                     "Short_Pass_difference",
                     "Long_Pass_difference",
                    "Stamina_difference",
                     "Penalties_difference",
                     "Acceleration_difference",                   
                     "Speed_difference",
                    "Strength_difference",
                    "Balance_difference",
                     "Agility_difference",
                     "Jumping_difference",
                    "Heading_difference",
                     "Shot_Power_difference",
                    "Finishing_difference",
                   "Long_Shots_difference",
                     "Curve_difference",
                    "Freekick_Accuracy_difference",
                     "Volleys_difference",
                     "Won_difference",
                     "Drawn_difference",
                     "Lost_difference",
                     "Average_points_difference",
                  ]]
y = matches['is_won']

In [43]:

y = pd.get_dummies(y, drop_first = True)
y.head()

Out[43]:

	True
0	1
1	1
2	1
3	0
4	0

In [44]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [45]:

from sklearn.ensemble import RandomForestClassifier

pre_classifier = RandomForestClassifier()
pre_classifier.fit(X_train, y_train)

/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  after removing the cwd from sys.path.

Out[45]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
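
The DataConversionWarning above appears because `pd.get_dummies` returns a one-column DataFrame, while scikit-learn expects a one-dimensional target. A possible way to avoid it, shown on a toy Series rather than on the notebook's `matches` table, is to cast the boolean column to integers directly.

```python
import pandas as pd

# Toy stand-in for matches['is_won']
is_won = pd.Series([True, True, False], name="is_won")

y_frame = pd.get_dummies(is_won, drop_first=True)  # 2-D: one column named True
y_flat = is_won.astype(int)                        # 1-D: values 1 / 0

print(y_frame.shape, y_flat.shape)
```

Calling `.values.ravel()` on the one-column DataFrame, as the warning message suggests, would have the same effect.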

In [46]:

pre_classifier.score(X_test, y_test)

Out[46]:

0.6023102310231023

We add the Random Forest's predictions to the dataset as an extra feature, then train XGBoost on the augmented data.

In [47]:

# Append the forest's prediction for every row of X as a new feature column
X_new = pd.concat([X, pd.DataFrame({"prediction_from_RF":pre_classifier.predict(X)})], axis=1)
X_new.head()

Out[47]:

	average_rank	rank_difference	is_stake	rating_difference	Age_difference	Weak_foot_difference	Skill_Moves_difference	Ball_Control_difference	Dribbling_difference	...	Finishing_difference	Long_Shots_difference	Curve_difference	Freekick_Accuracy_difference	Volleys_difference	Won_difference	Drawn_difference	Lost_difference	Average_points_difference	prediction_from_RF
0	40.5	37.0	True	-4.107843	1.605882	-0.234641	-0.264052	-7.986928	-6.426144	...	-6.547059	-8.129412	-10.150327	-7.005882	-9.759477	-22	-11	-14	-1.33	1
1	42.5	-17.0	True	-4.107843	1.605882	-0.234641	-0.264052	-7.986928	-6.426144	...	-6.547059	-8.129412	-10.150327	-7.005882	-9.759477	-22	-11	-14	-1.33	1
2	31.0	-26.0	True	-4.107843	1.605882	-0.234641	-0.264052	-7.986928	-6.426144	...	-6.547059	-8.129412	-10.150327	-7.005882	-9.759477	-22	-11	-14	-1.33	1
3	51.0	30.0	True	-4.107843	1.605882	-0.234641	-0.264052	-7.986928	-6.426144	...	-6.547059	-8.129412	-10.150327	-7.005882	-9.759477	-22	-11	-14	-1.33	0
4	53.0	26.0	True	-4.107843	1.605882	-0.234641	-0.264052	-7.986928	-6.426144	...	-6.547059	-8.129412	-10.150327	-7.005882	-9.759477	-22	-11	-14	-1.33	0

5 rows × 40 columns
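
One caveat about this feature: `pre_classifier` was fitted on `X_train`, which is 80% of `X`, so most values of `prediction_from_RF` are predictions on rows the forest has already seen. A forest tends to reproduce its training labels almost exactly, so the new column can leak the target into any model trained on it, and a fresh random split does not separate those rows out. A common alternative is to use out-of-fold predictions, where each row is predicted by a forest that never saw it. The sketch below illustrates the difference on synthetic data. The dataset, the `_demo` names and the fold count are assumptions for illustration, and for simplicity the in-sample forest is fitted on all rows rather than on 80% of them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic, deliberately noisy stand-in for the match features (hypothetical data)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

rf = RandomForestClassifier(n_estimators=10, random_state=0)

# In-sample: the forest predicts the same rows it was trained on
in_sample = rf.fit(X_demo, y_demo).predict(X_demo)

# Out-of-fold: each row is predicted by a forest fitted on the other folds only
out_of_fold = cross_val_predict(rf, X_demo, y_demo, cv=5)

in_sample_agreement = (in_sample == y_demo).mean()
oof_agreement = (out_of_fold == y_demo).mean()
print(f"agreement with labels, in-sample:   {in_sample_agreement:.2f}")
print(f"agreement with labels, out-of-fold: {oof_agreement:.2f}")

# Leakage-free version of the extra feature
X_demo_new = X_demo.assign(prediction_from_RF=out_of_fold)
```

The in-sample predictions agree with the labels far more often than the out-of-fold ones. That gap is the part of the feature that comes from memorised labels rather than from a signal that generalises.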

In [48]:

from sklearn.model_selection import train_test_split
# Draw a fresh random 80/20 split of the augmented feature matrix
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.2)

In [49]:

from xgboost import XGBClassifier
# Fit XGBoost with its default hyperparameters on the augmented features
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:219: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:252: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Out[49]:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [50]:

# Mean accuracy of XGBoost on the held-out 20% of the augmented data
classifier.score(X_test, y_test)

Out[50]:

0.8902640264026402

Conclusion¶

XGBoost is a powerful, flexible, and efficient machine learning algorithm that has become a staple of winning solutions in machine learning competitions and is widely used in industry. Its handling of missing data, its built-in cross-validation, and its efficient computation (thanks to parallel processing) set it apart from other boosting algorithms. However, it can be memory-intensive and often requires careful parameter tuning. In this notebook, we have seen how to use XGBoost to predict the winner of the 2018 FIFA World Cup.

Keep in mind that predicting the winner of the World Cup is a very difficult task, first of all because the result does not depend on statistics alone. Many other factors can influence the outcome of a match: the weather, the referee, the players' state of mind and motivation, injuries, suspensions, the coaches' tactical choices, the crowd, fatigue, luck, individual and collective performances, experience, strategy, training, physical condition, team cohesion, team spirit, team play and so many other things 🤓