XGBoost for World Cup Winner prediction¶
This notebook demonstrates how to use XGBoost to predict the winner of the 2018 FIFA World Cup. The data used for this notebook is from this Kaggle competition.
XGBoost¶
XGBoost (for eXtreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately.
Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. The primary principle behind boosting is to focus on the hard-to-classify instances and get them right.
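To make the "add models that correct earlier errors" idea concrete, here is a minimal sketch of boosting with squared-error loss, where each small tree is fit to the residuals of the current ensemble. It assumes scikit-learn's DecisionTreeRegressor and synthetic data, and it is a simplified illustration rather than XGBoost's actual algorithm (which also uses second-order gradient information and regularization).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())   # start from a constant model
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction           # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # shrunken correction
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

Each round lowers the training error a little; the learning rate controls how much of each correction is kept.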
General idea of XGBoost¶
Objective Function¶
The objective function in XGBoost is given by: $$Obj(\theta)=L(\theta)+ \omega(\theta)$$
Where:
- $L(\theta)$ is the loss function.
- $\omega(\theta)$ is the regularization term.
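As an illustration of what the regularization term can look like, the XGBoost documentation describes the complexity penalty of a single tree with $T$ leaves and leaf weights $w_j$ as follows, where $\gamma$ and $\lambda$ correspond to the gamma and reg_lambda parameters:

$$\gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

The first term discourages trees with many leaves; the second shrinks the leaf weights toward zero.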
Important Features of XGBoost¶
Handling Missing Data¶
XGBoost has an in-built routine to handle missing values. During training, the model learns the best direction to take for missing values.
Parallel Processing¶
XGBoost parallelizes split finding within each tree across CPU cores and also supports distributed training, which typically makes it faster than single-threaded implementations of gradient boosting.
Tree Pruning¶
XGBoost grows each tree up to max_depth and then prunes it backward, removing splits whose gain falls below the gamma threshold. A greedy learner that stops at the first split with negative gain can miss a later split that would have paid off.
Built-in Cross-Validation¶
XGBoost provides a built-in cross-validation function (cv) that reports the cross-validated metric at every boosting round, which helps in selecting the number of boosting rounds.
Regularization¶
XGBoost includes L1 (Lasso-style, reg_alpha) and L2 (Ridge-style, reg_lambda) penalties on the leaf weights, which help prevent overfitting.
Hyperparameters¶
Some of the important hyperparameters in XGBoost include:
- learning_rate or eta: Step size shrinkage used to prevent overfitting.
- max_depth: Maximum depth of the tree.
- subsample: Fraction of samples used per tree.
- colsample_bytree: Fraction of features used per tree.
- n_estimators: Number of boosting rounds.
- objective: Specifies the learning task and the corresponding objective function.
Advantages and Disadvantages¶
Advantages¶
- Computationally efficient with parallel processing.
- Regularization helps avoid overfitting.
- Handles missing values.
- Can be used for classification, regression, ranking, and user-defined prediction tasks.
Disadvantages¶
- Might require careful tuning of parameters.
- Can be memory-intensive.
- Less interpretable than simpler models.
XGBoost in Python¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
matches = pd.read_csv("./Datasets/results.csv")
rankings = pd.read_csv("./Datasets/fifa_ranking.csv")
world_cup_matches = pd.read_csv("./Datasets/World Cup 2018 Dataset.csv")
players = pd.read_csv("./Datasets/FullData.csv")
all_time_stats = pd.read_csv("./Datasets/all_time_fifa_statistics.csv")
We don't need all the data in every file. Some country names also differ depending on the year (for example, Germany counted as two countries before the fall of the Berlin Wall in 1989), so we start with a first phase of cleaning.
rankings = rankings.loc[:,['rank',
'country_full',
'country_abrv',
'cur_year_avg_weighted',
'rank_date',
'two_year_ago_weighted',
'three_year_ago_weighted']]
rankings = rankings.replace({"IR Iran": "Iran"})
rankings['weighted_points'] = rankings['cur_year_avg_weighted'] + rankings['two_year_ago_weighted'] + rankings['three_year_ago_weighted']
rankings["rank_date"] = pd.to_datetime(rankings["rank_date"])
rankings.describe()
| | rank | cur_year_avg_weighted | two_year_ago_weighted | three_year_ago_weighted | weighted_points |
---|---|---|---|---|---|
count | 57793.000000 | 57793.000000 | 57793.000000 | 57793.000000 | 57793.000000 |
mean | 101.628086 | 61.798602 | 17.933277 | 11.834811 | 91.566691 |
std | 58.618424 | 138.014883 | 40.888849 | 27.106675 | 197.891852 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 51.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 101.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 152.000000 | 32.250000 | 6.450000 | 4.250000 | 64.810000 |
max | 209.000000 | 1158.660000 | 347.910000 | 240.150000 | 1511.500000 |
matches = matches.replace({"Germany DR": "Germany", "China": "China PR"})
matches["date"] = pd.to_datetime(matches["date"])
world_cup_matches = world_cup_matches.loc[:, ['Team',
'Group',
'First match \nagainst',
'Second match\n against',
'Third match\n against']]
world_cup_matches = world_cup_matches.dropna(how='all')
world_cup_matches = world_cup_matches.replace({"IRAN": "Iran",
"Costarica": "Costa Rica",
"Porugal": "Portugal",
"Columbia": "Colombia",
"Korea" : "Korea Republic"})
world_cup_matches = world_cup_matches.set_index('Team')
world_cup_matches.head()
Team | Group | First match against | Second match against | Third match against |
---|---|---|---|---|
Russia | A | Saudi Arabia | Egypt | Uruguay |
Saudi Arabia | A | Russia | Uruguay | Egypt |
Egypt | A | Uruguay | Russia | Saudi Arabia |
Uruguay | A | Egypt | Saudi Arabia | Russia |
Portugal | B | Spain | Morocco | Iran |
matches.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral |
---|---|---|---|---|---|---|---|---|---|
0 | 1872-11-30 | Scotland | England | 0 | 0 | Friendly | Glasgow | Scotland | False |
1 | 1873-03-08 | England | Scotland | 4 | 2 | Friendly | London | England | False |
2 | 1874-03-07 | Scotland | England | 2 | 1 | Friendly | Glasgow | Scotland | False |
3 | 1875-03-06 | England | Scotland | 2 | 2 | Friendly | London | England | False |
4 | 1876-03-04 | Scotland | England | 3 | 0 | Friendly | Glasgow | Scotland | False |
Because we have a lot of data and very little of it is missing, we simply drop the incomplete rows. Note that dropna(how='all'), as used here, only removes rows in which every value is missing. Let's finish by importing the player stats.
players = players.loc[:, ["Nationality",
"Rating",
"Age",
"Weak_foot",
"Skill_Moves",
"Ball_Control",
"Dribbling",
"Marking",
"Sliding_Tackle",
"Standing_Tackle",
"Aggression",
"Reactions",
"Attacking_Position",
"Interceptions",
"Vision",
"Composure",
"Crossing",
"Short_Pass",
"Long_Pass",
"Acceleration",
"Speed",
"Stamina",
"Strength",
"Balance",
"Agility",
"Jumping",
"Heading",
"Shot_Power",
"Finishing",
"Long_Shots",
"Curve",
"Freekick_Accuracy",
"Penalties",
"Volleys"]]
players.describe()
| | Rating | Age | Weak_foot | Skill_Moves | Ball_Control | Dribbling | Marking | Sliding_Tackle | Standing_Tackle | Aggression | ... | Agility | Jumping | Heading | Shot_Power | Finishing | Long_Shots | Curve | Freekick_Accuracy | Penalties | Volleys |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | ... | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 | 17588.000000 |
mean | 66.166193 | 25.460314 | 2.934103 | 2.303161 | 57.972766 | 54.802877 | 44.230327 | 45.565499 | 47.441096 | 55.920173 | ... | 63.206732 | 64.918524 | 52.393109 | 55.581192 | 45.157607 | 47.403173 | 47.181146 | 43.383443 | 49.165738 | 43.275586 |
std | 7.083012 | 4.680217 | 0.655927 | 0.746156 | 16.834779 | 18.913857 | 21.561703 | 21.515179 | 21.827815 | 17.445464 | ... | 14.618163 | 11.430807 | 17.473703 | 17.600155 | 19.374428 | 19.211887 | 18.464396 | 17.701903 | 15.871735 | 17.710839 |
min | 45.000000 | 17.000000 | 1.000000 | 1.000000 | 5.000000 | 4.000000 | 3.000000 | 5.000000 | 3.000000 | 2.000000 | ... | 11.000000 | 15.000000 | 4.000000 | 3.000000 | 2.000000 | 4.000000 | 6.000000 | 4.000000 | 7.000000 | 3.000000 |
25% | 62.000000 | 22.000000 | 3.000000 | 2.000000 | 53.000000 | 47.000000 | 22.000000 | 23.000000 | 26.000000 | 44.000000 | ... | 55.000000 | 58.000000 | 45.000000 | 45.000000 | 29.000000 | 32.000000 | 34.000000 | 31.000000 | 39.000000 | 30.000000 |
50% | 66.000000 | 25.000000 | 3.000000 | 2.000000 | 63.000000 | 60.000000 | 48.000000 | 51.000000 | 54.000000 | 59.000000 | ... | 65.000000 | 65.000000 | 56.000000 | 59.000000 | 48.000000 | 52.000000 | 48.000000 | 42.000000 | 50.000000 | 44.000000 |
75% | 71.000000 | 29.000000 | 3.000000 | 3.000000 | 69.000000 | 68.000000 | 64.000000 | 64.000000 | 66.000000 | 70.000000 | ... | 74.000000 | 73.000000 | 65.000000 | 69.000000 | 61.000000 | 63.000000 | 62.000000 | 57.000000 | 61.000000 | 57.000000 |
max | 94.000000 | 47.000000 | 5.000000 | 5.000000 | 95.000000 | 97.000000 | 92.000000 | 95.000000 | 92.000000 | 96.000000 | ... | 96.000000 | 95.000000 | 94.000000 | 93.000000 | 95.000000 | 91.000000 | 92.000000 | 93.000000 | 96.000000 | 93.000000 |
8 rows × 33 columns
# Drop rows that are entirely empty, then average every stat per nationality
players = players.dropna(how="all")
players = players.groupby("Nationality", as_index=False).mean()
The last part of this code computes the average of the player statistics for each national team, so that we can then use these averages when comparing countries.
Merge data¶
Our data is now imported but we will need to merge it so that our algorithm can learn from the different statistics. It will have to be done in several steps.
First, the ranking dates and the match dates do not line up exactly: the rankings are monthly, while each match has an exact day. We therefore need to build a day-by-day ranking so that the two tables can be merged on the date.
Once this is done, we do a first merge.
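The monthly-to-daily expansion described above can be illustrated on a toy table. This is a sketch under assumptions: the two countries, dates and ranks are invented, and a per-country loop is used here for readability instead of a single chained expression.

```python
import pandas as pd

# Toy rankings with gaps between publication dates (invented values)
monthly = pd.DataFrame({
    "rank_date": pd.to_datetime(
        ["2018-01-01", "2018-01-04", "2018-01-02", "2018-01-04"]
    ),
    "country_full": ["France", "France", "Brazil", "Brazil"],
    "rank": [9, 7, 2, 1],
})

daily_parts = []
for country, group in monthly.groupby("country_full"):
    part = (
        group.set_index("rank_date")[["rank"]]
        .resample("D").first()   # one row per calendar day, NaN on days without a ranking
        .ffill()                 # carry the last published rank forward
    )
    part["country_full"] = country
    daily_parts.append(part)

daily = pd.concat(daily_parts).reset_index()
print(daily)
```

Every day between two publication dates now carries the most recent rank, so a match played on any of those days can be joined to a ranking row.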
rankings.head()
| | rank | country_full | country_abrv | cur_year_avg_weighted | rank_date | two_year_ago_weighted | three_year_ago_weighted | weighted_points |
---|---|---|---|---|---|---|---|---|
0 | 1 | Germany | GER | 0.0 | 1993-08-08 | 0.0 | 0.0 | 0.0 |
1 | 2 | Italy | ITA | 0.0 | 1993-08-08 | 0.0 | 0.0 | 0.0 |
2 | 3 | Switzerland | SUI | 0.0 | 1993-08-08 | 0.0 | 0.0 | 0.0 |
3 | 4 | Sweden | SWE | 0.0 | 1993-08-08 | 0.0 | 0.0 | 0.0 |
4 | 5 | Argentina | ARG | 0.0 | 1993-08-08 | 0.0 | 0.0 | 0.0 |
rankings = rankings.set_index(['rank_date'])\
.groupby(['country_full'], group_keys=False)\
.resample('D').first()\
.ffill()\
.reset_index()
rankings.head()
| | rank_date | rank | country_full | country_abrv | cur_year_avg_weighted | two_year_ago_weighted | three_year_ago_weighted | weighted_points |
---|---|---|---|---|---|---|---|---|
0 | 2003-01-15 | 204.0 | Afghanistan | AFG | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 2003-01-16 | 204.0 | Afghanistan | AFG | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 2003-01-17 | 204.0 | Afghanistan | AFG | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2003-01-18 | 204.0 | Afghanistan | AFG | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 2003-01-19 | 204.0 | Afghanistan | AFG | 0.0 | 0.0 | 0.0 | 0.0 |
rankings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1830658 entries, 0 to 1830657
Data columns (total 8 columns):
rank_date                  datetime64[ns]
rank                       float64
country_full               object
country_abrv               object
cur_year_avg_weighted      float64
two_year_ago_weighted      float64
three_year_ago_weighted    float64
weighted_points            float64
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 111.7+ MB
matches = matches.merge(rankings,
left_on=['date', 'home_team'],
right_on=['rank_date', 'country_full'])
matches.head()
matches = matches.merge(rankings,
left_on=['date', 'away_team'],
right_on=['rank_date', 'country_full'],
suffixes=('_home', '_away'))
matches = matches.merge(players,
left_on =["home_team"],
right_on = ["Nationality"])
matches = matches.merge(players,
left_on = ['away_team'],
right_on = ["Nationality"],
suffixes = ('_home', "_away"))
matches = matches.merge(all_time_stats,
left_on = ["home_team"],
right_on = ["Country"])
matches = matches.merge(all_time_stats,
left_on = ["away_team"],
right_on = ["Country"],
suffixes = ("_home", "_away"))
matches.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6057 entries, 0 to 6056
Columns: 119 entries, date to Best_finish_away
dtypes: bool(1), datetime64[ns](3), float64(78), int64(22), object(15)
memory usage: 5.5+ MB
How will we compare the two teams in a match? A simple way is to take the difference of each stat between the teams: for example, the difference in FIFA ranking position, the difference in average player age, and so on. This is a bit tedious because every column is written out by hand, but here it is:
matches['rank_difference'] = matches['rank_home'] - matches['rank_away']
matches['average_rank'] = (matches['rank_home'] + matches['rank_away'])/2
matches['score_difference'] = matches['home_score'] - matches['away_score']
matches["point_difference"] = matches['weighted_points_home'] - matches['weighted_points_away']
matches["rating_difference"] = matches["Rating_home"] - matches["Rating_away"]
matches["Age_difference"] = matches["Age_home"] - matches["Age_away"]
matches["Weak_foot_difference"] = matches["Weak_foot_home"] - matches["Weak_foot_away"]
matches["Skill_Moves_difference"] = matches["Skill_Moves_home"] - matches["Skill_Moves_away"]
matches["Ball_Control_difference"] = matches["Ball_Control_home"] - matches["Ball_Control_away"]
matches["Dribbling_difference"] = matches["Dribbling_home"] - matches["Dribbling_away"]
matches["Marking_difference"] = matches["Marking_home"] - matches["Marking_away"]
matches["Sliding_Tackle_difference"] = matches["Sliding_Tackle_home"] - matches["Sliding_Tackle_away"]
matches["Standing_Tackle_difference"] = matches["Standing_Tackle_home"] - matches["Standing_Tackle_away"]
matches["Aggression_difference"] = matches["Aggression_home"] - matches["Aggression_away"]
matches["Reactions_difference"] = matches["Reactions_home"] - matches["Reactions_away"]
matches["Attacking_Position_difference"] = matches["Attacking_Position_home"] - matches["Attacking_Position_away"]
matches["Interceptions_difference"] = matches["Interceptions_home"] - matches["Interceptions_away"]
matches["Vision_difference"] = matches["Vision_home"] - matches["Vision_away"]
matches["Composure_difference"] = matches["Composure_home"] - matches["Composure_away"]
matches["Crossing_difference"] = matches["Crossing_home"] - matches["Crossing_away"]
matches["Short_Pass_difference"] = matches["Short_Pass_home"] - matches["Short_Pass_away"]
matches["Long_Pass_difference"] = matches["Long_Pass_home"] - matches["Long_Pass_away"]
matches["Stamina_difference"] = matches["Stamina_home"] - matches["Stamina_away"]
matches["Penalties_difference"] = matches["Penalties_home"] - matches["Penalties_away"]
matches["Acceleration_difference"] = matches["Acceleration_home"] - matches["Acceleration_away"]
matches["Speed_difference"] = matches["Speed_home"] - matches["Speed_away"]
matches["Strength_difference"] = matches["Strength_home"] - matches["Strength_away"]
matches["Balance_difference"] = matches["Balance_home"] - matches["Balance_away"]
matches["Agility_difference"] = matches["Agility_home"] - matches["Agility_away"]
matches["Jumping_difference"] = matches["Jumping_home"] - matches["Jumping_away"]
matches["Heading_difference"] = matches["Heading_home"] - matches["Heading_away"]
matches["Shot_Power_difference"] = matches["Shot_Power_home"] - matches["Shot_Power_away"]
matches["Finishing_difference"] = matches["Finishing_home"] - matches["Finishing_away"]
matches["Long_Shots_difference"] = matches["Long_Shots_home"] - matches["Long_Shots_away"]
matches["Curve_difference"] = matches["Curve_home"] - matches["Curve_away"]
matches["Freekick_Accuracy_difference"] = matches["Freekick_Accuracy_home"] - matches["Freekick_Accuracy_away"]
matches["Volleys_difference"] = matches["Volleys_home"] - matches["Volleys_away"]
matches["Part's_difference"] = matches["Part's_home"] - matches["Part's_away"]
matches["Played_difference"] = matches["Played_home"] - matches["Played_away"]
matches["Won_difference"] = matches["Won_home"] - matches["Won_away"]
matches["Drawn_difference"] = matches["Drawn_home"] - matches["Drawn_away"]
matches["Lost_difference"] = matches["Lost_home"] - matches["Lost_away"]
matches["Goal_Difference_difference"] = matches["Goal Difference_home"] - matches["Goal Difference_away"]
matches["Points_difference"] = matches["Points_home"] - matches["Points_away"]
matches["Average_points_difference"] = matches["Average_points_home"] - matches["Average_points_away"]
matches['is_won'] = matches['score_difference'] > 0 # take draw as lost
matches['is_stake'] = matches['tournament'] != 'Friendly'
Handling each of these variables in what follows is also somewhat long-winded. There are certainly more compact ways to do it but, due to time constraints, we preferred to proceed this way.
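One more compact way to build the difference columns is a loop over the stat names. This is a sketch on a toy frame that follows the same "_home" / "_away" naming; the stats list is a hypothetical subset, and note that the notebook itself names the first column rating_difference in lowercase.

```python
import pandas as pd

# Toy frame mimicking the merged table's column naming (invented values)
toy = pd.DataFrame({
    "Rating_home": [70.0, 65.0], "Rating_away": [68.0, 71.0],
    "Age_home": [26.0, 27.5],    "Age_away": [25.0, 24.5],
})

stats = ["Rating", "Age"]  # hypothetical subset of the stat names

# One difference column per stat: home value minus away value
for stat in stats:
    toy[f"{stat}_difference"] = toy[f"{stat}_home"] - toy[f"{stat}_away"]

print(toy[[f"{stat}_difference" for stat in stats]])
```

The same list of names can then be reused to select the feature columns for the model.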
Building the model¶
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
X = matches.loc[:,['average_rank',
'rank_difference',
"point_difference",
'is_stake',
"rating_difference",
"Age_difference",
"Weak_foot_difference",
"Skill_Moves_difference",
"Ball_Control_difference",
"Dribbling_difference",
"Marking_difference",
"Sliding_Tackle_difference",
"Standing_Tackle_difference",
"Aggression_difference",
"Reactions_difference",
"Interceptions_difference",
"Vision_difference",
"Crossing_difference",
"Short_Pass_difference",
"Long_Pass_difference",
"Stamina_difference",
"Penalties_difference",
"Acceleration_difference",
"Speed_difference",
"Strength_difference",
"Balance_difference",
"Agility_difference",
"Jumping_difference",
"Heading_difference",
"Shot_Power_difference",
"Finishing_difference",
"Long_Shots_difference",
"Curve_difference",
"Freekick_Accuracy_difference",
"Volleys_difference",
"Won_difference",
"Drawn_difference",
"Lost_difference",
"Average_points_difference",
]]
y = matches['is_won']
y = pd.get_dummies(y, drop_first = True)
y.head()
| | True |
---|---|
0 | 1 |
1 | 1 |
2 | 1 |
3 | 0 |
4 | 0 |
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.ensemble import RandomForestClassifier
pre_classifier = RandomForestClassifier()
pre_classifier.fit(X_train, y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  after removing the cwd from sys.path.
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
pre_classifier.score(X_test, y_test)
0.6023102310231023
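We add the Random Forest's predictions to our dataset as an extra feature and then fit an XGBoost classifier on the result. Note that pre_classifier.predict(X) is computed on every row, including the rows the forest was trained on, and the data is then split again at random. The new feature therefore leaks label information, so the accuracy reported for the XGBoost model is optimistic.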
X_new = pd.concat([X, pd.DataFrame({"prediction_from_RF":pre_classifier.predict(X)})], axis=1)
X_new.head()
| | average_rank | rank_difference | point_difference | is_stake | rating_difference | Age_difference | Weak_foot_difference | Skill_Moves_difference | Ball_Control_difference | Dribbling_difference | ... | Finishing_difference | Long_Shots_difference | Curve_difference | Freekick_Accuracy_difference | Volleys_difference | Won_difference | Drawn_difference | Lost_difference | Average_points_difference | prediction_from_RF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.5 | 37.0 | 0.0 | True | -4.107843 | 1.605882 | -0.234641 | -0.264052 | -7.986928 | -6.426144 | ... | -6.547059 | -8.129412 | -10.150327 | -7.005882 | -9.759477 | -22 | -11 | -14 | -1.33 | 1 |
1 | 42.5 | -17.0 | 0.0 | True | -4.107843 | 1.605882 | -0.234641 | -0.264052 | -7.986928 | -6.426144 | ... | -6.547059 | -8.129412 | -10.150327 | -7.005882 | -9.759477 | -22 | -11 | -14 | -1.33 | 1 |
2 | 31.0 | -26.0 | 0.0 | True | -4.107843 | 1.605882 | -0.234641 | -0.264052 | -7.986928 | -6.426144 | ... | -6.547059 | -8.129412 | -10.150327 | -7.005882 | -9.759477 | -22 | -11 | -14 | -1.33 | 1 |
3 | 51.0 | 30.0 | 0.0 | True | -4.107843 | 1.605882 | -0.234641 | -0.264052 | -7.986928 | -6.426144 | ... | -6.547059 | -8.129412 | -10.150327 | -7.005882 | -9.759477 | -22 | -11 | -14 | -1.33 | 0 |
4 | 53.0 | 26.0 | 0.0 | True | -4.107843 | 1.605882 | -0.234641 | -0.264052 | -7.986928 | -6.426144 | ... | -6.547059 | -8.129412 | -10.150327 | -7.005882 | -9.759477 | -22 | -11 | -14 | -1.33 | 0 |
5 rows × 40 columns
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.2)
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:219: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:252: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True, subsample=1)
classifier.score(X_test, y_test)
0.8902640264026402
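A common way to avoid feeding a second-stage model predictions made on rows the first model was trained on is to use out-of-fold predictions. The sketch below shows the idea on synthetic data, assuming scikit-learn's cross_val_predict; the variable names are hypothetical and this is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic data (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Training rows: each prediction comes from a forest that never saw that row
oof_feature = cross_val_predict(forest, X_tr, y_tr, cv=5, method="predict")

# Test rows: predictions from a forest fit on the training rows only
forest.fit(X_tr, y_tr)
test_feature = forest.predict(X_te)

# Append the forest's prediction as one extra column
X_tr_stacked = np.column_stack([X_tr, oof_feature])
X_te_stacked = np.column_stack([X_te, test_feature])
```

A second-stage model such as XGBClassifier could then be fit on X_tr_stacked and scored on X_te_stacked, using a single split throughout.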
Conclusion¶
XGBoost is a powerful, flexible, and efficient machine learning algorithm that has become a staple of winning solutions in machine learning competitions and is widely used in industry. Its handling of missing data, built-in cross-validation, and efficient computation (due to parallel processing) set it apart from many other boosting implementations. However, it can be memory-intensive and might require careful tuning of parameters. In this notebook, we have seen how to use XGBoost to predict match outcomes, the building block for predicting the winner of the 2018 FIFA World Cup.
Keep in mind that predicting the winner of the World Cup is a very difficult task, first of all because the outcome does not depend only on stats. Many factors can influence a match: the weather, the referee, the state of mind of the players, the injuries, the suspensions, the tactical choices of the coaches, the motivation of the players, the public, the fatigue, the chance, the individual performances, the collective performances, the experience, the strategy, the training, the physical condition, the team cohesion, the team spirit, the team play and so many other things 🤓