We ran 100,000 computer simulations of the World Cup. And the winner will be …

Source: The Conversation – USA (2)

Paul the Octopus opted for Spain against the Netherlands in 2010. But how do his predictive skills compare to machine learning? Roland Weihrauch/DPA/AFP via Getty Images

In times past, when we wanted to know which team would win the World Cup, we had to turn to seers with crystal balls, use divination via tea leaves, or hope for Paul the Octopus to tell us what would happen.

But modern data science can provide a better alternative. As part of a team of statisticians, I helped train a machine learning algorithm to predict the most likely course of the tournament.

Probabilistic forecasts and loaded dice

The algorithm we built proceeds in two steps.

In the first, sophisticated statistical models and expert insight from bookmakers and transfer markets are combined to determine the strengths of all teams and their players. In the second step, a machine learning algorithm decides how to best combine the strength estimates with other information about the teams.

This produced a probabilistic forecast for each possible match in the tournament. It can be thought of as a pair of loaded dice: Instead of having the numbers 1 to 6 with equal probabilities, these loaded dice have different probabilities for the number of goals for either team.

For example, according to our forecast, Mexico has a die rolling 1.9 goals on average in the opening match, whereas opponent South Africa has an average of only 0.7. But this does not mean that Mexico will surely win.

Rather, a win for Mexico is the most likely outcome with 65% probability. A draw is less likely (21%), and a win for South Africa is the least likely outcome (14%).

‘Vuelve a casa, el fútbol vuelve a casa!’

Using different pairs of loaded dice, the result of each match in the World Cup can be simulated.
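To make the loaded-dice picture concrete, here is a minimal sketch that turns the two goal averages quoted for the opening match (1.9 for Mexico, 0.7 for South Africa) into win, draw and loss probabilities. It assumes each team's goal count follows an independent Poisson distribution. That is a common simplification in soccer modeling, not something the article states about the actual forecast, and the function names are illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the average is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def outcome_probabilities(lam_a: float, lam_b: float, max_goals: int = 15):
    """Return (P(A wins), P(draw), P(B wins)).

    Assumption: both goal counts are independent Poisson variables.
    Scorelines above max_goals are ignored; their probability is negligible
    for realistic goal averages.
    """
    win_a = draw = win_b = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win_a += p
            elif a == b:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b


# Goal averages taken from the article's opening-match example.
mexico, draw, south_africa = outcome_probabilities(1.9, 0.7)
print(f"Mexico {mexico:.1%}, draw {draw:.1%}, South Africa {south_africa:.1%}")
```

Under these assumptions the result lands within a couple of percentage points of the 65%, 21% and 14% quoted above. The small gap suggests the actual forecast uses a somewhat different goal distribution than this simple model.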

We took into account the official tournament draw and all FIFA rules, including the possibility of overtime and penalty shootouts. We ran the simulation 100,000 times to determine the tournament’s most likely course. The results show that Spain is the favorite for the title with a winning probability of 14.5%, closely followed by England and France, each at 12.4%, and Germany at 11.2%.
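The idea of replaying the tournament many times can be sketched on a much smaller scale. The example below simulates a four-team knockout bracket, with extra time and a penalty shootout when a match is level. Everything specific is an assumption made for illustration: the team names and goal averages are made up, each team's average is taken to be the same against every opponent, goals are drawn from a Poisson distribution, extra time uses one third of the 90-minute average, and the shootout is a coin flip. The real simulation follows the official draw and the full FIFA rules.

```python
import math
import random
from collections import Counter


def sample_goals(rng: random.Random, lam: float) -> int:
    """Draw a Poisson-distributed goal count (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def knockout_winner(rng, team_a, team_b, rates):
    """Simulate one do-or-die match and return the winning team."""
    lam_a, lam_b = rates[team_a], rates[team_b]
    goals_a, goals_b = sample_goals(rng, lam_a), sample_goals(rng, lam_b)
    if goals_a == goals_b:
        # Assumption: 30 minutes of extra time at one third of the 90-minute rate.
        goals_a += sample_goals(rng, lam_a / 3)
        goals_b += sample_goals(rng, lam_b / 3)
    if goals_a == goals_b:
        # Assumption: the penalty shootout is a fair coin flip.
        return team_a if rng.random() < 0.5 else team_b
    return team_a if goals_a > goals_b else team_b


def simulate_bracket(rng, bracket, rates):
    """Play rounds between adjacent teams until one champion remains."""
    teams = list(bracket)
    while len(teams) > 1:
        teams = [
            knockout_winner(rng, teams[i], teams[i + 1], rates)
            for i in range(0, len(teams), 2)
        ]
    return teams[0]


def title_probabilities(bracket, rates, n_sims=100_000, seed=1):
    """Estimate each team's title probability as its share of simulated wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(rng, bracket, rates) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in bracket}


# Hypothetical teams and goal averages, not taken from the forecast.
rates = {"Team A": 1.8, "Team B": 1.0, "Team C": 1.4, "Team D": 0.8}
bracket = ["Team A", "Team B", "Team C", "Team D"]
probs = title_probabilities(bracket, rates, n_sims=20_000)
for team, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{team}: {p:.1%}")
```

The title probabilities are simply the fraction of simulated tournaments each team wins, which is why more runs give steadier estimates.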

Due to the expanded tournament – this World Cup has 48 teams and five rounds in the knockout stage – this group of favorites is tightly packed. Portugal and Argentina also have good chances to win the title, at 8.9% and 8.2%, respectively.

For its part, the United States has a good chance of reaching the Round of 32: 78%. This is the highest in their group, which has three other teams. In the knockout stage, however, when every match is do or die, the probabilities of the U.S. team “surviving” go down relatively quickly.

The probability of a home victory in the final at MetLife Stadium in New Jersey on July 19 is 1%.

A deeper peek into the engine room

Our machine learning algorithm and subsequent simulations are fueled by data, expert knowledge and statistical models.

First, all national matches over the past eight years are the basis for a “retrospective” estimate of the teams’ strengths. Second, a “prospective” strength estimate is obtained from quoted odds of various international bookmakers, reflecting their expert opinions about the upcoming tournament.
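As a rough illustration of how quoted odds carry information about team strength, the sketch below converts decimal odds for the tournament winner into probabilities and averages them across bookmakers. The article does not describe how the odds are actually processed. The simple normalization used here, dividing out the bookmaker's margin proportionally, is one basic approach, and the bookmakers, teams and odds are all hypothetical.

```python
def implied_probabilities(decimal_odds: dict) -> dict:
    """Convert one bookmaker's decimal odds into probabilities that sum to 1.

    The raw values 1/odds add up to more than 1 because of the bookmaker's
    margin; dividing by their sum removes that margin proportionally.
    """
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


def consensus_probabilities(books: list) -> dict:
    """Average the normalized probabilities across several bookmakers."""
    per_book = [implied_probabilities(book) for book in books]
    teams = per_book[0].keys()
    return {team: sum(b[team] for b in per_book) / len(per_book) for team in teams}


# Hypothetical odds from two made-up bookmakers for a four-team field.
book_1 = {"Team A": 2.5, "Team B": 3.5, "Team C": 5.0, "Team D": 9.0}
book_2 = {"Team A": 2.6, "Team B": 3.4, "Team C": 5.5, "Team D": 8.0}
consensus = consensus_probabilities([book_1, book_2])
for team, p in consensus.items():
    print(f"{team}: {p:.1%}")
```

Shorter odds translate into a higher implied winning probability, which is what makes them usable as a forward-looking strength signal.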

Third, ratings of the individual players are produced based on their contributions to goals at the club and national levels. And finally, the current quality and future potential of the players are reflected in their expected market values.

These are available from the Transfermarkt website, which uses a wisdom-of-the-crowd approach to estimate the unknown real-market values. These four variables are combined with a broad range of further relevant inputs reflecting the current states of the different teams and the countries they come from.

This includes team-specific details, such as their FIFA rank and the number of players in the semifinals of this year’s Champions League. We also factored in country-specific socioeconomic factors, such as GDP per capita.

To determine if and how these features are relevant for the actual results in a World Cup, we used a machine learning algorithm. Specifically, we trained a so-called random forest, which consists of many decision trees, each capturing a slightly different subset of the data.

The algorithm has been trained on all matches played at the major soccer tournaments since World Cup 2006. It thus links a team’s strength, market value and other factors to the number of goals scored in matches at World Cups.
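The core idea of a random forest, many simple trees fitted to resampled data and then averaged, can be shown in miniature. The sketch below is a toy version built for illustration only: each "tree" is a single split on one randomly chosen feature, the data are synthetic, and the feature names are hypothetical. A real forecast would use deeper trees from an established library and real match data.

```python
import random
import statistics


def fit_stump(rows, targets, feature):
    """Find the single split on one feature that minimizes squared error."""
    best = None
    values = sorted({row[feature] for row in rows})
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= cut]
        right = [t for r, t in zip(rows, targets) if r[feature] > cut]
        mean_l, mean_r = statistics.fmean(left), statistics.fmean(right)
        sse = sum((t - mean_l) ** 2 for t in left) + sum((t - mean_r) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, cut, mean_l, mean_r)
    if best is None:  # the feature is constant in this sample
        mean = statistics.fmean(targets)
        return (feature, 0.0, mean, mean)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Fit one-split trees, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # random feature per tree
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all trees: the expected number of goals."""
    return statistics.fmean(
        [mean_l if row[f] <= cut else mean_r for f, cut, mean_l, mean_r in forest]
    )


# Synthetic training data (hypothetical features, not real match records).
data_rng = random.Random(42)
rows, goals = [], []
for _ in range(200):
    strength_gap = round(data_rng.uniform(-2, 2), 1)  # own minus opponent strength
    unrelated = round(data_rng.uniform(0, 1), 1)      # an input with no real effect
    rows.append((strength_gap, unrelated))
    goals.append(max(0, round(1.3 + 0.6 * strength_gap + data_rng.gauss(0, 0.7))))

forest = fit_forest(rows, goals)
strong = predict(forest, (1.5, 0.5))
weak = predict(forest, (-1.5, 0.5))
print(f"expected goals: stronger side {strong:.2f}, weaker side {weak:.2f}")
```

The averaged prediction plays the role of the goal average in the loaded-dice picture: a team that is stronger than its opponent gets a higher expected number of goals.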

This is the information that loads the dice for our simulations.

Find out more

This is not the first time our team has collaborated to forecast a World Cup. It comprises Andreas Groll and Rouven Michels and colleagues at TU Dortmund University in Germany, Lars Magnus Hvattum at Norway’s Molde University College, Gunther Schauberger at TU Munich, and me.

In the 2019 Women’s World Cup we correctly predicted the U.S. as the winner. In the 2023 Women’s World Cup and the 2022 men’s World Cup, the winners – Spain and Argentina, respectively – were not our favorites, although we did predict them to be serious contenders.

The bottom line is that forecasts are about probabilities.

Our program will not predict the winner with 100% certainty – but it might do better than an eight-limbed mollusk.

Achim Zeileis does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.

Original source: https://analysis1.mil-osi.com/2026/06/09/we-ran-100000-computer-simulations-of-the-world-cup-and-the-winner-will-be/