Statistical association football predictions

Publications about statistical models for football predictions started appearing in the 1990s, but the first model was proposed much earlier by Moroney,[2] who published his first statistical analysis of soccer match results in 1956.

Sequences of passes between players during football matches were successfully analyzed using the negative binomial distribution by Reep and Benjamin [3] in 1968.

They improved this method in 1971, and in 1974 Hill [4] indicated that soccer game results are to some degree predictable and not simply a matter of chance.

The first model predicting outcomes of football matches between teams with different skills was proposed by Michael Maher [5] in 1982.

According to his model, the number of goals each team scores during a game is drawn from a Poisson distribution.

The model parameters are defined by the difference between attacking and defensive skills, adjusted by the home field advantage factor.
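As a minimal sketch of this idea, the following code turns attacking and defensive strengths into expected goals and then into win, draw and loss probabilities. It assumes a log scale for the strengths and independent Poisson goal counts; the function names, the truncation at `max_goals` and the numeric strengths are hypothetical and chosen only for illustration.

```python
import math


def goal_rate(attack, defence, home_advantage=0.0):
    """Expected goals: attack minus opposing defence (log scale), plus home edge."""
    return math.exp(attack - defence + home_advantage)


def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals are Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def outcome_probabilities(home_mean, away_mean, max_goals=15):
    """Return (home win, draw, away win), truncating each score at max_goals."""
    home_win = draw = away_win = 0.0
    for x in range(max_goals + 1):
        px = poisson_pmf(x, home_mean)
        for y in range(max_goals + 1):
            p = px * poisson_pmf(y, away_mean)
            if x > y:
                home_win += p
            elif x == y:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical strengths, not taken from any real team or season.
home_mean = goal_rate(attack=0.3, defence=0.1, home_advantage=0.25)
away_mean = goal_rate(attack=0.2, defence=0.2)
home_win, draw, away_win = outcome_probabilities(home_mean, away_mean)
```

Truncating the score grid at 15 goals per side discards only a negligible probability mass for realistic goal rates, so the three probabilities sum to almost exactly one.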

The methods for modeling the home field advantage factor were summarized in an article by Courneya and Carron [6] in 1992.

Recursive Bayesian estimation has also been used to rate football teams; this approach was more realistic than soccer prediction based on common average statistics.

All the prediction methods can be categorized according to tournament type, time-dependence and regression algorithm.

Football prediction methods differ between round-robin tournaments and knockout competitions.

The methods for knockout competitions are summarized in an article by Diego Kuonen.

The least squares rating method is based on the assumption that the difference between the ratings assigned to two rival teams is proportional to the outcome of their match.

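Assume that the teams A, B, C and D are playing in a tournament, and let $y_k$ denote the goal difference observed in match $k$.

Though the ratings $x_A$, $x_B$, $x_C$ and $x_D$ of the teams are unknown, each outcome can be modeled as the difference between the ratings of the two teams plus a noise term; for a match between A and B, $y_k = x_A - x_B + \varepsilon_k$.

Stacking all matches gives the linear system $y = Xx + e$, where each row of the selection matrix $X$ contains $1$ for the home team, $-1$ for the away team and $0$ elsewhere.

If the matrix $X^{T}X$ has full rank, the algebraic solution of the system may be found via the least squares method: $\hat{x} = (X^{T}X)^{-1}X^{T}y$.

If not, one can use the Moore–Penrose pseudoinverse to get $\hat{x} = (X^{T}X)^{+}X^{T}y$.

The final rating parameters are the entries of $\hat{x} = (x_A, x_B, x_C, x_D)^{T}$.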

The advantage of this rating method compared to the standard ranking systems is that the numbers are continuously scaled, defining the precise difference between the teams’ strengths.
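The least squares rating can be sketched in a few lines of plain Python. The match results below are hypothetical and are not the example from the original article. The extra row that pins the sum of the ratings to zero is also an assumption of this sketch: it makes the normal equations solvable and, for a connected schedule, gives the same answer as the pseudoinverse solution.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def least_squares_ratings(teams, matches):
    """matches: (home, away, goal difference). Returns {team: rating}."""
    index = {team: k for k, team in enumerate(teams)}
    n = len(teams)
    rows, y = [], []
    for home, away, diff in matches:
        row = [0.0] * n
        row[index[home]] = 1.0    # +1 marks the home team
        row[index[away]] = -1.0   # -1 marks the away team
        rows.append(row)
        y.append(float(diff))
    # Assumed extra constraint: ratings sum to zero (fixes the free offset).
    rows.append([1.0] * n)
    y.append(0.0)
    # Normal equations (X^T X) x = X^T y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yk for r, yk in zip(rows, y)) for i in range(n)]
    return dict(zip(teams, solve_linear_system(A, b)))


# Hypothetical single round-robin results (home, away, goal difference).
matches = [("A", "B", 2), ("C", "D", 0), ("A", "C", 1),
           ("B", "D", 1), ("A", "D", 3), ("B", "C", -1)]
ratings = least_squares_ratings(["A", "B", "C", "D"], matches)
```

For a complete single round robin such as this one, each rating works out to the team's net goal difference divided by the number of teams, which gives a quick sanity check on the solver.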

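In the Poisson model, the numbers of goals $X_{ij}$ and $Y_{ij}$ scored by home team $i$ and away team $j$ are independent Poisson variables with means $\lambda_{ij}$ and $\mu_{ij}$, given by the log-linear model

$$\log \lambda_{ij} = c^{\lambda} + a_i - d_j + h, \qquad \log \mu_{ij} = c^{\mu} + a_j - d_i,$$

where $a_i$, $d_i$ and $h$ refer to attacking and defensive strengths and to home field advantage respectively.

$c^{\lambda}$ and $c^{\mu}$ are correction factors which represent the means of goals scored during the season by home and away teams.

Assuming that $C$ signifies the number of teams participating in a season and $N$ stands for the number of matches played until now, the team strengths can be estimated by minimizing the negative log-likelihood function with respect to $\lambda$ and $\mu$:

$$L = -\sum_{n=1}^{N} \left[ x_n \log \lambda_{i_n j_n} - \lambda_{i_n j_n} - \log x_n! + y_n \log \mu_{i_n j_n} - \mu_{i_n j_n} - \log y_n! \right],$$

where $x_n$ and $y_n$ are the home and away goals in match $n$, played between home team $i_n$ and away team $j_n$.

Given that $c^{\lambda}$ and $c^{\mu}$ are known, the parameters $a_i$, $d_i$ ($i = 1, \dots, C$) and $h$ that minimize the negative log-likelihood can be estimated by Expectation Maximization.

Improvements for this model were suggested by Mark Dixon and Stuart Coles.[10] They introduced a correlation factor for the low scores 0–0, 1–0, 0–1 and 1–1, where the independent Poisson model does not hold.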
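The following sketch evaluates the negative log-likelihood of a log-linear Poisson model with an optional low-score correction. It assumes that the home team's expected goals are `exp(c_home + attack[i] - defence[j] + h)` and the away team's are `exp(c_away + attack[j] - defence[i])`. The correction factor follows the commonly cited Dixon–Coles form with a single dependence parameter `rho`; treat that exact expression, like all names here, as an assumption to check against [10].

```python
import math


def low_score_correction(x, y, lam, mu, rho):
    """Adjustment for the scores 0-0, 0-1, 1-0 and 1-1; equals 1 when rho is 0."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def poisson_log_pmf(k, mean):
    """Log-probability of k goals under a Poisson distribution."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def negative_log_likelihood(matches, attack, defence, h,
                            c_home=0.0, c_away=0.0, rho=0.0):
    """matches: (home index, away index, home goals, away goals)."""
    total = 0.0
    for i, j, x, y in matches:
        lam = math.exp(c_home + attack[i] - defence[j] + h)  # home mean
        mu = math.exp(c_away + attack[j] - defence[i])       # away mean
        total -= poisson_log_pmf(x, lam) + poisson_log_pmf(y, mu)
        total -= math.log(low_score_correction(x, y, lam, mu, rho))
    return total


# Hypothetical two-team example with neutral strengths.
goalless = [(0, 1, 0, 0)]
nll_independent = negative_log_likelihood(goalless, [0.0, 0.0], [0.0, 0.0], h=0.0)
nll_corrected = negative_log_likelihood(goalless, [0.0, 0.0], [0.0, 0.0], h=0.0, rho=0.1)
```

Passing this function to a numerical optimizer over `attack`, `defence` and `h` would give strength estimates; the sketch only evaluates the objective and does not perform the fit.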

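Dimitris Karlis and Ioannis Ntzoufras [11] built a time-independent model based on the Skellam distribution, which is the distribution of the difference between two independent Poisson variables.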
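To illustrate, the Skellam probability of a given goal difference can be computed by summing over every pair of Poisson scores that produces it. This is a sketch under the assumption of independent Poisson goal counts; the function names, the truncation at `max_goals` and the goal rates are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability, computed in log space for numerical stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def skellam_pmf(k, home_mean, away_mean, max_goals=40):
    """P(home goals - away goals = k) for independent Poisson goal counts."""
    total = 0.0
    # The away score y must satisfy y >= 0 and y + k >= 0.
    for y in range(max(0, -k), max_goals + 1):
        total += poisson_pmf(y + k, home_mean) * poisson_pmf(y, away_mean)
    return total


# Hypothetical goal rates for a home and an away team.
draw = skellam_pmf(0, 1.5, 1.1)
home_win = sum(skellam_pmf(k, 1.5, 1.1) for k in range(1, 21))
away_win = sum(skellam_pmf(k, 1.5, 1.1) for k in range(-20, 0))
```

Working with the goal difference directly means only one distribution has to be fitted to obtain win, draw and loss probabilities.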

On the one hand, statistical models require a large number of observations to estimate their parameters accurately.

When too few observations are available during a season (as is usually the case), working with average statistics makes sense.

On the other hand, it is well known that team skills change during the season, making model parameters time-dependent.

Dixon and Coles [10] addressed this trade-off by assigning a larger weight to the latest match results.
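One way to picture such a weighting is an exponential decay in the age of each match. The sketch below is an assumption for illustration, not necessarily the exact scheme of [10]: the decay rate `xi`, the function names and the results are all hypothetical.

```python
import math


def time_weight(days_ago, xi):
    """Weight of a match played days_ago days before now; xi = 0 means no decay."""
    return math.exp(-xi * days_ago)


def weighted_mean_goals(results, xi):
    """results: list of (days_ago, goals). Returns the time-weighted mean of goals."""
    weights = [time_weight(days_ago, xi) for days_ago, _ in results]
    total = sum(w * goals for w, (_, goals) in zip(weights, results))
    return total / sum(weights)


# Hypothetical team that scored more in its recent matches.
results = [(7, 3), (14, 2), (120, 0), (200, 0)]
plain_mean = weighted_mean_goals(results, xi=0.0)
recent_mean = weighted_mean_goals(results, xi=0.01)
```

A larger `xi` tracks current form more closely but leaves fewer effective observations, which is exactly the trade-off described above.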

Rue and Salvesen [12] introduced a novel time-dependent rating method using the Markov chain model.

One parameter of their model represents the psychological effects caused by underestimation of the opposing team's strength.

Two further parameters refer to the loss of memory rate and to the prior attack variance respectively.

For a sequence of matches played in a given order (for example, one ending with B–C), the joint probability density can be written out explicitly.

Since analytical estimation of the parameters is difficult in this case, the Monte Carlo method is applied to estimate the parameters of the model.
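As a loose illustration of time-dependent strengths and Monte Carlo estimation, the sketch below lets an attack strength drift as a Gaussian random walk and estimates the spread of the result by simulation. This is a simplification assumed for this example only and is not the model of [12]: the variance is taken to grow with elapsed time divided by a memory-loss rate `tau`, and all names and values are hypothetical.

```python
import math
import random


def drift_strength(strength, elapsed_days, tau, sigma, rng):
    """One random-walk step: variance is sigma^2 * elapsed_days / tau."""
    std = sigma * math.sqrt(elapsed_days / tau)
    return strength + rng.gauss(0.0, std)


def simulate_strengths(start, elapsed_days, tau, sigma, n_samples, seed=0):
    """Monte Carlo draws of the strength after elapsed_days, from a seeded generator."""
    rng = random.Random(seed)
    return [drift_strength(start, elapsed_days, tau, sigma, rng)
            for _ in range(n_samples)]


# Hypothetical settings: 30 days between matches, tau of 100 days.
samples = simulate_strengths(start=0.2, elapsed_days=30, tau=100.0,
                             sigma=0.5, n_samples=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample variance should be close to `sigma ** 2 * elapsed_days / tau`, so a larger `tau` makes the strengths change more slowly between matches.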

Marek, Ťoupal and Šedivá (2014)[13] built on the research of Maher (1982),[5] Dixon and Coles (1997),[10] and others who applied such models to association football.