Bradley–Terry model

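The Bradley–Terry model is a probability model for the outcome of pairwise comparisons. Given a pair of items $i$ and $j$ drawn from some population, it estimates the probability that the pairwise comparison $i > j$ turns out true as

$$\Pr(i > j) = \frac{p_i}{p_i + p_j} \tag{1}$$

where $p_i$ is a positive real-valued score assigned to individual $i$.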

The comparison i > j can be read as "i is preferred to j", "i ranks higher than j", or "i beats j", depending on the application.
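
As a minimal sketch, the comparison probability is the ratio of one score to the sum of both scores, $p_i / (p_i + p_j)$. The function name `win_probability` and the example scores are illustrative assumptions, not part of the model's definition.

```python
def win_probability(p_i: float, p_j: float) -> float:
    """Probability that item i is preferred to item j, given positive scores."""
    if p_i <= 0 or p_j <= 0:
        raise ValueError("Bradley-Terry scores must be positive")
    return p_i / (p_i + p_j)


# Hypothetical scores: only the ratio of the two scores matters, so scaling
# both by the same constant leaves the probability unchanged.
print(win_probability(2.0, 1.0))
print(win_probability(6.0, 3.0))
```

Both calls give the same probability, which illustrates why the scores are only determined up to an overall multiplicative constant.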

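For example, $p_i$ might represent the skill of a team in a sports tournament and $\Pr(i > j)$ the probability that $i$ wins a game against $j$.[1][2]

Or $p_i$ might represent the quality or desirability of a commercial product and $\Pr(i > j)$ the probability that a consumer will prefer product $i$ over product $j$.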

The Bradley–Terry model can be used in the forward direction to predict outcomes, as described, but is more commonly used in reverse to infer the scores $p_i$ given an observed set of outcomes.[2]

In this type of application $p_i$ represents some measure of the strength or quality of $i$, and the model lets us estimate the strengths from a series of pairwise comparisons.

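In a survey of wine preferences, for instance, respondents might only be asked which of two sampled wines they prefer. Based on a set of such pairwise comparisons, the Bradley–Terry model can then be used to derive a full ranking of the wines.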

Once the values of the scores $p_i$ have been calculated, the model can then also be used in the forward direction, for instance to predict the likely outcome of comparisons that have not yet actually occurred.

The model is named after Ralph A. Bradley and Milton E. Terry,[3] who presented it in 1952,[4] although it had already been studied by Ernst Zermelo in the 1920s.[1][5][6]

Applications of the model include the ranking of competitors in sports, chess, and other competitions,[7] the ranking of products in paired comparison surveys of consumer choice, analysis of dominance hierarchies within animal and human communities,[8] ranking of journals, ranking of AI models,[9] and estimation of the relevance of documents in machine-learned search engines.

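Bradley and Terry themselves defined exponential score functions $p_i = e^{\beta_i}$, so that

$$\Pr(i > j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} \tag{2}$$

Equivalently, in terms of the logit function,

$$\operatorname{logit} \Pr(i > j) = \ln \frac{\Pr(i > j)}{1 - \Pr(i > j)} = \beta_i - \beta_j .$$

This formulation highlights the similarity between the Bradley–Terry model and logistic regression.

Both rest on the same logistic functional form; in ranking under the Bradley–Terry model one knows the functional form and attempts to infer the parameters.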
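
The link to logistic regression can be checked numerically. The sketch below assumes the exponential parameterization $p_i = e^{\beta_i}$ and compares the score ratio $p_i / (p_i + p_j)$ with the logistic function applied to $\beta_i - \beta_j$. The function names and the values of `beta` are hypothetical.

```python
import math


def probability_from_scores(p_i: float, p_j: float) -> float:
    """Score-ratio form: p_i / (p_i + p_j)."""
    return p_i / (p_i + p_j)


def probability_from_log_scores(beta_i: float, beta_j: float) -> float:
    """Logistic form: sigmoid of the difference of log-scores."""
    return 1.0 / (1.0 + math.exp(-(beta_i - beta_j)))


# Hypothetical log-scores for two items.
beta = {"a": 0.3, "b": -1.2}

ratio_form = probability_from_scores(math.exp(beta["a"]), math.exp(beta["b"]))
logistic_form = probability_from_log_scores(beta["a"], beta["b"])

# The two forms should agree to within floating-point rounding.
print(ratio_form, logistic_form)
```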

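The Bradley–Terry (BT) model is the two-item case of the Plackett–Luce (PL) model for rankings of several items. Given the scores $p_i$, the PL model can be sampled by the "exponential race" method: draw an independent exponentially distributed time with rate $p_i$ for each item and rank the items by increasing time.

For two items this reduces to the BT model, and in general, for any subset of the items, the ranking restricted to that subset again follows the PL model with the same scores.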
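
A minimal sketch of the exponential race, assuming each item $i$ receives an independent exponential finishing time with rate $p_i$ and items are ranked by who finishes first. The function name `sample_ranking`, the scores, and the sample size are illustrative assumptions. With two items, the fraction of races won by the first item should approach the Bradley–Terry probability $p_a / (p_a + p_b)$.

```python
import random


def sample_ranking(scores, rng):
    """Sample one ranking (best first) by the exponential race."""
    # Item i finishes at an Exponential(rate = p_i) time; earlier is better.
    times = {item: rng.expovariate(p) for item, p in scores.items()}
    return sorted(times, key=times.get)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
scores = {"a": 3.0, "b": 1.0}     # hypothetical scores
n_races = 20000

wins_a = sum(sample_ranking(scores, rng)[0] == "a" for _ in range(n_races))

# Should be close to 3 / (3 + 1) = 0.75, up to sampling noise.
print(wins_a / n_races)
```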

The most common application of the Bradley–Terry model is to infer the values of the parameters $p_i$ given an observed set of outcomes, such as wins and losses in a competition.

The simplest way to estimate the parameters is by maximum likelihood estimation, i.e., by maximizing the likelihood of the observed outcomes given the model and parameter values.

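Suppose we know the outcomes of a set of pairwise comparisons among $n$ individuals, and let $w_{ij}$ be the number of times individual $i$ beats individual $j$ (with $w_{ii} = 0$). The likelihood of this set of outcomes under the Bradley–Terry model is $\prod_{ij} [\Pr(i > j)]^{w_{ij}}$, and the log-likelihood of the parameter vector $p = [p_1, \ldots, p_n]$ is[1]

$$\ell(p) = \sum_{i} \sum_{j} w_{ij} \ln \frac{p_i}{p_i + p_j} = \sum_{i} \sum_{j} \left[ w_{ij} \ln p_i - w_{ij} \ln (p_i + p_j) \right].$$

Zermelo[5] showed that this expression has only a single maximum, which can be found by differentiating with respect to $p_i$ and setting the result to zero, which leads to

$$p_i = \frac{\sum_j w_{ij}}{\sum_j (w_{ij} + w_{ji}) / (p_i + p_j)}.$$

This equation has no known closed-form solution, but Zermelo suggested solving it by simple iteration.

Starting from any convenient set of (positive) initial values for the $p_i$, one iteratively performs the update

$$p_i' = \frac{\sum_j w_{ij}}{\sum_j (w_{ij} + w_{ji}) / (p_i + p_j)} \tag{3}$$

for all $i$ in turn.

The resulting parameters are arbitrary up to an overall multiplicative constant, so after computing all of the new values they should be normalized by dividing by their geometric mean thus:

$$p_i \leftarrow \frac{p_i}{\left( \prod_{j=1}^{n} p_j \right)^{1/n}}.$$

This estimation procedure improves the log-likelihood on every iteration, and is guaranteed to eventually reach the unique maximum.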
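
The procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: `wins[i][j]` is taken to be the number of times $i$ beats $j$, each round replaces $p_i$ in turn by its total wins divided by $\sum_j (w_{ij} + w_{ji}) / (p_i + p_j)$, and the vector is then divided by its geometric mean. The three-item win counts and all function names are hypothetical.

```python
import math


def log_likelihood(wins, p):
    """Sum over ordered pairs of w_ij * [ln p_i - ln(p_i + p_j)]."""
    n = len(p)
    return sum(
        wins[i][j] * (math.log(p[i]) - math.log(p[i] + p[j]))
        for i in range(n) for j in range(n) if i != j
    )


def normalize(p):
    """Divide every score by the geometric mean of the scores."""
    geo_mean = math.exp(sum(math.log(x) for x in p) / len(p))
    return [x / geo_mean for x in p]


def zermelo_round(wins, p):
    """One round of Zermelo's update, applied to each item in turn."""
    p = list(p)
    n = len(p)
    for i in range(n):
        total_wins = sum(wins[i][j] for j in range(n) if j != i)
        denom = sum(
            (wins[i][j] + wins[j][i]) / (p[i] + p[j])
            for j in range(n) if j != i
        )
        p[i] = total_wins / denom   # later items see this updated value
    return normalize(p)


# Hypothetical data: wins[i][j] = number of times item i beat item j.
wins = [[0, 3, 1],
        [1, 0, 2],
        [2, 1, 0]]

p = [1.0, 1.0, 1.0]
history = [log_likelihood(wins, p)]
for _ in range(50):
    p = zermelo_round(wins, p)
    history.append(log_likelihood(wins, p))

print(p)
print(history[0], history[-1])
```

Recording the log-likelihood after each round makes it easy to confirm that it never decreases. Normalizing does not affect it, because the likelihood depends only on ratios of scores.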

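The equation obtained by setting the derivative of the log-likelihood to zero can also be rearranged as

$$p_i = \frac{\sum_j w_{ij} \, p_j / (p_i + p_j)}{\sum_j w_{ji} / (p_i + p_j)},$$

where $w_{ij}$ counts the wins of $i$ over $j$, and this form can be solved by iterating

$$p_i' = \frac{\sum_j w_{ij} \, p_j / (p_i + p_j)}{\sum_j w_{ji} / (p_i + p_j)},$$

again normalizing by the geometric mean after every round of updates.

This iteration gives identical results to the one in (3) but converges much faster and hence is normally preferred over (3).[15]

Consider a sporting competition between four teams, who play a total of 22 games among themselves.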

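To estimate the relative strengths of the teams, we initialize the four entries in the parameter vector $p$ arbitrarily, for example assigning the value 1 to each team: [1, 1, 1, 1]. Each entry is then updated in turn, always using the most recent values of the other entries, and the vector is normalized by dividing by its geometric mean.

Repeating a further 10 times gives rapid convergence toward a final solution of $p$ = [0.640, 1.043, 0.660, 2.270], so the fourth team is estimated to be the strongest and the second team the next strongest.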
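
The worked example can be sketched in code. The table of results is not reproduced in the text, so the `wins` matrix below is an assumption: a set of results for four teams totalling 22 games, used purely for illustration. The update used is $p_i' = \left[ \sum_j w_{ij} p_j / (p_i + p_j) \right] / \left[ \sum_j w_{ji} / (p_i + p_j) \right]$, followed by division by the geometric mean. The function name `fast_round` is hypothetical.

```python
import math


def fast_round(wins, p):
    """One round of the faster update, applied to each team in turn."""
    p = list(p)
    n = len(p)
    for i in range(n):
        num = sum(wins[i][j] * p[j] / (p[i] + p[j]) for j in range(n) if j != i)
        den = sum(wins[j][i] / (p[i] + p[j]) for j in range(n) if j != i)
        p[i] = num / den            # later teams see this updated value
    geo_mean = math.exp(sum(math.log(x) for x in p) / n)
    return [x / geo_mean for x in p]


# ASSUMED results (not taken from the text): wins[i][j] = games team i won
# against team j. The entries total 22 games.
wins = [
    [0, 2, 0, 1],
    [3, 0, 5, 0],
    [0, 3, 0, 1],
    [4, 0, 3, 0],
]
total_games = sum(sum(row) for row in wins)

p = [1.0, 1.0, 1.0, 1.0]            # arbitrary positive starting values
for _ in range(100):
    p = fast_round(wins, p)

print(total_games)
print([round(x, 3) for x in p])
```

With these assumed results the scores should settle close to the solution quoted above, with the fourth team ranked strongest. A different results table would give different scores.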

The Crowd-BT model, developed in 2013 by Chen et al.,[16] extends the standard Bradley–Terry model to crowdsourced settings and reduces the number of comparisons needed by taking into account the reliability of each judge.

In particular, it identifies and excludes judges presumed to be spammers (selecting choices at random) or malicious (always selecting the wrong choice).

In a crowdsourced task of ranking documents by reading difficulty, with 624 judges contributing up to 40 pairwise comparisons each, Crowd-BT was shown to outperform both the standard Bradley–Terry model and the ranking system TrueSkill.

It has been recommended for use when quality results are valued over efficiency and the number of comparisons is high.