Brier score

The Brier score is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes or classes.

The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1).

Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the square of the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 or 1).
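As a minimal sketch of this common binary formulation, assuming it is the mean squared difference $BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2$ between forecast probabilities $f_t$ and 0/1 outcomes $o_t$, the score could be computed as follows. The function name `brier_score` and the sample data are illustrative and not taken from any particular library.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equally long")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative data: two confident correct forecasts and two 50/50 forecasts.
forecasts = [1.0, 0.0, 0.5, 0.5]
outcomes = [1, 0, 1, 0]
print(brier_score(forecasts, outcomes))  # 0.125
```

A forecast of 1.0 for an event that does not occur contributes the maximum squared error of 1, which is why this form of the score is bounded by zero and one.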

In the original (1950) formulation of the Brier score, the range is double, from zero to two.

The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but it is inappropriate for ordinal variables which can take on three or more values.

The above equation is a proper scoring rule only for binary events; if a multi-category forecast is to be evaluated, then the original definition given by Brier below should be used.

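In the original definition, the Brier score is calculated as follows:

$$BS = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti} - o_{ti})^2$$

where $R$ is the number of possible classes, $N$ is the number of instances, $f_{ti}$ is the predicted probability of class $i$ for instance $t$, and $o_{ti}$ is 1 if class $i$ occurred in instance $t$ and 0 otherwise.

Although the binary formulation is the most widely used, the original definition by Brier[1] is also applicable to multi-category forecasts, for which it remains a proper scoring rule, while the binary form (as used in the examples above) is only proper for binary events.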
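The multi-category definition can be sketched in code, assuming it sums the squared differences over every class of every instance and then averages over instances. The function name `brier_score_multiclass` and the convention of passing each actual outcome as a class index are assumptions made for this illustration.

```python
def brier_score_multiclass(forecasts, outcomes):
    """Original (1950-style) Brier score for mutually exclusive classes.

    forecasts: list of per-instance probability lists, each summing to one.
    outcomes:  list of the index of the class that actually occurred.
    """
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equally long")
    total = 0.0
    for probs, actual in zip(forecasts, outcomes):
        for k, p in enumerate(probs):
            o = 1.0 if k == actual else 0.0  # one-hot encoding of the outcome
            total += (p - o) ** 2
    return total / len(forecasts)


# A single 50/50 binary forecast scores 0.5 here, twice the 0.25 that the
# binary formulation would give, because both classes contribute.
print(brier_score_multiclass([[0.5, 0.5]], [0]))  # 0.5
```

Under this sketch, a forecast that puts all probability on a class that does not occur scores 2, the upper end of the original range.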

There are several decompositions of the Brier score which provide deeper insight into the behavior of a binary classifier.

The Brier score can be decomposed into three additive components: uncertainty, reliability, and resolution (Murphy 1973).[2] Each of these components can be decomposed further according to the number of possible classes in which the event can fall.

In this decomposition, $\bar{\mathbf{o}}$ denotes the observed climatological base rate for the event to occur.

The bold notation in the above formula indicates vectors, which is another way of denoting the original definition of the score and decomposing it according to the number of possible classes in which the event can fall.

Operations like the square and multiplication on these vectors are understood to be component-wise.

The Brier score is then the sum of the components of the resulting vector on the right-hand side.

Reliability is defined in the opposite direction to its everyday English meaning: a reliability of 0 means that the forecast is perfectly reliable.

In the worst case, when the climatological probability (the base rate) is always forecast, the resolution is zero.

In the best case, when the conditional probabilities are zero and one, the resolution is equal to the uncertainty.
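The three components can be illustrated with a short sketch for the binary case. It assumes the usual form of the decomposition, in which instances are grouped by their distinct forecast values $f_k$ (with $n_k$ instances and conditional event frequency $\bar{o}_k$ in group $k$), so that reliability is $\frac{1}{N}\sum_k n_k (f_k - \bar{o}_k)^2$, resolution is $\frac{1}{N}\sum_k n_k (\bar{o}_k - \bar{o})^2$, uncertainty is $\bar{o}(1 - \bar{o})$, and the Brier score equals reliability minus resolution plus uncertainty. The function name `brier_decomposition` and the data are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    Instances are grouped by exact forecast value, so the identity
    BS = reliability - resolution + uncertainty holds without binning error.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n  # overall observed frequency of the event

    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0
    resolution = 0.0
    for f, observed in groups.items():
        conditional_rate = sum(observed) / len(observed)
        reliability += len(observed) * (f - conditional_rate) ** 2
        resolution += len(observed) * (conditional_rate - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


forecasts = [0.2, 0.2, 0.8, 0.8, 0.8]
outcomes = [0, 1, 1, 1, 0]
rel, res, unc = brier_decomposition(forecasts, outcomes)
print(rel - res + unc)  # approximately 0.28, matching the directly computed score
```

In this sketch, always forecasting the base rate puts every instance in one group whose conditional frequency equals the base rate, giving zero resolution. Forecasts of exactly 0 or 1 that are always correct give a resolution equal to the uncertainty.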

In the alternative two-component decomposition into calibration (CAL) and refinement (REF), the second term, refinement, is an aggregation of resolution and uncertainty, and is related to the area under the ROC curve.

The Brier Score, and the CAL + REF decomposition, can be represented graphically through the so-called Brier Curves,[3] where the expected loss is shown for each operating condition.

This makes the Brier Score a measure of aggregated performance under a uniform distribution of class asymmetries.

A skill score value less than zero means that the performance is even worse than that of the baseline or reference predictions.

Writing the Brier skill score as $BSS = 1 - BS / BS_{ref}$, the term $BS_{ref}$ is the Brier score of the reference or baseline predictions which we seek to improve on.

While the reference predictions could in principle be given by any pre-existing model, by default one can use the naïve model that predicts the overall proportion or frequency of a given class in the data set being scored, as the constant predicted probability of that class occurring in each instance in the data set.[5][6]

In this default case, for binary (two-class) classification, the reference Brier score is given by (with $o_t$ the actual outcome of instance $t$ and $N$ the number of instances):

$$BS_{ref} = \frac{1}{N}\sum_{t=1}^{N}(\bar{o} - o_t)^2$$

where $\bar{o}$ is simply the average actual outcome, i.e. the overall proportion of true class 1 in the data set:

$$\bar{o} = \frac{1}{N}\sum_{t=1}^{N} o_t$$

With a Brier score, lower is better (it is a loss function) with 0 being the best possible score.

The Brier skill score can be more interpretable than the Brier score because the BSS is simply the percentage improvement in the BS compared to the reference model, and a negative BSS means you are doing even worse than the reference model, which may not be obvious from looking at the Brier score itself.

However, a BSS near 100% should not typically be expected because this would require that every probability prediction was nearly 0 or 1 (and was correct of course).
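A minimal sketch of the skill score follows, assuming the form $BSS = 1 - BS / BS_{ref}$ with the default reference model that always predicts the observed base rate. The function names and sample data are illustrative only.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_skill_score(forecasts, outcomes):
    """Skill relative to a constant forecast of the observed base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    bs = brier_score(forecasts, outcomes)
    bs_ref = brier_score([base_rate] * len(outcomes), outcomes)
    if bs_ref == 0:
        # All outcomes are identical, so the reference is already perfect
        # and the ratio is undefined.
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1 - bs / bs_ref


outcomes = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.3]  # mostly confident and correct
bad = [0.1, 0.9, 0.1, 0.9]   # confident and wrong

print(brier_skill_score(good, outcomes))  # roughly 0.85
print(brier_skill_score(bad, outcomes))   # negative: worse than the base rate
```

Forecasting the base rate itself gives a skill score of zero in this sketch, and only forecasts of exactly 0 or 1 that are always correct reach 1 (100%).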

Although the BSS is not in general a strictly proper scoring rule,[7] Murphy (1973)[8] proved that it is asymptotically proper with a large number of samples.

Classification's (probability estimation's) BSS is to its BS as regression's coefficient of determination ($R^2$) is to its mean squared error (MSE).[9]

Wilks (2010) has found that "[Q]uite large sample sizes, i.e. n > 1000, are required for higher-skill forecasts of relatively rare events, whereas only quite modest sample sizes are needed for low-skill forecasts of common events."