BLEU

Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU.[1]

Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgements of quality,[2][3] and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments (generally sentences) by comparing them with a set of good-quality reference translations; those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.

Because there are more opportunities to match, adding additional reference translations will increase the BLEU score.[5]

A basic, first attempt at defining the BLEU score would take two arguments: a candidate string $\hat{y}$ and a list of reference strings $(y^{(1)}, \dots, y^{(N)})$.

Since, in natural language processing, one should evaluate a large set of candidate strings, one must generalize the BLEU score to the case where one has a list of $M$ candidate strings (called a "corpus") $(\hat{y}^{(1)}, \dots, \hat{y}^{(M)})$ and, for each candidate string $\hat{y}^{(i)}$, a list of reference strings $(y^{(i,1)}, \dots, y^{(i,N_i)})$.

Define the modified n-gram precision function to be
$$p_n(\hat{S}; S) := \frac{\displaystyle\sum_{i=1}^{M} \sum_{s \in G_n(\hat{y}^{(i)})} \min\Bigl(C\bigl(s, \hat{y}^{(i)}\bigr),\ \max_{j} C\bigl(s, y^{(i,j)}\bigr)\Bigr)}{\displaystyle\sum_{i=1}^{M} \sum_{s \in G_n(\hat{y}^{(i)})} C\bigl(s, \hat{y}^{(i)}\bigr)},$$
where $\hat{S}$ is the candidate corpus, $S$ the corresponding lists of reference strings, $G_n(y)$ the set of distinct n-grams occurring in $y$, and $C(s, y)$ the number of times $s$ occurs as a contiguous substring of $y$.

To work up to this expression, start from the most obvious quantity: the raw count of n-grams of the candidate that also appear in a reference. This raw count cannot be used to compare between sentences, since it is not normalized: longer candidates yield larger counts regardless of translation quality. Dividing by the total number of n-grams in the candidate normalizes it, and clipping each n-gram's count at the maximum number of times it occurs in any single reference (the min/max in the numerator above) prevents a candidate from being rewarded for repeating an n-gram more often than the references warrant.

The modified n-gram precision still unduly gives a high score to candidate strings that are "telegraphic", that is, candidates that contain n-grams of the reference strings, but each as few times as possible; such a candidate can be very short and still achieve a perfect precision.
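For concreteness, here is a minimal Python sketch of the per-segment building block of this quantity (the function and variable names are illustrative, not from any standard library); the corpus-level $p_n$ sums the clipped and total counts over all segments before dividing:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Modified n-gram precision for a single candidate and its references.

    Each candidate n-gram count is clipped at the maximum number of times
    that n-gram occurs in any single reference (the min/max in the numerator).
    """
    cand_counts = ngram_counts(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# A "telegraphic" candidate: both of its unigrams occur in the reference,
# so its modified unigram precision is a perfect 1.0 despite being very short.
reference = "the cat is on the mat".split()
print(modified_precision("the cat".split(), [reference], 1))  # 1.0
```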

In order to punish candidate strings that are too short, define the brevity penalty to be
$$BP(\hat{S}; S) := \begin{cases} 1 & \text{if } c > r, \\ e^{1 - r/c} & \text{if } c \le r, \end{cases}$$
where $c$ is the total length of the candidate corpus and $r$ is the effective reference corpus length.

There is not a single definition of BLEU, but a whole family of them, parametrized by the weighting vector $w := (w_1, w_2, \dots)$, with $w_n \in [0, 1]$ and $\sum_n w_n = 1$. Given a choice of weights, the BLEU score of a candidate corpus $\hat{S}$ with reference sets $S$ is defined as
$$\mathrm{BLEU}_w(\hat{S}; S) := BP(\hat{S}; S) \cdot \exp\!\left(\sum_{n} w_n \ln p_n(\hat{S}; S)\right).$$
The most common choice is $w_n = 1/4$ for $n = 1, \dots, 4$ and zero otherwise.

In words, it is a weighted geometric mean of all the modified n-gram precisions, multiplied by the brevity penalty.

We use the weighted geometric mean, rather than the weighted arithmetic mean, to strongly favor candidate corpuses that are simultaneously good according to multiple n-gram precisions.
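Putting the pieces together, a corpus-level BLEU along the lines of the definition above might be sketched as follows (a simplified illustration with assumed helper names, not a reference implementation; the default weights correspond to the common uniform choice over n-gram lengths 1 to 4):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references_list, weights=(0.25, 0.25, 0.25, 0.25)):
    """Brevity penalty times the weighted geometric mean of modified precisions.

    candidates      -- list of token lists (one per segment)
    references_list -- list of lists of token lists (references per segment)
    """
    precisions = []
    for n, _ in enumerate(weights, start=1):
        clipped_total, count_total = 0, 0
        for cand, refs in zip(candidates, references_list):
            cand_counts = ngram_counts(cand, n)
            max_ref_counts = Counter()
            for ref in refs:
                for gram, c in ngram_counts(ref, n).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], c)
            clipped_total += sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
            count_total += sum(cand_counts.values())
        precisions.append(clipped_total / count_total if count_total else 0.0)

    # Brevity penalty: c is the candidate corpus length, r the effective
    # reference length (per segment, the reference closest in length).
    c = sum(len(cand) for cand in candidates)
    r = sum(min((len(ref) for ref in refs), key=lambda length: abs(length - len(cand)))
            for cand, refs in zip(candidates, references_list))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)

    if any(p == 0.0 for p in precisions):
        return 0.0  # the geometric mean is zero if any precision is zero
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
```

With the default weights, this corresponds to the widely reported BLEU-4 configuration (uniform weights over n-grams up to length four).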

A well-known example from the original BLEU paper illustrates why an unmodified unigram precision is inadequate: the candidate translation "the the the the the the the" is scored against the two references "the cat is on the mat" and "there is a cat on the mat". All seven candidate words appear in the references, so the plain unigram precision is
$$P = \frac{m}{w_t} = \frac{7}{7} = 1,$$
where $m$ is the number of words from the candidate that are found in the reference, and $w_t$ is the total number of words in the candidate.

This is a perfect score, despite the fact that the candidate translation above retains little of the content of either of the references.

The modification that BLEU makes is fairly straightforward.

For each word in the candidate translation, the algorithm takes its maximum total count, $m_{max}$, in any of the reference translations; in the example above, the word "the" appears twice in the first reference and once in the second, so $m_{max} = 2$. The count of each word in the candidate is then clipped to this maximum ("the" is clipped from 7 to 2), and the clipped counts are summed over all distinct words in the candidate.

This sum is then divided by the total number of unigrams in the candidate translation.
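A minimal sketch of this clipping step on the candidate and references from the example above (variable names are illustrative):

```python
from collections import Counter

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

cand_counts = Counter(candidate)                                   # {'the': 7}
max_ref = {w: max(ref.count(w) for ref in references) for w in cand_counts}  # {'the': 2}
clipped = {w: min(c, max_ref[w]) for w, c in cand_counts.items()}  # {'the': 2}

print(sum(clipped.values()) / sum(cand_counts.values()))           # 2/7 ≈ 0.286
```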

In the above example, the modified unigram precision score would be:
$$P = \frac{2}{7}.$$

In practice, however, using individual words as the unit of comparison is not optimal.

Instead, BLEU computes the same modified precision metric using n-grams.
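Reusing the modified_precision sketch from earlier (an assumed helper, not a library function), the degenerate candidate above still scores on unigrams but collapses as soon as bigrams are compared:

```python
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

print(modified_precision(candidate, references, 1))  # 2/7 ≈ 0.286
print(modified_precision(candidate, references, 2))  # 0.0: the bigram "the the" occurs in no reference
```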

The length which has the "highest correlation with monolingual human judgements"[6] was found to be four.

The unigram scores are found to account for the adequacy of the translation, how much information is retained.

The longer n-gram scores account for the fluency of the translation, or to what extent it reads like "good English".

To produce a score for the whole corpus, the per-segment modified precision scores are combined using the geometric mean, multiplied by a brevity penalty that prevents very short candidates from receiving too high a score; when there are multiple reference sentences, the effective reference length for each segment is taken from the reference whose length is closest to the candidate's. (However, in the version of the metric used by NIST evaluations prior to 2009, the shortest reference sentence had been used instead.)

iBLEU is an interactive version of BLEU that allows a user to visually examine the BLEU scores obtained by the candidate translations.[9]

BLEU has frequently been reported as correlating well with human judgement,[10][11][12] and remains a benchmark for the assessment of any new evaluation metric.

It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries.[13]

Although designed to be used with several reference translations, in practice BLEU is often used with only a single one.[2]

BLEU is infamously dependent on the tokenization technique, and scores obtained with different tokenizations are not comparable (which is often overlooked); in order to improve reproducibility and comparability, the SacreBLEU variant was designed.
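As an illustration, here is a minimal usage sketch of the sacrebleu package's corpus_bleu entry point (the exact API may differ slightly between versions):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
# One inner list per reference stream, each aligned with the list of hypotheses.
references = [["the cat is on the mat"],
              ["there is a cat on the mat"]]

# sacrebleu applies its own standardized tokenization internally, so the
# reported score is comparable across systems and papers.
result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)
```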