Reported results show a correlation of up to 0.964 with human judgement at the corpus level, compared with 0.817 for BLEU on the same data set.
At the sentence level, the maximum correlation with human judgement achieved was 0.403.
Precision and recall are combined using the harmonic mean, with recall weighted 9 times more than precision:
$F_{mean} = \frac{10PR}{R + 9P}$
The measures introduced so far account for congruity only with respect to single words, not with respect to larger segments that appear in both the reference and the candidate sentence.
In order to take these into account, longer n-gram matches are used to compute a penalty p for the alignment.
The more mappings there are that are not adjacent in the reference and the candidate sentence, the higher the penalty will be.
The longer the runs of adjacent mappings between the candidate and the reference, the fewer chunks (groups of mapped unigrams that are adjacent in both sentences) there are.
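The relationship between adjacency, chunks and the penalty can be illustrated with a minimal sketch. It assumes the alignment is given as a list of (candidate position, reference position) pairs, and it uses the weight 0.5 and exponent 3 from the original METEOR formulation, which this section does not state. The function names are hypothetical.

```python
def count_chunks(alignment):
    """Count maximal runs of mappings that are adjacent in both sentences.

    `alignment` is a list of (candidate_index, reference_index) pairs,
    one pair per mapped unigram (an assumed input format).
    """
    chunks = 0
    prev = None
    for cand_idx, ref_idx in sorted(alignment):
        # A new chunk starts whenever this mapping does not directly
        # follow the previous one in both the candidate and the reference.
        if prev is None or (cand_idx, ref_idx) != (prev[0] + 1, prev[1] + 1):
            chunks += 1
        prev = (cand_idx, ref_idx)
    return chunks


def fragmentation_penalty(alignment, weight=0.5, exponent=3):
    """Penalty p = weight * (chunks / mapped unigrams) ** exponent.

    The defaults are assumptions taken from the original METEOR paper.
    """
    matched = len(alignment)
    if matched == 0:
        # No mappings: the score is zero anyway, so the penalty is moot.
        return 0.0
    return weight * (count_chunks(alignment) / matched) ** exponent


# Fully ordered alignment: one chunk, small penalty.
in_order = [(0, 0), (1, 1), (2, 2)]
# Same words, but the first candidate word maps to the end of the reference.
shuffled = [(0, 2), (1, 0), (2, 1)]
print(count_chunks(in_order), fragmentation_penalty(in_order))
print(count_chunks(shuffled), fragmentation_penalty(shuffled))
```

With the same number of mapped words, the shuffled alignment splits into more chunks and so receives the larger penalty.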
To calculate a score over a whole corpus, or collection of segments, the aggregate values for P, R and p are taken and then combined using the same formula.
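One plausible reading of this aggregation is to sum the underlying counts over all segments and apply the segment-level formula once. The sketch below follows that reading. The `SegmentStats` container and function names are hypothetical, and the constants (recall weighted 9:1, penalty weight 0.5, exponent 3) are assumed from the original METEOR formulation.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Per-segment counts (hypothetical container for illustration)."""
    matched: int        # number of mapped unigrams
    candidate_len: int  # unigrams in the candidate
    reference_len: int  # unigrams in the reference
    chunks: int         # chunks in the alignment


def meteor_from_counts(matched, candidate_len, reference_len, chunks):
    """Combine counts into a score: F_mean * (1 - p)."""
    if matched == 0:
        return 0.0
    precision = matched / candidate_len
    recall = matched / reference_len
    # Harmonic mean with recall weighted 9 times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty; 0.5 and 3 are assumed constants.
    penalty = 0.5 * (chunks / matched) ** 3
    return f_mean * (1 - penalty)


def corpus_score(segments):
    """Aggregate counts over all segments, then apply the same formula."""
    return meteor_from_counts(
        matched=sum(s.matched for s in segments),
        candidate_len=sum(s.candidate_len for s in segments),
        reference_len=sum(s.reference_len for s in segments),
        chunks=sum(s.chunks for s in segments),
    )


segments = [SegmentStats(6, 6, 6, 1), SegmentStats(3, 6, 6, 1)]
print(corpus_score(segments))
```

Summing counts before combining weights longer segments more heavily than averaging per-segment scores would.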
When more than one reference translation is available, the algorithm compares the candidate against each of the references and selects the highest score.
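Selecting the best score over several references amounts to taking a maximum, as the following sketch shows. The scoring function here is a deliberately simplified stand-in (exact-match unigrams and the recall-weighted harmonic mean only, with no stemming, synonym matching or fragmentation penalty), and both function names are hypothetical.

```python
from collections import Counter


def unigram_fmean(candidate, reference):
    """Simplified stand-in scorer, not the full metric.

    Uses exact-match unigram overlap and the harmonic mean with
    recall weighted 9 times more than precision.
    """
    cand, ref = candidate.split(), reference.split()
    # Clipped overlap: each reference word can be matched at most once.
    matched = sum((Counter(cand) & Counter(ref)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


def best_reference_score(candidate, references, score_fn=unigram_fmean):
    """Score the candidate against every reference and keep the highest."""
    if not references:
        raise ValueError("at least one reference is required")
    return max(score_fn(candidate, ref) for ref in references)


refs = ["a dog ran", "the cat sat down"]
print(best_reference_score("the cat sat", refs))
```

Passing the scorer as a parameter keeps the selection logic independent of how an individual candidate-reference pair is scored.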