For example, a score of 3 on the intelligibility scale was described as "Generally unintelligible; it tends to read like nonsense but, with a considerable amount of reflection and study, one can at least hypothesize the idea intended by the sentence".
The study concluded that "highly reliable assessments can be made of the quality of human and machine translations".
The evaluation programme involved testing several systems based on different theoretical approaches: statistical, rule-based and human-assisted.
It was decided that this was not adequate as a standalone method for comparing systems, and it was abandoned due to issues with the modification of meaning in the process of translating from English.
This had the advantage that the metric was "externally motivated",[3] since it was not developed specifically for machine translation.
However, the quality panel evaluation was very difficult to set up logistically, as it required bringing a number of experts together in one place for a week or more and then having them reach consensus.
This technique was found to cover the relevant parts of the quality panel evaluation while being easier to deploy, as it did not require expert judgment.
Measuring systems based on adequacy and fluency, along with informativeness, is now the standard methodology for the ARPA evaluation program.
While not widely reported, it has been noted that the genre, or domain, of a text has an effect on the correlation obtained when using metrics.
Another important factor in the usefulness of an evaluation metric is that it correlates well with human judgment even when working with small amounts of data, that is, few candidate sentences and reference translations.[6] Banerjee et al. (2005) highlight five attributes that a good automatic metric must possess: correlation, sensitivity, consistency, reliability and generality.
Finally, the metric must be general, that is, it should work across different text domains and in a wide range of scenarios and MT tasks.
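As an informal illustration of the correlation attribute, the following Python sketch computes the Pearson correlation between a set of metric scores and human judgments; the function name and the scores are hypothetical and invented for demonstration, not taken from any published evaluation.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length score lists,
    e.g. automatic metric scores and human judgments per system."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical system-level scores: metric scores vs. human adequacy ratings.
metric_scores = [0.31, 0.42, 0.27, 0.38]
human_scores  = [3.1, 3.9, 2.8, 3.6]
print(pearson_correlation(metric_scores, human_scores))  # close to 1.0
```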
The aim of this subsection is to give an overview of the state of the art in automatic metrics for evaluating machine translation.
The metric modifies simple precision since machine translation systems have been known to generate more words than appear in a reference text.
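As an informal sketch of this idea, the snippet below computes clipped ("modified") unigram precision: each candidate n-gram is only counted up to the maximum number of times it appears in any single reference, so repeating a common word cannot inflate the score. The function name and example sentences are illustrative and not taken from any particular BLEU implementation.

```python
from collections import Counter

def modified_precision(candidate_tokens, reference_tokens_list, n=1):
    """Clipped n-gram precision in the style of BLEU (sketch)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate_tokens)
    # For each n-gram, the most times it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in reference_tokens_list:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# A candidate that simply repeats a frequent word gets a low clipped precision:
print(modified_precision("the the the the".split(),
                         ["the cat is on the mat".split()]))  # 2/4 = 0.5
```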
No other machine translation metric has yet significantly outperformed BLEU with respect to correlation with human judgment across language pairs.
NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.
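The sketch below contrasts BLEU's brevity penalty, exp(1 − r/c) for candidates shorter than the reference, with a NIST-style penalty that is quadratic in the log of the length ratio and, as commonly described, calibrated so that a candidate two-thirds the reference length receives a penalty of 0.5. The exact constants and function names are assumptions made for illustration.

```python
import math

def bleu_brevity_penalty(cand_len, ref_len):
    """BLEU brevity penalty: 1 if the candidate is at least as long
    as the reference, otherwise exp(1 - r/c)."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

def nist_brevity_penalty(cand_len, ref_len):
    """NIST-style brevity penalty (sketch, assumed form):
    exp(beta * log^2(min(c/r, 1))), with beta chosen so the penalty
    is 0.5 when the candidate is 2/3 of the reference length."""
    beta = math.log(0.5) / math.log(2.0 / 3.0) ** 2
    ratio = min(cand_len / ref_len, 1.0)
    return math.exp(beta * math.log(ratio) ** 2)

# A candidate 5% shorter than the reference is penalised far less by the
# quadratic-in-log form than by BLEU's penalty:
print(bleu_brevity_penalty(95, 100))  # ~0.949
print(nist_brevity_penalty(95, 100))  # ~0.989
```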
The metric is based on the calculation of the number of words that differ between a piece of machine-translated text and a reference translation.
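A minimal sketch of such a word-level edit-distance (word error rate) computation is given below; the function name and example sentences are hypothetical.

```python
def word_error_rate(candidate, reference):
    """Word error rate (sketch): minimum number of word substitutions,
    insertions and deletions needed to turn the candidate into the
    reference, normalised by the reference length."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i in range(len(cand) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(cand) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if cand[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(cand)][len(ref)] / len(ref)

print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))  # 1/6 ≈ 0.17
```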
The experiments were conducted on eight language pairs from ACL-WMT2011, including English-to-other (Spanish, French, German and Czech) and the inverse, and showed that LEPOR yielded higher system-level correlation with human judgments than several existing metrics such as BLEU, Meteor-1.3, TER, AMBER and MP4IBM1.
The ACL-WMT13 Metrics shared task[15] results show that hLEPOR yields the highest Pearson correlation score with human judgment on the English-to-Russian language pair, in addition to the highest average score across five language pairs (English to German, French, Spanish, Czech and Russian).
For automatic evaluation, they also provided a clear classification of approaches, such as lexical similarity methods and the application of linguistic features, together with the subfields of these two aspects.
State-of-the-art overviews of both manual and automatic translation evaluation[20] have introduced recently developed translation quality assessment (TQA) methodologies, such as the use of crowd-sourced intelligence via Amazon Mechanical Turk, statistical significance testing, and the revisiting of traditional criteria with newly designed strategies, as well as the MT quality estimation (QE) shared tasks from the annual workshop on MT (WMT)[21] and the corresponding models that do not rely on human-provided reference translations.