Europarl Corpus

In its first release in 2001, it covered eleven official languages of the European Union (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish).

[1] With the political expansion of the EU the official languages of the ten new member states have been added to the corpus data.

This latest version includes 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

By measuring the output of the systems against the original corpus data for the target language the adequacy of the translation can be assessed.

Koehn uses the BLEU metric by Papineni et al. (2002) for this, which counts the coincidences of the two compared versions—SMT output and corpus data—and calculates a score on this basis.