TenTen Corpus Family

Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.

This development was linked to the emergence of corpus creation tools that help achieve larger size, wider coverage and cleaner data.

[5] In a later stage, these texts are cleaned: non-textual material such as navigation links, headers and footers is removed from the HTML source code of the web pages with the jusText tool,[6] so that only complete sentences are preserved.
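The idea of this cleaning step can be illustrated with a small sketch. The code below is not jusText and does not reproduce its algorithm; it is a simplified heuristic written only with the Python standard library. The names `BlockExtractor` and `keep_sentences`, the list of skipped tags and the `min_words` threshold are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed, illustrative tag sets; real boilerplate removal is far more subtle.
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}


class BlockExtractor(HTMLParser):
    """Collects the text of block-level elements outside skipped regions."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.blocks = []      # finished text blocks
        self.buffer = []      # text pieces of the block being read

    def flush(self):
        # Join the buffered pieces and normalise whitespace.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.flush()
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)


def keep_sentences(html, min_words=5):
    """Return blocks that look like complete sentences (hypothetical rule)."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [
        block for block in parser.blocks
        if len(block.split()) >= min_words and block[-1] in ".!?"
    ]


page = (
    "<html><body>"
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<p>This paragraph contains a complete sentence of running text.</p>"
    "<p>Read more</p>"
    "<footer>Copyright notice for the whole site.</footer>"
    "</body></html>"
)
cleaned = keep_sentences(page)
```

In this sketch the navigation links and the footer are dropped because of their enclosing tags, and the short "Read more" block is dropped because it does not look like a sentence.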

Finally, the ONION tool[6] is applied to remove duplicate text portions from the corpus, which occur naturally on the World Wide Web due to practices such as quoting, citing and copying.
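A minimal sketch of how duplicate passages might be detected is given below. It is an illustration under stated assumptions, not the ONION implementation: paragraphs are compared by overlapping word n-grams, and a paragraph is dropped when a large share of its n-grams has already been seen. The function names `shingles` and `deduplicate`, the n-gram length `n=5` and the `threshold=0.5` are hypothetical choices.

```python
def shingles(text, n=5):
    """Return the set of word n-grams of a text (lower-cased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph unless at least `threshold` of its n-grams were seen before."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        if grams:
            duplicate_ratio = len(grams & seen) / len(grams)
            if duplicate_ratio >= threshold:
                continue  # mostly repeated material, skip it
        # Paragraphs shorter than n words have no n-grams and are always kept.
        kept.append(paragraph)
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog today",
    "an entirely different paragraph with its own words here",
]
unique = deduplicate(paragraphs)
```

Here the second paragraph is an exact copy of the first and is removed, while the third shares no n-grams with earlier text and is kept. Lowering the threshold would also remove paragraphs that only partly quote earlier material.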

Metadata is contained in structural attributes that relate to individual documents and paragraphs in the corpus.