Common Crawl

Researchers in other countries have used techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in their own legal jurisdictions.

The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.

[8] Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.

[11] In 2013, Common Crawl began using the Apache Software Foundation's Nutch web crawler instead of a custom crawler.

[16] In collaboration with SURFsara, Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in Benelux.