Hamshahri Corpus

The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran.

It was initially collected and compiled by Ehsan Darrudi at the DBRG[1] of the University of Tehran.

Later, a team headed by Abolfazl AleAhmad[2] built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages to produce a standard text corpus for modern information retrieval experiments.
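The HTML-processing step can be illustrated with a minimal sketch. It is not the pipeline the corpus authors used, which the text does not describe; it only shows one common way to reduce a crawled page to plain text using Python's standard library. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping script and style contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(page):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()
```

A real news crawl would also need to drop navigation menus, advertisements, and other page furniture, which this sketch does not attempt.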

It offers several new features and improvements, and it is available for download in XML format.
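As a rough illustration of working with an XML-formatted collection, the sketch below streams documents from a file with Python's standard library. The element names `DOC`, `DOCID`, and `TEXT`, the sample data, and the function `iter_documents` are assumptions made for the example; the actual schema should be checked against the distributed files.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def iter_documents(source, doc_tag="DOC", id_tag="DOCID", text_tag="TEXT"):
    """Yield (doc_id, text) pairs from an XML file or file-like object.

    The tag names are hypothetical defaults, not the confirmed corpus schema.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == doc_tag:
            doc_id = (elem.findtext(id_tag) or "").strip()
            text = (elem.findtext(text_tag) or "").strip()
            yield doc_id, text
            elem.clear()  # release the finished element to limit memory use


# Made-up sample in the assumed layout, for demonstration only.
SAMPLE = (
    "<CORPUS>"
    "<DOC><DOCID>d1</DOCID><TEXT>first article</TEXT></DOC>"
    "<DOC><DOCID>d2</DOCID><TEXT>second article</TEXT></DOC>"
    "</CORPUS>"
).encode("utf-8")

documents = list(iter_documents(BytesIO(SAMPLE)))
```

Streaming with `iterparse` avoids loading a whole file into memory, which matters for a large news collection.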
