Tehran Monolingual Corpus (TMC)

TMC is suited to language modeling and related research areas in Natural Language Processing.

The quality of the Hamshahri corpus was improved for language modeling purposes through a series of tokenization and spell-checking steps.

TMC comprises more than 250 million words.

The corpus contains about 300 thousand unique words with a frequency of two or more, which is a relatively good vocabulary size for a highly inflectional language like Persian.
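
The vocabulary figure above counts distinct word types, not running words, and ignores words seen only once. A minimal sketch of how such a count could be computed, assuming the text is already tokenized on whitespace; the function name `vocabulary_size` and the toy sentence are illustrative assumptions and not part of TMC's actual tooling:

```python
from collections import Counter


def vocabulary_size(tokens, min_freq=2):
    """Count distinct word types that occur at least `min_freq` times.

    `tokens` is any iterable of already-tokenized words (an assumption;
    the tokenizer used to build TMC is not described here).
    """
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy Persian sentence, split on whitespace for illustration only.
sample = "کتاب خوب است و کتاب بد نیست".split()

total_tokens = len(sample)             # running words
frequent_types = vocabulary_size(sample)  # types seen two or more times
print(total_tokens, frequent_types)
```

Dropping words that occur only once is a common way to keep misspellings and other noise out of a vocabulary estimate, which may be why the figure is reported with that threshold.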

TMC was created by the Natural Language Processing Lab of the University of Tehran.