It was the main corpus used to train the initial GPT model by OpenAI,[2] and has been used as training data for other early large language models including Google's BERT.[3]
The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.[3]
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".[4]
An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[1]
Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.