In the 2020s, the rapid advancement of deep learning-based generative artificial intelligence models raised questions about whether copyright infringement occurs when such models are trained or used.
[1] Popular deep learning models are trained on massive amounts of media scraped from the Internet, often including copyrighted material.
This difference in approach can be seen in the recent decision on a copyright registration claim by Jason Matthew Allen for his work Théâtre D'opéra Spatial, created using Midjourney and an upscaling tool.
The U.S. Copyright Office has released new guidance emphasizing whether works, including materials generated by artificial intelligence, are the result of 'mechanical reproduction' or are the 'manifestation of the author's own creative conception'.
The U.S. Copyright Office published a rule in March 2023 addressing a range of issues related to the use of AI, in which it stated: "...because the Office receives roughly half a million applications for registration each year, it sees new trends in registration activity that may require modifying or expanding the information required to be disclosed on an application."
[13][14] In the subsequent rule-making, the USPTO allows human inventors to incorporate the output of artificial intelligence, as long as this method is appropriately documented in the patent application.
[8] Deep learning models source large datasets from the Internet, such as publicly available images and the text of web pages.
[27] IP scholars Bryan Casey and Mark Lemley argue in the Texas Law Review that datasets are so large that "there is no plausible option simply to license all [of the data...]."
[25] One of the earliest cases to challenge the nature of fair use for training AI was a lawsuit that Thomson Reuters brought against Ross Intelligence, first filed in 2020.
While Thomson Reuters' claims were initially denied by Judge Stephanos Bibas of the Third Circuit on the basis that the headnotes might not be copyrightable, Bibas reevaluated his decision in February 2025 and ruled in favor of Thomson Reuters, finding that the headnotes were copyrightable and that Ross Intelligence, which had shut down in 2021, had inappropriately used the material.
"[33] Indian copyright law provides fair use exceptions for scientific research, but lacks specific provisions for commercial AI training models.
Unlike the EU and UK, India has not established text and data mining (TDM) provisions that explicitly address commercial AI systems.
This regulatory uncertainty became apparent in 2024 when Asian News International (ANI) sued OpenAI for using its content to train AI models without authorization.
The case also highlighted jurisdictional challenges, as OpenAI argued it was not subject to Indian law because its servers and training operations were located outside the country.
[37] Memorization is the emergent phenomenon in which LLMs repeat long strings of training data verbatim, and it is no longer considered to be simply a result of overfitting.
[38] Evaluations of controlled LLM output (focused on GPT-2-series models) have measured the amount memorized from training data as variously over 1% for exact duplicates[39] or up to about 7%.
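To illustrate the underlying idea of such exact-duplicate measurements (this is a minimal conceptual sketch, not the methodology of the cited studies), the following Python code uses the Hugging Face transformers library to test whether a GPT-2-series model reproduces a document's continuation verbatim when prompted with its prefix; the model choice, function name, and token counts are illustrative assumptions.

# Sketch: probe verbatim memorization by prompting with a prefix from a
# candidate training document and checking whether greedy decoding
# reproduces the document's actual next tokens ("exact duplicate").
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # illustrative GPT-2-series model
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

def is_verbatim_memorized(document: str, prefix_tokens: int = 50,
                          continuation_tokens: int = 50) -> bool:
    """Return True if the model's greedy continuation of the prefix matches
    the document's real continuation token-for-token."""
    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_tokens + continuation_tokens:
        return False  # document too short to test
    prefix = ids[:prefix_tokens].unsqueeze(0)
    target = ids[prefix_tokens:prefix_tokens + continuation_tokens]
    output = model.generate(prefix,
                            max_new_tokens=continuation_tokens,
                            do_sample=False,  # greedy decoding
                            pad_token_id=tokenizer.eos_token_id)
    generated = output[0][prefix_tokens:prefix_tokens + continuation_tokens]
    return generated.tolist() == target.tolist()

Aggregating this check over many sampled training documents yields a memorization rate comparable in spirit to the percentages reported above.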
[41] As of August 2023, the developers of major consumer LLMs have attempted to mitigate these problems, but researchers have still been able to prompt leakage of copyrighted material.