OpenAI in fair use dispute: How much transformation is in ChatGPT?

Developers of generative AI tools still operate in a copyright gray area when training their language models. However, their reliance on the fair use principle could carry weight, given previous legal precedents.


2023 was a breakthrough year for generative artificial intelligence (GenAI) in several respects. Investors worldwide injected over 22 billion dollars in growth capital into the technology, more than five times the previous year's figure, according to data service provider Dealroom. Significant milestones in regulation were also reached, such as the EU's agreement on the world's first comprehensive AI law.

Nonetheless, many fundamental questions remain unanswered, especially those regarding the copyright treatment of training data and the content generated by text, image, or video generators. Can these models simply be fed copyrighted works and thus prepared for commercial use? Or do they require a license?

This question naturally concerns those who derive their income from creating such works, including artists, photographers, authors, and journalists. The dispute culminated at the end of December in a sensational lawsuit filed by The New York Times against OpenAI, the developer of ChatGPT, and its main investor Microsoft. The newspaper alleges that the startup unlawfully used content from millions of articles to train the AI and thereby caused billions of dollars in damages to The Times.

Google Books as a precursor

In response, OpenAI invokes the fair use principle, a provision under US law that permits the use of copyrighted material in certain cases, including research and educational purposes as well as reporting and commentary. The doctrine seeks to balance the interests of authors against those of the public, a goal that has carried significant weight in previous legal rulings.

"US courts have long dealt with the incorporation of copyrighted content from the internet and have interpreted the fair use provision quite broadly. They may continue to do so in the future," says Alexander Duisberg, a legal expert in digitalization and data economy at the law firm Ashurst. He cites the case of Google vs. Authors Guild, in which the search engine giant and provider of the book search service Google Books prevailed in 2015 – ten years after the case began.

Alexander Duisberg is a partner at Ashurst LLP in the Digital Economy sector and one of the leading legal experts in the field of digitalization and data economy. He has frequently contributed to expert hearings held by the EU Commission regarding the EU Digital Strategy.

Google had digitized entire books for its service launched in 2002, tagged them with keywords, and provided users with short excerpts, known as snippets. The authors saw this as a "massive copyright infringement." But the courts viewed it as a "transformative use" that improved public access to knowledge.

"This argument could also be invoked in favor of OpenAI," states Duisberg. "However, it requires that ChatGPT does not simply reproduce protected original texts."

But that is precisely what The Times accuses the startup of doing. OpenAI's tools are capable of "generating output that faithfully reproduces the newspaper's content," The Times claims in the lawsuit. OpenAI seeks to downplay the accusation, stating that reproducing protected content is a "rare error" it is addressing. Moreover, OpenAI contends that The Times violated the usage terms in order to obtain lengthy excerpts of articles.

Other publishers could follow suit

Nevertheless, the pressure on OpenAI and its investors to address the issue of unintended replication is likely substantial. Microsoft alone has invested some 13 billion dollars in the startup so far. "If the fair use principle does not apply to the language models in the eyes of the courts, more publishing houses could follow The Times and enforce their copyrights," says Duisberg. In theory, this could eventually degrade the quality of ChatGPT's results – namely, if the language model ends up being trained almost exclusively on internet content that is purely synthetically generated and therefore not copyrighted.

Regardless of the outcome, Duisberg sees more opportunities than risks for media companies through generative AI. "Content generators significantly amplify information overload, and the proportion of synthetically generated content with unverified accuracy is increasing," notes the digital law expert. "As a result, the role of media in ensuring credibility alongside information curation is changing. This role of the trusted information mediator becomes even more crucial in the age of generative AI."