Document Type

Article

Publication Date

2024

Abstract

The rise in the popularity of consumer-facing generative artificial intelligence (GenAI) has created considerable confusion and consternation among some copyright owners. Some copyright holders consider the ability to automate the generation of original works based on user input to have been made possible by large-scale direct infringement by OpenAI, Microsoft, and other major GenAI developers. This article explores the application of copyright law to the training of OpenAI’s ChatGPT, specifically focusing on the legal issues surrounding the unauthorized use of copyrighted textual works in the GenAI training process.

The large language models (LLMs) that drive ChatGPT and similar GenAI can summarize written works, generate movie scripts, write poetry, and compose stories nearly instantaneously. LLMs can function in this way only because they are trained on vast, diverse datasets composed of billions of websites and expansive repositories of books. These datasets are analyzed to learn the function and syntax of language, allowing the LLMs to generate new works.

This article discusses the recent lawsuits launched by high-profile authors and copyright owners against OpenAI and Microsoft, claiming direct, vicarious, and derivative infringement. Authors such as George R.R. Martin, Sarah Silverman, and Christopher Golden, along with professional organizations such as the Authors Guild, contend that their works were infringed to turn OpenAI into an $80 billion company.

In considering the merits of these lawsuits, we discuss the curation and content of training datasets used in the known iterations of ChatGPT and characterize the protectability of the different works the datasets included. We then explore whether, given its transitory nature, OpenAI’s training process creates only acceptable, non-infringing copies, and how that would affect the outcome of an action for direct infringement.

The article then looks at the applicability of current fair use precedent to textual GenAI and the various types of works used in training datasets. To do so, we apply settled case law and leading decisions to OpenAI’s use of copyrighted works under each fair use factor: the purpose and character of the use, the nature of the original work, the amount and substantiality of the works used, and ChatGPT’s impact on the market value of the works. To draw analogies and comparisons to GenAI, we pay special attention to other innovative technologies that rely on a fair use defense.

Finally, this article considers the policy and legislation of other countries and their approach to ChatGPT and copyright. In doing so, we take policy considerations into account to argue that a finding of fair use is necessary to maintain international competitiveness and to prevent an erosion of fair use in sectors outside of GenAI. The article concludes that there is substantial support for arguments that GenAI training involves only transitory, non-actionable copying and that it is also permissible under fair use.
