6 Prin. L.J.F. ____

Who Owns the Future? The Copyright Clash Shaping Generative AI

Cheick Sy


VOLUME 6

ISSUE 1

Fall 2025

In the last year alone, OpenAI, Meta, and other AI developers have been hit with a cascade of copyright lawsuits brought by plaintiffs ranging from The New York Times to prominent novelists and stock-photo companies. These cases have rapidly become a legal battleground for generative AI, raising urgent questions about whether tech companies can continue relying on traditional legal shields, namely the fair use doctrine and Section 230 of the Communications Decency Act, to justify training models on massive collections of copyrighted text, images, and audio scraped from the open web. At stake is not only the future of AI innovation but also the economic survival of news outlets, authors, artists, and other creators whose work fuels these AI models. As courts begin chipping away at the industry’s longstanding assumptions, copyright litigation is emerging as the unexpected mechanism through which the U.S. is developing a common-law regulatory framework for AI, one case at a time. The reality is that these tech companies can no longer rely on their traditional legal shields, because the current litigation reveals fundamental flaws in their application to generative AI: the “fair use” defense is increasingly strained under the weight of unprecedented commercial market harm, and Section 230 was never designed to immunize a company’s own systematic copying in creating its core product. This article substantiates that claim by analyzing recent federal decisions, including Authors Guild v. Google, Andy Warhol Foundation v. Goldsmith, and Thomson Reuters v. Ross Intelligence Inc., as well as ongoing litigation such as The New York Times v. OpenAI, to demonstrate how courts are applying the four-factor fair use test and limiting Section 230 in ways that increasingly constrain unlicensed AI training.

Generative AI models such as GPT-4 are trained on vast datasets consisting of text and images, many of which originate from copyrighted sources. Training typically involves making multiple copies of works so the model can analyze linguistic or visual patterns. These “intermediate copies” are not distributed to the public, but they are essential to producing the final model. Plaintiffs argue that this constitutes wholesale, unauthorized reproduction to create a product that may compete directly with their own work.

The legal framework for this battle rests on three pillars. First, copyright law grants creators exclusive rights to reproduce, distribute, and prepare derivative works based on their original creations. Second, the fair use doctrine, codified in 17 U.S.C. § 107, permits certain unauthorized uses of copyrighted material. In determining whether an unauthorized use is fair, courts evaluate four factors: (1) the purpose and character of the use (including whether it is “transformative”), (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used, and (4) the effect on the potential market for or value of the original. Third, Section 230 of the Communications Decency Act has historically immunized “interactive computer services” from liability for content posted by their users. Fair use has long flexibly accommodated technological innovations like search engine indexing. The central question now is whether training generative AI is analogous to these earlier technologies or represents a fundamentally distinct use that demands different treatment.

AI companies rely heavily on courts adopting a broad notion of “transformative use,” arguing that they repurpose copyrighted materials to extract unprotectable patterns rather than to reproduce expressive content. This logic succeeded in Authors Guild v. Google, where Google Books’ creation of a searchable index was deemed transformative fair use because it served a research purpose without substituting for the books themselves. Generative AI, however, presents a critical distinction: while search engines direct users to original sources, AI models can output content that directly competes with those sources. A Meta model was famously shown to reproduce 42% of a Harry Potter book, and AI tools can generate detailed article summaries that eliminate the need to visit the original publisher’s site. At first glance, this concern invites comparisons to longstanding summary services such as CliffsNotes, news digests, or book reviews, which have generally coexisted with copyright law. But generative AI departs from these analogues in several critical respects. Traditional summaries are selective, human-authored, and often licensed or clearly transformative in purpose, typically directing readers back to the original work. Generative AI systems, by contrast, are trained through the wholesale copying of entire works rather than limited excerpts, and they produce personalized, on-demand outputs at massive scale. This combination of scale, automation, and expressive mimicry fundamentally alters the market-effect analysis, transforming what might otherwise be permissible commentary into a potentially substitutive product.

This capacity for market substitution undermines the core rationale of the Google Books precedent. Courts evaluating fair use do not require plaintiffs to prove completed economic devastation; rather, they assess whether the challenged use threatens a traditional or reasonably foreseeable licensing market. In The New York Times v. OpenAI, the court has permitted expansive discovery into training data and AI outputs, signaling that claims of market substitution and lost licensing opportunities are legally cognizable rather than speculative. The relevant harm inquiry thus focuses not only on whether AI outputs currently replace original works but also on whether unlicensed training forecloses emerging markets for authorized AI use. This is precisely the type of market that copyright law is designed to protect.

The Supreme Court’s decision in Andy Warhol Foundation v. Goldsmith has become an anchor for challenging broad AI fair-use claims. The Court ruled that Warhol’s commercial licensing of a silkscreen based on a photographer’s portrait was not fair use, despite its different artistic style, because it served the same “commercial function” as the original: licensing for magazine illustrations. This reasoning directly implicates generative AI: even if an AI’s output is stylistically different, if it serves the same commercial function as the training material (e.g., providing news summaries, creating illustrations, or generating prose), its “transformative” character diminishes. When this diminished transformativeness is combined with the other fair use factors, namely the highly expressive nature of the copied novels and journalism (which courts traditionally afford stronger protection under the second factor), the wholesale copying of entire works, and the demonstrable threat to creators’ markets, the defense appears increasingly unstable.

Parallel weaknesses are emerging in attempts to invoke Section 230. Historically successful in protecting platforms like Facebook and YouTube from liability for user posts, Section 230 is a poor fit when applied to AI training. The statute’s core purpose was to encourage moderation by shielding platforms from liability for third-party content. AI training, however, is not user behavior; it is a foundational, proprietary process undertaken by the company itself. No user instructs OpenAI to ingest the New York Times archive; that is a corporate decision to build a product. Courts are thus likely to view the training process as the company’s own act of reproduction, not as the hosting of user-generated content. Even for outputs, Section 230 offers little refuge when an AI reproduces copyrighted material verbatim, as The New York Times demonstrated in its complaint against OpenAI. The defense collapses when the output is a direct, verifiable copy of protected input, proving that the model contains and can regurgitate the infringing training data.

The mounting legal pressure is already pushing AI companies toward negotiated licensing solutions. OpenAI’s licensing agreements with The Associated Press, Axel Springer, and several smaller publishers demonstrate that the industry is shifting away from indiscriminate web scraping toward structured data acquisition. This emerging ecosystem resembles a market-based regulatory framework: AI companies gain predictable, high-quality training data; publishers and creators receive compensation; and courts establish boundaries through case law rather than waiting on comprehensive federal legislation. Notably, Congress remains gridlocked on AI regulation, and sweeping statutory reform may be years away.

Recent court decisions confirm that fair use is far from a settled or categorical defense for AI training and, in some instances, has already failed. In Thomson Reuters v. Ross Intelligence Inc., a federal court rejected a fair use defense where an AI company trained its legal research tool on Westlaw’s proprietary headnotes to develop a competing product. Applying the four-factor test, the court emphasized the defendant’s commercial purpose and the direct substitutionary threat posed to Westlaw’s market, concluding that such use was not transformative and caused perceptible market harm. Although Ross Intelligence did not involve a large language model, its reasoning squarely applies to generative AI systems that ingest copyrighted works to create functionally equivalent products. Other courts have allowed fair use defenses to survive only narrowly and provisionally, underscoring that success hinges on the absence of proven market harm rather than on any judicial endorsement of AI training as inherently lawful. Litigation is thus producing immediate, tailored guidance, much as early internet law did. The emerging common law of AI will likely resemble the judicial development of privacy and intellectual-property norms in the early 2000s.

Opponents warn that strict copyright enforcement could stifle AI innovation by raising barriers to entry. This argument echoes past debates in which courts permitted intermediate copying for technologies like search engines and software reverse-engineering because those uses were transformative and non-substitutive. Generative AI is distinguishable precisely because its outputs can substitute for the originals in the marketplace. The law is not rejecting innovation but demanding that it proceed through licensed channels that respect creators’ economic interests. The emerging judicial consensus suggests a pragmatic path forward: AI development can continue, but not on the back of unlicensed, wholesale appropriation that undermines the very creative ecosystems it relies upon.

Copyright litigation has become the United States’ most immediate and effective tool for regulating generative AI. The tech industry’s reliance on expansive fair-use interpretations and outdated Section 230 theories is faltering under judicial scrutiny that recognizes the unprecedented commercial threat posed by unchecked data ingestion. The resulting common-law framework is neither anti-technology nor creator-hostile; rather, it reflects a faithful application of copyright’s core principles: preserving exclusive rights in expressive works while permitting innovation only where it satisfies the statutory fair use inquiry, with its emphasis on transformation and the absence of market substitution. The open question remains whether Congress will eventually codify these emerging principles or whether courts, through the deliberate pace of precedent, will remain the primary architects of the generative age. What is increasingly clear is that the era of assumed immunity for the mass ingestion of copyrighted works is drawing to a close, as courts demand concrete evidence of transformation and the absence of market harm before extending fair use protection.
