Copyright Infringement and Generative AI: Training Data, Outputs, and Thaler

Generative artificial intelligence has produced a cluster of copyright cases that will shape intellectual property law for a generation. The claims divide naturally into two categories — training liability (can building an AI on copyrighted data constitute infringement?) and output liability (can an AI-generated work infringe a copyrighted original, and who is responsible?) — with an anterior question about human authorship running through both. Plaintiffs' counsel entering this space must understand the current state of each doctrinal track and the open questions courts have not yet resolved.

I. The Human-Authorship Requirement: Thaler v. Perlmutter

The predicate question — whether works generated by artificial intelligence without human creative input are protectable by copyright — has been answered definitively at the circuit level. In Thaler v. Perlmutter, No. 23-5233 (D.C. Cir. Mar. 18, 2025), the U.S. Court of Appeals for the District of Columbia Circuit unanimously affirmed the denial of copyright registration for an image autonomously generated by Dr. Stephen Thaler's AI system, the "Creativity Machine."

The D.C. Circuit held that the Copyright Act of 1976 "requires all eligible work to be authored in the first instance by a human being." Id. at 3–4. Writing for a unanimous panel, Judge Millett grounded the holding in textual analysis: multiple provisions of the Copyright Act presuppose a human author. The provisions on ownership, inheritance, duration measured by the author's lifespan, signature requirements, and the intent requirement for joint authorship make sense only if the "author" is a person. "Machines do not have property, traditional human lifespans, family members, domiciles, nationalities, mentes reae, or signatures." Id. at 14.

The D.C. Circuit affirmed the district court's 2023 decision, Thaler v. Perlmutter, 687 F. Supp. 3d 140 (D.D.C. 2023) (Howell, J.), which had characterized human authorship as copyright's "bedrock requirement."

What Thaler does not resolve. The court expressly declined to address whether works in which AI plays an assistive role (generated through significant human direction, selection, curation, and modification) can be copyrighted by the human author. The Copyright Office's March 2023 Registration Guidance, 88 Fed. Reg. 16,190 (Mar. 16, 2023), and subsequent guidance have addressed this question administratively: registration depends on whether and to what extent a human being made protectable creative choices. That line-drawing exercise is ongoing in the Copyright Office and will eventually reach the courts.

II. Training Data Liability: Andersen v. Stability AI

The largest and most consequential pending litigation concerns whether building generative AI models on copyrighted works — scraping and ingesting billions of images, books, or articles to train a model — constitutes direct copyright infringement.

In Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal.), a group of visual artists filed a putative class action against Stability AI (maker of Stable Diffusion), Midjourney, DeviantArt, and later Runway AI, alleging that their copyrighted artworks were incorporated into the training datasets without license or consent. After the court dismissed an initial complaint in part, plaintiffs filed an amended complaint, and on August 12, 2024, Judge William Orrick denied the defendants' motions to dismiss the direct copyright infringement and inducement claims.

Key holdings from the August 2024 order:

Direct infringement survived. The court allowed plaintiffs' "model theory" (the trained AI model itself constitutes an infringing copy because it embodies transformations of the plaintiffs' works) and "distribution theory" (distributing the model is equivalent to distributing copies of the works). Judge Orrick acknowledged the issue as "unsettled" but found the allegations plausible. He noted that "[t]hat these works may be contained in Stable Diffusion as algorithmic or mathematical representations — and are therefore fixed in a different medium than they may have originally been produced in — is not an impediment to the claim at this juncture."

Induced infringement against Stability survived. The court found it plausible that Stability promoted Stable Diffusion as a tool for infringement: the product was allegedly built using copyrighted materials, and end-user operation allegedly creates infringing copies.

DMCA Section 1202 claims dismissed with prejudice. Both Section 1202(a) (false copyright management information) and 1202(b) (removal of CMI) claims were dismissed. The court found that outputs were not "identical" to plaintiffs' works and that there was no actionable removal of CMI during the training process. This is a significant defense victory on the CMI front that limits DMCA-based theories in AI training cases.

The trial in Andersen is currently set to begin in September 2026. Discovery will illuminate the mechanics of how Stable Diffusion was built and whether plaintiffs' works are "contained" in the model in a legally cognizable sense.

III. Training Data Liability in Text Models: New York Times v. Microsoft

The most prominent AI copyright cases involve large language models trained on text. In New York Times Co. v. Microsoft Corp., No. 23-cv-11195 (S.D.N.Y.), filed December 2023, The New York Times alleged that OpenAI and Microsoft copied millions of articles to train GPT models. The lawsuit brought claims for direct copyright infringement, vicarious and contributory infringement, and DMCA violations.

In an April 4, 2025 decision on motions to dismiss, Judge Sidney Stein declined to dismiss the direct infringement claims as time-barred, holding that OpenAI had not met its burden of showing that The Times discovered the alleged infringement more than three years before filing. The court also denied the motions to dismiss the contributory infringement claims and allowed the trademark dilution claims to proceed. It dismissed the common law unfair competition by misappropriation claims with prejudice and dismissed certain DMCA § 1202(b)(1) claims.

The case is now in discovery, where a dispute over an order to preserve ChatGPT conversation logs may prove significant for AI discovery practice well beyond this litigation.

Parallel class actions filed by The Authors Guild and individual authors — Authors Guild v. OpenAI Inc., No. 1:23-cv-08292-SHS (S.D.N.Y.) — remain pending. These cases present the same fundamental training-data liability questions in the text context.

IV. Fair Use Analysis Applied to AI Training

The central defense in all training-data litigation is fair use under 17 U.S.C. § 107. The four statutory factors are:

Purpose and character of the use. AI developers argue that training is transformative: the model does not reproduce the original work in recognizable form but learns statistical patterns encoded as weights. They cite by analogy decisions finding transformative use in search indexing and text-data analysis, such as Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015). The Supreme Court's decision in Andy Warhol Foundation v. Goldsmith, 598 U.S. 508 (2023), significantly tightened the transformativeness inquiry, emphasizing whether the use has a commercial purpose and substitutes for the original. Post-Goldsmith, commercial AI training may be harder to defend as transformative.

Nature of the copyrighted work. Highly creative works — fine art, literary fiction, photography — are at the core of copyright protection. AI training datasets overwhelmingly consist of exactly this category.

Amount taken. Training typically involves ingesting entire works. Courts have held that wholesale copying weighs against fair use, and the model theory of liability in Andersen is partly premised on the argument that the entire copyrighted work is embedded in the trained model's weights.

Market effect. This factor is often decisive. AI companies argue their models do not substitute for the originals in any market. Plaintiffs argue that AI-generated outputs compete directly in the same creative markets, and that licensing works for AI training is itself a recognized market that defendants have usurped without payment. The NYT complaint documents outputs that closely reproduce Times articles, which supports a strong market-substitution argument.

V. Output Liability vs. Training Liability

Training liability (discussed above) focuses on the act of building the model. Output liability — whether AI-generated text or images infringe existing copyrights — presents different doctrinal questions.

Direct output infringement. Under Feist Publications v. Rural Telephone Service, 499 U.S. 340 (1991), a plaintiff must show ownership of a valid copyright and copying of protectable expression; courts test the second element by asking whether the AI output is substantially similar to the copyrighted original. Where outputs are near-verbatim reproductions of specific works (the "memorization" problem documented in the NYT complaint), the direct-infringement theory is strongest.

Who is the direct infringer? If the user prompts the AI to reproduce a copyrighted work, the user may be the direct infringer; the AI developer could be secondarily liable under contributory or vicarious theories. If the AI autonomously generates an infringing output without specific user prompting, the developer's liability turns on whether it designed the system to reproduce copyrighted content.

Secondary liability. Vicarious liability requires the ability to control the infringing activity and a direct financial benefit from it. Contributory liability requires knowledge of the infringement and material contribution to it. Both theories are being tested in the pending litigation.

The Thaler dimension. Because purely AI-generated outputs are not copyrightable (Thaler), a claim that a defendant's AI output infringes a plaintiff's copyright presupposes that the plaintiff's original work is the protected work, not the AI output itself. But the AI company's training data is the plaintiff's copyrighted work, and infringing outputs are the product of that training. This dual structure creates potentially strong liability for companies that trained on uncleared content.

VI. Practice Considerations for Plaintiffs' Counsel

Identify the theory. Training liability and output liability require different evidence. Training claims require technical discovery into what datasets were used, whether rights clearances were obtained, and how the model architecture encodes training data. Output claims require comparison analysis and, where outputs are unusually similar to specific works, expert testimony on memorization and extraction.
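For output claims, a first-pass comparison can be automated before an expert is retained. The Python sketch below is illustrative only: it assumes the model output and the original work are available as plain text, and the function names (word_ngrams, ngram_overlap, longest_shared_run) and the sample strings are hypothetical. It measures verbatim word overlap, which can help flag candidate memorization examples for review; it is not a test for substantial similarity and says nothing about fair use.

```python
from difflib import SequenceMatcher


def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(output: str, original: str, n: int = 5) -> float:
    """Fraction of the output's word n-grams that also appear in the original."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & word_ngrams(original, n)) / len(out_grams)


def longest_shared_run(output: str, original: str) -> str:
    """Longest run of consecutive words that appears verbatim in both texts."""
    a = output.lower().split()
    b = original.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


if __name__ == "__main__":
    # Hypothetical sample texts, not drawn from any case record.
    original = "the quick brown fox jumps over the lazy dog near the river bank"
    output = "a quick brown fox jumps over the lazy cat"
    print(ngram_overlap(output, original, n=5))
    print(longest_shared_run(output, original))
```

A high overlap score or a long shared run suggests an output worth preserving and showing to an expert. A low score does not rule out infringement, since paraphrase and non-literal copying are invisible to this kind of screen.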

Class certification. Training-data claims are structurally suited to class litigation: the same training dataset affects all artists in the same category, and the legal questions are common. Andersen is proceeding as a putative class. Class certification on damages — given the diversity of copyright holders and the difficulty of apportioning harm — will be genuinely contested.

Expert witnesses. AI copyright cases require expert witnesses with technical expertise in how large language models and diffusion models store and recall training data, as well as copyright experts who can address substantial similarity and fair use.

DMCA § 512 safe harbor. Platforms hosting AI-generated content may invoke the § 512 safe harbor for user-uploaded infringement. Whether AI companies qualify for § 512 as operators rather than uploaders is a contested threshold question.

VII. Where the Law Is Moving

Andy Warhol Foundation v. Goldsmith's restrictive reading of transformative use, applied to AI training, likely weakens the fair use defense. The Copyright Office's ongoing AI and Copyright reporting processes — including a Part II report issued in 2025 on copyrightability of AI-generated works — will shape both legislative and judicial responses. Congress is unlikely to legislate quickly, leaving courts to develop the doctrine through the pending cases in the Southern District of New York and the Northern District of California over the next several years.

Informational only. Not legal advice. No attorney-client relationship is created by reading this post. Consult a licensed attorney in your jurisdiction.