Data Rights in the Age of Foundation Models

Contributor
Mar 13

Updated: Jun 22

Somewhere in the weights of the large language model you used today is a compressed representation of billions of pieces of human-created work. Blog posts, books, code repositories, forum discussions, academic papers, social media posts, news articles, photographs, illustrations, music. Most of the people who created that work didn't consent to its use. Many don't know it happened. The legal frameworks for addressing this are still being written, in courtrooms, legislatures, and international bodies, and the outcomes will shape the economics of creative and intellectual work for decades.

This isn't a simple good-vs-evil story. It's a genuine collision between how technology works, how copyright law was designed, and how the economics of knowledge creation function. Understanding the nuances matters.

How Foundation Models Train on Your Data

Let's start with the mechanics, because the details matter for understanding the legal and ethical questions.

Foundation models are trained on massive datasets assembled by crawling the open web and aggregating existing datasets. Common Crawl, a nonprofit that archives web pages, has been a significant source for many training datasets. Books3, a dataset of approximately 196,000 books scraped from shadow library sites, was used in training several major models. The LAION datasets, containing billions of image-text pairs scraped from the web, underpinned many image generation models.

The process works roughly like this: automated systems crawl the web and download content. That content is processed, filtered, and organized into training datasets. The datasets are used to train models through a process that exposes the model to the data repeatedly, adjusting billions of parameters until the model can predict and generate text, images, code, or other outputs that resemble the patterns in the training data.

The resulting model doesn't store copies of the training data in the way a database stores records. Instead, it captures statistical patterns, relationships, and structures from across the entire dataset. But this distinction, between "storing" and "learning from," is less clear-cut than it might seem. Models can and do reproduce near-verbatim passages from training data under certain conditions. They capture styles, techniques, and creative choices in ways that are meaningfully derivative even when they're not literal copies.
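One rough way to make "near-verbatim" concrete is to measure the longest run of consecutive words that a model output shares with a source text. The sketch below is only an illustration under simple assumptions (whitespace tokenization, case-insensitive matching); the function name is hypothetical, and this is not how any court or lab actually measures memorization.

```python
def longest_shared_run(source: str, output: str) -> int:
    """Return the length, in words, of the longest run of consecutive
    words that appears in both texts (case-insensitive)."""
    a = source.lower().split()
    b = output.lower().split()
    best = 0
    # prev[j] holds the length of the shared run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word pair
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    output = "a quick brown fox jumps high"
    print(longest_shared_run(source, output))
```

A long shared run suggests reproduction rather than paraphrase, but the threshold at which overlap becomes legally meaningful is exactly what the litigation is trying to settle.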

The Scale Problem

The scale of data involved makes traditional consent-based frameworks nearly impossible to apply retroactively. A single training dataset might contain content from hundreds of millions of distinct sources. Identifying all the creators, determining rights holders, and obtaining consent would be operationally impractical for datasets that have already been assembled.

This doesn't mean consent is impossible going forward. It means that any practical solution needs to address both the retrospective question (what about data already used?) and the prospective question (how should data be handled from now on?).

The Copyright Lawsuits

The legal landscape is being shaped by a wave of copyright litigation that's still working its way through the courts.

The Major Cases

The New York Times sued OpenAI and Microsoft, alleging that ChatGPT and Bing Chat were trained on millions of Times articles without permission, and that the systems can reproduce Times content in ways that substitute for the original. This case is significant because the Times is a well-resourced plaintiff with strong copyright claims and clear evidence of near-verbatim reproduction.

Authors including Sarah Silverman, Christopher Golden, and Richard Kadrey filed class action lawsuits against Meta and OpenAI over the use of their books in training data. Getty Images sued Stability AI over the use of millions of copyrighted images in training Stable Diffusion. Music publishers have filed suits over the use of lyrics and compositions in training data.

These cases raise overlapping but distinct legal questions. Is training a model on copyrighted material a "copy" in the copyright sense? If so, is it fair use? Does the output of the model infringe when it resembles the style or content of training data? What about when it reproduces content near-verbatim?

The Fair Use Question

In US law, fair use is the central legal question. Fair use is determined by four factors: the purpose and character of the use (including whether it's transformative), the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the original.

AI companies argue that training is transformative because the model doesn't copy the work; it learns abstract patterns that enable new creative output. They point to precedents like Google Books, where the courts found that digitizing books to create a search index was transformative fair use.

Opponents argue that the Google Books analogy fails because Google Books displayed only snippets and drove users to purchase the original works, while AI models can generate substitutes for the originals. They argue that the market impact is severe: why pay for a news article, a stock photo, or a commissioned illustration when an AI can generate a functionally equivalent output trained on exactly those works?

The courts haven't fully resolved this yet. Early rulings have gone in different directions on different questions. The most likely outcome is that some uses will be found fair and others won't, with the key distinctions being how directly the output competes with the training data and how closely it reproduces specific copyrighted elements.

International Variations

Copyright law varies significantly across jurisdictions, and the AI training question is being answered differently in different countries.

The EU's approach, through the Copyright Directive and the AI Act, requires that rights holders have the ability to opt out of text and data mining. If a rights holder expressly reserves their rights (through robots.txt, metadata, or other machine-readable signals), AI companies must respect that reservation. This creates a notice-and-opt-out framework rather than requiring affirmative consent.

Japan has taken a more permissive approach, with provisions that allow copyrighted works to be used for machine learning. The UK considered and then stepped back from a similarly broad exception for text and data mining.

These variations create jurisdictional complexity for global AI companies and global creators alike.

The Opt-Out Movement

In response to widespread scraping, a growing movement of creators and publishers has attempted to opt out of AI training data.

Technical Mechanisms

The primary technical mechanism is robots.txt, a file that websites can use to instruct web crawlers not to access certain content. Many major publishers and platforms have updated their robots.txt files to block known AI training crawlers.

The limitations are significant. Robots.txt is advisory, not enforceable at a technical level: a crawler whose operator chooses not to honor it can simply ignore it. Content that was scraped before a robots.txt update remains in existing training datasets. And the proliferation of AI crawlers, some operated by companies that don't identify themselves clearly, makes it difficult to block all AI-related crawling.
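To see what the mechanism looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The rules are an illustrative example, not any particular publisher's file; GPTBot and CCBot are used as examples of crawler user-agent tokens, and the `allowed` helper and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that blocks two AI-related crawlers
# and leaves the site open to everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    for agent in ("GPTBot", "CCBot", "ExampleBrowser"):
        print(agent, allowed(ROBOTS_TXT, agent, url))
```

Note where this check runs: inside the crawler. The website only publishes the rules, and nothing stops a crawler from skipping the check entirely, which is why the mechanism is advisory.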

Some platforms have introduced specific opt-out settings. DeviantArt added a "noai" tag. Various stock photo services updated their terms of service. Social media platforms have added settings related to AI training use of user content. But these mechanisms are inconsistent, sometimes buried in settings, and their legal enforceability varies.
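As an illustration of how a page-level signal like this could be read by a crawler, the sketch below looks for opt-out directives in a page's robots meta tag. It assumes the tag is expressed as `noai` or `noimageai` values inside `<meta name="robots" content="...">`; that encoding, along with the class and function names, is an assumption for the example rather than a description of any platform's exact implementation.

```python
from html.parser import HTMLParser

# Directive names assumed for this sketch
AI_OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, noai"
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def ai_opt_out(html: str) -> bool:
    """Return True if the page declares one of the assumed opt-out directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & AI_OPT_OUT_DIRECTIVES)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    print(ai_opt_out(page))
```

As with robots.txt, the signal only matters if the party collecting the data chooses to look for it and honor it.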

The Consent Problem

The fundamental challenge is that the web was built on an implicit social contract: content posted publicly is accessible for humans to read. AI training stretches that social contract in ways the people who published under it never anticipated. A photographer who posted their work on a personal website intended it to be viewed by visitors, not ingested by a training pipeline that would enable a system to generate competing images in their style.

Retroactive opt-out doesn't address the core issue, which is that consent should be prospective, not reactive. By the time a creator learns their work was used in training, the training has already happened. The statistical patterns derived from their work are embedded in model weights that can't be selectively erased (at least not with current technology).

GDPR Implications

For European creators and data subjects, GDPR adds another layer of complexity.

GDPR requires a legal basis for processing personal data. AI training almost always involves personal data, because training datasets contain names, biographical information, photos of identifiable people, and personal communications. A company that trains on such data therefore needs to identify a valid legal basis.

The two most commonly cited bases are legitimate interest and consent. Consent is difficult at the scale of web scraping. Legitimate interest requires a balancing test between the company's interest in training the model and the data subject's rights and expectations.

Several European data protection authorities have investigated AI training data practices. The Italian data protection authority temporarily banned ChatGPT in 2023 over GDPR concerns, lifting the ban after OpenAI implemented certain changes. Other authorities have ongoing investigations.

The GDPR right to erasure creates a particularly thorny challenge. If a data subject requests deletion of their personal data, and that data was used in model training, does the company need to retrain the model without that data? Current technical limitations make targeted removal from model weights extremely difficult. "Machine unlearning" is an active area of research, but practical solutions at scale don't yet exist.

The interaction between GDPR and AI training remains largely unresolved, with guidance from data protection authorities still evolving and court cases still pending.

Creator Perspectives

The debate about AI training data is often framed as "creators vs. technology," but the creator community is not monolithic.

Some creators are fundamentally opposed to any use of their work in AI training without explicit consent and compensation. They view AI-generated content as derivative of their labor, created without permission, and competing directly with their livelihoods.

Other creators take a more nuanced position. They may be comfortable with AI training in some contexts (research, education, non-commercial use) but not others (commercial products that directly compete with their work). They want a licensing framework, not a ban.

Some creators actively embrace AI tools as part of their creative process, using AI-generated elements as starting points, references, or components of hybrid works. For these creators, restrictive training data rules could limit the tools available to them.

The economic dimension is real regardless of philosophical position. Illustrators report losing commission work to AI-generated images. Stock photography revenues have declined. Freelance writers face competition from AI-generated content. These impacts are concentrated on working creators rather than the major rights holders who have the resources to litigate.

What a Workable Framework Might Look Like

No framework will make everyone happy, but a workable approach probably includes several elements.

Prospective Licensing

Going forward, AI training on copyrighted material should operate through licensing frameworks that provide compensation to creators. Several models have been proposed, from individual licensing agreements to collective licensing schemes similar to those used in music.

The practical challenge is transaction costs. Individual licensing at the scale of foundation model training is infeasible. Collective licensing, through organizations analogous to ASCAP or BMI for music, is more practical but requires building new institutional infrastructure.

Some AI companies have started licensing deals with major publishers and content providers. These deals are a step in the right direction but tend to favor large, well-resourced rights holders over individual creators.

Retrospective Compensation

For data already used in training, some form of retrospective compensation may be appropriate, though the mechanism is debated. Class action settlements, statutory damages, or industry-funded compensation pools have all been proposed.

Meaningful Opt-Out

Opt-out mechanisms need to be standardized, technically enforceable, and legally binding. The current patchwork of robots.txt, platform settings, and metadata tags is insufficient. A standardized protocol that AI companies are legally required to respect would be a significant improvement.

Attribution and Provenance

As AI systems become better at generating content that closely resembles specific sources, attribution becomes more important. Technical approaches to content provenance, including watermarking, content credentials, and training data attribution, are being developed but are not yet mature.

Proportional Regulation

Not all AI training raises the same concerns. A research project training a model on public domain texts is different from a commercial product training on copyrighted works to generate competing content. Regulatory frameworks should be proportional to the actual impact.

The Bigger Picture

The AI training data debate is really a debate about the economics of knowledge creation. Human-created content, the writing, art, code, music, photography, and journalism that make up training data, has enormous value. The question is whether AI companies can capture that value without compensating the people who created it.

If the answer is yes, if training on the entirety of human creative output without compensation is legally and practically normalized, the incentive to create that output diminishes. Why invest in original journalism if an AI can synthesize it? Why develop a distinctive artistic style if a model can replicate it on command?

If the answer is no, if creators have meaningful rights over the use of their work in AI training, then the AI industry needs to develop sustainable data supply chains that respect those rights while still enabling the technology to advance.

The courts, legislatures, and international bodies working on these questions right now are making decisions that will shape whether the knowledge economy has a viable future alongside AI, or whether the creation of the training data that AI depends on becomes economically unsustainable. The stakes are not theoretical. They're the livelihoods of millions of people who create the content that makes AI systems valuable.

ShiftQuality