In April, authors and publishers protested against the use of copyrighted books for AI training
Vuk Valcic/Alamy Live News
Billions of dollars are at stake as courts in the US and UK weigh whether technology firms can lawfully train AI models on copyrighted books. Authors and publishers have filed numerous lawsuits, and new analysis shows that at least one AI model has not only been trained on popular texts but has also memorized portions of them verbatim.
The crux of the dispute is whether AI developers have the legal right to use copyrighted material without first obtaining permission. Previous research has shown that many of the large language models (LLMs) behind popular AI chatbots were trained on the “Books3” dataset. The models’ developers have argued they are not infringing copyright because the models generate new combinations of words rather than directly reproducing the copyrighted text.
However, a new study has tested various AI models to determine how much of their training text they can recall verbatim. While most models did not retain exact passages, one Meta model had memorized nearly the entire text of one book. Should the ruling go against the company, the researchers predict damages could exceed $1 billion.
“AI models are not simply ‘plagiarism machines’, as some suggest, but nor do they just capture general relationships among words,” explained Mark Lemley at Stanford University. “The differences in how models respond make it hard to establish universal legal standards.”
Lemley previously represented Meta in a copyright case involving generative AI, known as Kadrey v Meta Platforms. The plaintiffs, authors whose works were used to train Meta’s AI models, brought a class-action lawsuit against the tech giant for copyright infringement. The case is being heard in federal court in Northern California.
In January 2025, Lemley announced he had parted ways with Meta as a client, yet he remains convinced that the company has a strong chance of winning the lawsuit. Emile Vasquez, a Meta spokesperson, stated: “Fair use of copyrighted materials is crucial. We challenge the plaintiffs’ claims, and the full record presents a different narrative.”
In the new study, Lemley and his colleagues assessed the models’ memorization by splitting book excerpts into prefix and suffix segments and checking whether a model prompted with the prefix could reproduce the suffix. For instance, one excerpt from F. Scott Fitzgerald’s The Great Gatsby was divided into a prefix reading, “They were careless people, Tom and Daisy—they smashed up things and creatures and then retreated”, and a suffix that continued, “back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made”.
Researchers calculated the probability of each AI model completing the excerpt accurately and compared these probabilities against random chance.
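This kind of prefix-suffix probe can be sketched in a few lines of code. The example below is a minimal illustration rather than the team’s actual pipeline: it assumes a Hugging Face causal language model (gpt2 is used purely as a placeholder for models such as Llama), scores how likely the model is to continue a prefix with an exact suffix, and compares that score against a crude uniform-over-vocabulary chance baseline; the study’s real excerpt lengths, baselines and statistics are not reproduced here.

```python
# Minimal, illustrative sketch of a prefix/suffix memorization probe.
# Assumptions: "gpt2" stands in for the models tested in the study, and a
# public-domain line is used as the excerpt; the split point and chance
# baseline are simplifications, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_logprob(prefix: str, suffix: str) -> float:
    """Log-probability the model assigns to `suffix` given `prefix` (teacher forcing)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    suffix_ids = tokenizer(suffix, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1, so the slice
    # starting at the last prefix position scores each suffix token in turn.
    start = prefix_ids.shape[1] - 1
    preds = logits[0, start:start + suffix_ids.shape[1]]
    log_probs = torch.log_softmax(preds, dim=-1)
    token_lps = log_probs.gather(1, suffix_ids[0].unsqueeze(1)).squeeze(1)
    return token_lps.sum().item()

prefix = "It was the best of times, it was the worst of times,"
suffix = " it was the age of wisdom, it was the age of foolishness"
score = suffix_logprob(prefix, suffix)

# Crude chance baseline: every suffix token drawn uniformly from the vocabulary.
n_suffix_tokens = len(tokenizer(suffix).input_ids)
chance = n_suffix_tokens * torch.log(torch.tensor(1.0 / tokenizer.vocab_size)).item()
print(f"model log-prob: {score:.1f}  vs uniform-chance baseline: {chance:.1f}")
```

A model that has memorized a passage will assign the suffix a log-probability far above any plausible chance baseline, which is the signal the researchers looked for.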
The tested excerpts were drawn from 36 copyrighted works, including popular titles such as George RR Martin’s A Game of Thrones and Sheryl Sandberg’s Lean In, as well as books written by the plaintiffs in the Kadrey v Meta Platforms case.
The experiments covered 13 open-source AI models, including ones created by Meta, Google DeepMind, EleutherAI and Microsoft. Apart from Meta, most of the companies did not respond to requests for comment, while Microsoft declined to comment.
The analysis revealed that Meta’s Llama 3.1 70B model had memorized significant portions of JK Rowling’s first Harry Potter book, as well as The Great Gatsby and George Orwell’s 1984. The other models showed minimal recall of the texts, including those written by the plaintiffs. Meta declined to comment on these findings.
Researchers estimate that an AI model found to have infringed on merely 3% of the Books3 dataset could incur almost $1 billion in damages.
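For a sense of where a figure like that could come from, here is a rough, illustrative back-of-envelope calculation, not the researchers’ own: it assumes the commonly cited count of roughly 196,640 books in Books3 and the US statutory maximum of $150,000 per work for willful infringement.

```python
# Illustrative back-of-envelope only; the dataset size and per-work figure
# below are assumptions, not numbers taken from the study itself.
BOOKS3_SIZE = 196_640             # commonly cited number of books in Books3
INFRINGED_SHARE = 0.03            # 3% of the dataset
STATUTORY_MAX_PER_WORK = 150_000  # US statutory cap per work for willful infringement

works = round(BOOKS3_SIZE * INFRINGED_SHARE)
damages = works * STATUTORY_MAX_PER_WORK
print(f"{works:,} works x ${STATUTORY_MAX_PER_WORK:,} = ${damages:,}")
# 5,899 works x $150,000 = $884,850,000, i.e. close to $1 billion
```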
The technique has potential as a “forensic tool” for gauging how much an AI model has memorized, noted Randy McCarthy at the law firm Hall Estill in Oklahoma. But it does not settle whether companies are legally permitted to train AI models on copyrighted works under US “fair use” provisions.
McCarthy points out that AI firms generally do use copyrighted material for training. “The real question is whether they had the right to do so,” he remarked.
Meanwhile, in the UK, evidence of memorization is crucial from a copyright perspective, according to Robert Lands at the law firm Howard Kennedy in London. UK copyright legislation relies on “fair dealing”, which allows far narrower exceptions to copyright infringement than the US fair use doctrine, so he suggests that AI models found to have retained pirated content would be unlikely to satisfy the exception.
Topics:
- artificial intelligence
- Law
Source: www.newscientist.com