In April, authors and publishers protested against the use of copyrighted books for AI training
Vuk Valcic/Alamy Live News
Billions of dollars are at stake as courts in the US and UK weigh whether technology firms can lawfully train AI models on copyrighted books. Authors and publishers have filed numerous lawsuits, and new analysis shows that at least one AI model has not only been trained on popular texts but has also memorized portions of them verbatim.
The crux of the dispute is whether AI developers have the legal right to use copyrighted material without first obtaining permission. Previous research has shown that many of the large language models (LLMs) behind popular AI chatbots were trained on the “Books3” dataset. The models’ developers have argued they are not infringing copyright because the models generate new combinations of words rather than directly reproducing the copyrighted text.
However, a new study has tested various AI models to determine how much of their training text they can recall verbatim. While most models did not retain exact passages, one Meta model had memorized nearly the entire text of one book. Should the ruling go against the company, the researchers predict damages could exceed $1 billion.
“AI models are not simply ‘plagiarism machines’, as some suggest, but nor do they just capture general relationships among words,” explained Mark Lemley at Stanford University. “The differences in how models respond make it hard to establish universal legal standards.”
Lemley previously represented Meta in a copyright case involving generative AI, known as Kadrey v Meta Platforms. The plaintiffs, authors whose works were used to train Meta’s AI models, brought a class-action lawsuit against the tech giant for copyright infringement. The case is being heard in federal court in Northern California.
In January 2025, Lemley announced he had parted ways with Meta as a client, yet he remains convinced that the company has a strong chance of winning the lawsuit. Emile Vasquez, a Meta spokesperson, stated: “Fair use of copyrighted materials is crucial. We challenge the plaintiffs’ claims, and the full record presents a different narrative.”
In the new study, Lemley and his colleagues assessed the models’ memorization by splitting book excerpts into prefix and suffix segments and checking whether a model prompted with the prefix could reproduce the suffix. For instance, one excerpt from F. Scott Fitzgerald’s The Great Gatsby was divided into a prefix reading, “They were careless people, Tom and Daisy—they smashed up things and creatures and then retreated”, and a suffix that continued, “back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made”.
Researchers calculated the probability of each AI model completing the excerpt accurately and compared these probabilities against random chance.
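This kind of prefix-suffix probe can be sketched in a few lines of code. The example below is a minimal illustration rather than the team’s actual pipeline: it assumes a Hugging Face causal language model (gpt2 is used purely as a placeholder for models such as Llama), scores how likely the model is to continue a prefix with an exact suffix, and compares that score against a crude uniform-over-vocabulary chance baseline; the study’s real excerpt lengths, baselines and statistics are not reproduced here.

```python
# Minimal, illustrative sketch of a prefix/suffix memorization probe.
# Assumptions: "gpt2" stands in for the models tested in the study, and a
# public-domain line is used as the excerpt; the split point and chance
# baseline are simplifications, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_logprob(prefix: str, suffix: str) -> float:
    """Log-probability the model assigns to `suffix` given `prefix` (teacher forcing)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    suffix_ids = tokenizer(suffix, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1, so the slice
    # starting at the last prefix position scores each suffix token in turn.
    start = prefix_ids.shape[1] - 1
    preds = logits[0, start:start + suffix_ids.shape[1]]
    log_probs = torch.log_softmax(preds, dim=-1)
    token_lps = log_probs.gather(1, suffix_ids[0].unsqueeze(1)).squeeze(1)
    return token_lps.sum().item()

prefix = "It was the best of times, it was the worst of times,"
suffix = " it was the age of wisdom, it was the age of foolishness"
score = suffix_logprob(prefix, suffix)

# Crude chance baseline: every suffix token drawn uniformly from the vocabulary.
n_suffix_tokens = len(tokenizer(suffix).input_ids)
chance = n_suffix_tokens * torch.log(torch.tensor(1.0 / tokenizer.vocab_size)).item()
print(f"model log-prob: {score:.1f}  vs uniform-chance baseline: {chance:.1f}")
```

A model that has memorized a passage will assign the suffix a log-probability far above any plausible chance baseline, which is the signal the researchers looked for.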
The tested excerpts were drawn from 36 copyrighted works, including popular titles such as George RR Martin’s A Game of Thrones and Sheryl Sandberg’s Lean In, as well as books written by the plaintiffs in the Kadrey v Meta Platforms case.
The experiments covered 13 open-source AI models, including ones created by Meta, Google DeepMind, EleutherAI and Microsoft. Apart from Meta, most of the companies did not respond to requests for comment, while Microsoft declined to comment.
The analysis revealed that Meta’s Llama 3.1 70B model had memorized significant portions of JK Rowling’s first Harry Potter book, as well as The Great Gatsby and George Orwell’s 1984. The other models showed minimal recall of the texts, including those written by the plaintiffs. Meta declined to comment on these findings.
Researchers estimate that an AI model found to have infringed on merely 3% of the Books3 dataset could incur almost $1 billion in damages.
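For a sense of where a figure like that could come from, here is a rough, illustrative back-of-envelope calculation, not the researchers’ own: it assumes the commonly cited count of roughly 196,640 books in Books3 and the US statutory maximum of $150,000 per work for willful infringement.

```python
# Illustrative back-of-envelope only; the dataset size and per-work figure
# below are assumptions, not numbers taken from the study itself.
BOOKS3_SIZE = 196_640             # commonly cited number of books in Books3
INFRINGED_SHARE = 0.03            # 3% of the dataset
STATUTORY_MAX_PER_WORK = 150_000  # US statutory cap per work for willful infringement

works = round(BOOKS3_SIZE * INFRINGED_SHARE)
damages = works * STATUTORY_MAX_PER_WORK
print(f"{works:,} works x ${STATUTORY_MAX_PER_WORK:,} = ${damages:,}")
# 5,899 works x $150,000 = $884,850,000, i.e. close to $1 billion
```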
The technique has potential as a “forensic tool” for gauging how much an AI model has memorized, noted Randy McCarthy at the law firm Hall Estill in Oklahoma. But it does not settle whether companies are legally permitted to train AI models on copyrighted works under US “fair use” provisions.
McCarthy points out that AI firms generally do use copyrighted material for training. “The real question is whether they had the right to do so,” he remarked.
Meanwhile, in the UK, evidence of memorization is crucial from a copyright perspective, according to Robert Lands at the law firm Howard Kennedy in London. UK copyright legislation relies on “fair dealing”, which allows far narrower exceptions to copyright infringement than the US fair use doctrine, so he suggests that AI models found to have retained pirated content would be unlikely to satisfy the exception.
Topics:
- artificial intelligence
- Law
Source: www.newscientist.com