Researchers say GPT-4.1, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3 can reproduce long excerpts from books they were trained on when strategically prompted
On Tuesday, researchers at Stanford and Yale revealed something that AI companies would prefer to keep hidden.
The Atlantic · Alex Reisner
Related Coverage
- Extracting books from production language models arXiv.org
- Researchers extract up to 96% of Harry Potter word-for-word from leading AI models The Decoder · Jonathan Kemper
- Boffins probe commercial AI models, find an entire Harry Potter book The Register · Thomas Claburn
- Researchers asked an AI to recall a famous book. It recited 95% of it, word for word. This isn't a party trick—it's a legal and ethical minefield. … Manasa Chappeti
- New Study Proves AI Models Can Regurgitate Entire Copyrighted Books WinBuzzer · Markus Kasanmascheff
- We extracted (parts of) 12 books in experiments with 4 frontier-lab, production LLMs. — We prompted the LLMs with a short prefix of a book and asked them to complete the rest. … A. Feder Cooper
- The risk of using AI copy isn't just brand dilution. It may well be legal liability. — We assume AI generates unique content … Mike O'Brien
- Extracting books from production language models (2026) Hacker News
Discussion
- David Brody (@dbrody) on bluesky: According to new research, “AI does not absorb information like a human mind does. Instead, it stores information and accesses it. — In fact, many AI developers use a more technically accurate term when talking about these models: lossy compression.
- Damon Beres (@damonberes.com) on bluesky: Big new piece: @alexreisner.bsky.social presents the most compelling evidence yet that generative AI directly stores and reproduces training material—it does not “learn,” not really. This could have substantial legal consequences for the tech industry.
- SE Gyges (@segyges) on bluesky: the interesting thing is that claude has such high memorization imho [embedded post]
- Jeremy Diamond (@dmnd.me) on bluesky: At least the water usage bullshit fools people who have no anchor for what qualifies as a lot of water — This is just constantly disproved by users' own experiences [embedded post]
- Megan Gray (@megangray) on bluesky: AI'S MEMORIZATION CRISIS — Large language models don't “learn”—they copy. And that could change everything for the tech industry. — www.theatlantic.com/technology/ ...
- Yy Ahn (@yyahn) on bluesky: “In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim ... Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.” — arxiv.org/abs/2601.02671
- r/books on reddit: Extracting books from production language models - Researchers were able to reproduce up to 96% of Harry Potter with commercial LLMs
- r/technology on reddit: Researchers extract up to 96% of Harry Potter word-for-word from leading AI models